[
  {
    "path": "README.md",
    "content": "# 0.前言\n\n决赛答辩已经过去一段时间了，我们队伍ac milan最终获得了复赛第3，决赛第4的成绩。在此首先感谢一些队友的carry～\n\n经过2个多月的比赛，学习收获了很多，也认识了很多大佬，在这里记录一下自己的参赛体验和学习收获。\n\n[github地址]: https://github.com/daniellibin/gaiic2021_track3_querySim\n[比赛地址]: https://tianchi.aliyun.com/competition/entrance/531851/introduction\n\n# 1.赛题背景\n\n小布助手是OPPO公司为欧加集团三品牌手机和IoT设备自研的语音助手，为用户提供了有趣、贴心、便捷的对话式服务。意图识别是对话系统中的一个核心任务，而对话短文本语义匹配是意图识别的主流算法方案之一。本赛题要求参赛队伍根据脱敏后的短文本query-pair，预测它们是否属于同一语义，提交的结果按照指定的评价指标使用在线评测数据进行评测和排名，得分最优者获胜。\n\n# 2.赛题描述及数据说明\n\n- ### 训练数据\n\n  训练数据包含输入query-pair，以及对应的真值。初赛训练样本10万，复赛训练样本30万，这份数据主要用于参赛队伍训练模型，为确保数据的高质量，每一个样本的真值都有进行人工标注校验。每行为一个训练样本，由query-pair和真值组成，每行格式如下：\n\n  - query-pair格式：query以中文为主，中间可能带有少量英文单词（如英文缩写、品牌词、设备型号等），采用UTF-8编码，未分词，两个query之间使用\\t分割。\n  - 真值：真值可为0或1，其中1代表query-pair语义相匹配，0则代表不匹配，真值与query-pair之间也用\\t分割。\n\n  ### 训练数据样本举例（空白间隔为\\t）：\n\n  ```\n  肖战的粉丝叫什么名字 肖战的粉丝叫什么 1\n  \n  王者荣耀里面打野谁最厉害 王者荣耀什么英雄最好玩 0\n  \n  我想换个手机 我要换手机 1\n  \n  我是张睿 我想张睿 0\n  \n  不想 不想说 0\n  ```\n\n  ### 测试数据\n\n  脱敏后的query-pair数据，初赛采用A/B榜的方式，A榜和B榜样本规模分别为2.5万，发布时间以赛制为准，初赛队伍根据初赛B榜排名择优进入复赛；复赛同样采用A/B榜的方式，样本规模5万（与初赛不重复），复赛队伍根据复赛B榜排名择优进入现场答辩。\n\n  ### 测试数据样本举例（空白间隔为\\t）\n\n  ```\n  肖战的粉丝叫什么名字 肖战的粉丝叫什么\n  \n  王者荣耀里面打野谁最厉害 王者荣耀什么英雄最好玩\n  \n  我想换个手机 我要换手机\n  \n  我是张睿 我想张睿\n  \n  不想 不想说\n  ```\n\n# 3.评估标准\n\n比赛的评估标准由性能标准和效果标准两部分组成，初赛采用效果标准，`AUC` 指标。\n\n# 4.整体设计\n\n![image-20210619163346172](README.assets/image-20210619163346172.png)\n\n## （1）预训练\n\n#### a.模型选取\n\n赛题所给数据经过了脱敏，相当于一种新的语言，无法直接利用开源的预训练模型进行迁移学习\n\n但是预训练依然很有必要，在有限的数据上，我们需要尽可能充分地利用其中的信息，Bert语言模型的MLM预训练任务可以利用无监督文本信息，学习文本表征、语言学知识和世界性知识\n\n我们选用的是Bert和其变种Nezha，二者主要区别在于绝对位置编码与相对位置编码\n\n考虑到后续的模型融合以及线上环境提供四卡，我们预训练了四个模型，参数量皆为1亿左右\n\n![image-20210619163530653](README.assets/image-20210619163530653.png)\n\n#### b.MASK策略\n\n模型输入为经典的拼接形式：[CLS] s1 [SEP] s2 [SEP] \n\n对偶：s1、s2以50%的概率交换位置，是对语义无损的数据增强方式\n\n长度自适应动态N-gram Mask策略\n- 动态Mask：预训练达到400 epoch，上百万次iter，可以每次迭代都随机生成新的mask文本，增强模型泛化能力\n- N-gram Mask：以15%的概率选中token，为增加训练难度，选中部分以70%、20%、10%的概率进行1-gram、2-gram、3-gram片段的mask（选中token使用[MASK]、随机词、自身替换的概率和原版Bert一致）\n- 长度自适应：考虑到对短文本进行过较长gram的mask对语义有较大破坏，长度小于7的文本不进行3-gram mask，小于4的文本不进行2-gram mask\n- 防止小概率的连续Mask：已经mask了的文本片段，强制跳过下一个token的mask，防止一长串连续的mask\n\n#### c.其他Trick与参数设置\n\n- 学习率warmup与衰减\n\n  - 预训练400 epoch ，前4.5个epoch，学习率从0线性增长到5e-5，之后线性衰减到1e-5\n\n- 分块shuffle\n\n  - 预训练周期长，优化时间性能非常重要，分块shuffle将长度差不多的样本组成batch快，块间shuffle，减少padding部分运算量，耗时减少了约40%，实测不会降低模型效果\n\n- 权重衰减\n\n  - 限制网络权值的大小，缓解过拟合现象\n\n- 四个模型通用参数设置\n\n  ![image-20210619170554408](README.assets/image-20210619170554408.png)\n\n## （2）微调\n\n#### a.模型参数\n\n- 预训练利用文本中的无监督信息，微调则需利用有监督的句子对匹配信息，将赛题任务建模为匹配与不匹配的二分类问题\n\n- 我们在4个预训练模型的基础上，训练了6个微调模型，从词表、截断长度和模型结构等维度保证模型之间的差异性，以便后序模型融合，参数设置对比如下：\n\n  ![image-20210619170702479](README.assets/image-20210619170702479.png)\n\n#### b.后接结构\n\n- Bert/Nezha后接的三种结构\n\n  ![image-20210619170927378](README.assets/image-20210619170927378.png)\n\n考虑到Bert已经具备强大的特征提取能力，以及运行和推理时限严格，所以其只后接了一些简单的结构。\n\n#### c.Trick\n\n- 学习率\n\n    - warmup与衰减：可以使得训练初期学习率较小，模型可以慢慢趋于稳定，待相对稳定后再以预先设置的学习率进行训练，使得模型收敛速度变得更快。后采用学习率衰减的方式使模型收敛到更佳的极值点，提升最终效果\n    - 不同模型采用不同的学习率（2e-5或4e-5）\n\n- 模型融合时先对logits加权平均，后softmax\n\n  - 使得softmax不再是每个模型独立进行，而是综合利用所有模型信息\n\n- 对抗训练\n\n  - 对抗训练是一种引入噪声的训练方式，可以对参数进行正则化，提升模型鲁棒性和泛化能力\n    Fast Gradient Method (FGM)：对embedding层在梯度方向添加扰动\n    Projected Gradient Descent (PGD) ：迭代扰动，每次扰动被投影到规定范围内\n    团队实验了FGM、PGD，前者速度快且效果更佳。\n\n  #### d.通用参数\n\n  最佳参数\n  - batch_size=32，预训练充分的情况下，微调收敛非常快，小bs带来更大的随机性，更不容易过早陷入局部最优\n  - epoch=3\n  - dropout=0.2，训练时以一定概率丢弃某些神经元，缓解过拟合\n  - FGM，epsilon=0.25时效果最佳\n\n## 
## (3) Model ensembling and inference\n\n![image-20210619171224369](README.assets/image-20210619171224369.png)\n\n## (4) Performance optimization\n\n#### a. Block-wise shuffle\n\n- The competition caps total online running time at 80 hours and the inference time for the 50k test set at 15 minutes (including network overhead), so performance optimization is critical\n\n  - Block-wise shuffle groups samples of similar length into batch blocks and shuffles across blocks, reducing the computation spent on padding; pre-training time drops by about 40%\n\n  - Online, one pre-training epoch ends up taking just over 9 minutes, so 400 epochs finish within 65 hours\n\n    ![image-20210619171438518](README.assets/image-20210619171438518.png)\n\n#### b. Faster inference\n\n- ONNX Runtime: ONNX Runtime is an inference engine for machine-learning models that uses built-in graph optimizations and various hardware-acceleration features to optimize and speed up prediction. A Transformer model such as BERT is a graph of many operators; ONNX Runtime's built-in graph optimization simplifies the graph and reduces the number of nodes, enabling more aggressive node fusion and layout optimization. Using ONNX Runtime gave the inference stage a very substantial speedup.\n\n![image-20210619171514789](README.assets/image-20210619171514789.png)\n\n#### c. Tuning the CUDA version\n\n- With the widely used cuda11 image we found the online V100 to be slow. From past project experience, older cards do not always perform best on newer CUDA versions, so we switched the image to cuda10.2, with cuDNN 7 and onnxruntime-gpu 1.5.1 to match. Inference became much faster, allowing us to run 6 models within the 15-minute limit (previously 4)\n\n![image-20210619171554475](README.assets/image-20210619171554475.png)\n\n#### d. Other details\n\n- Reduce host-to-GPU transfer overhead: instead of moving tensors from CPU memory to GPU memory with .to('cuda'), which adds communication overhead, create them on the GPU in the first place with torch.tensor(xxx, device='cuda')\n\n- Write a faster tokenization function: the data already separates tokens with spaces, so instead of feeding the whole string through a tokenize function, split on spaces and call convert_tokens_to_ids directly\n\n- ……\n\n# 5. Innovations and deployment\n\n#### a. Innovations\n\n- A length-adaptive dynamic n-gram masking strategy combined with pair swapping (duality)\n\n- Ensembling models with different vocabularies, truncation lengths and head structures to guarantee model diversity\n\n- Tricks such as learning-rate warmup and decay, weight decay and adversarial training\n\n- Performance optimization, including block-wise shuffle, ONNX Runtime, CUDA version tuning and other detail-level optimizations\n\n#### b. Deployment\n\n- Our models cast semantic matching as a classification problem, a highly general solution that can be applied to any NLP task involving sentence relations, such as open-domain intent recognition (this competition), query-query matching, query-answer matching and textual entailment\n\n- Inference is fast: excluding network overhead, the 6-model ensemble used in the competition (4 BERT, 2 NEZHA) reaches 77 QPS (AUC 0.9579), and by giving up less than one point of AUC, a single BERT reaches 595 QPS (AUC 0.948)\n\n- Real production environments are messier: short texts are more prone to missing context and relatively more sensitive to noise (a few characters mistyped or misrecognized by ASR can be a large fraction of a short text), so supporting techniques such as coreference resolution, text completion and text correction may be needed\n\n- Deep learning is not a silver bullet; in real deployments, continuous bad-case analysis plus a sensible amount of rule-based handling are needed to keep the system robust\n\n# 6. Summary\n\n- Overall\n  - We worked on four fronts: pre-training, fine-tuning, model ensembling and inference, making targeted improvements and innovations at each stage and backing them with performance optimization, which yielded a solid end-to-end solution that can be applied to any NLP task involving sentence relations, with good practicality and novelty.\n- Strengths, weaknesses and outlook\n  - Strengths: accurate, fast, and highly general\n  - Weaknesses: as an interaction-based (cross-encoder) model it must see the full sentence pair for every computation, so it is not suited to recalling candidates from massive text collections; it is better used to rerank a small recalled candidate set\n  - Outlook: on the research side, we should make the most of pre-trained models by designing more targeted and better-motivated pre-training tasks, and explore multi-turn matching that incorporates context and external knowledge. On the application side, we can keep improving the algorithm from bad cases and mining user needs, so that Xiaobu becomes a more knowledgeable, more fluent and more human-like assistant\n\n\n\n# 7. Solutions from other top teams\n\n# 1. AI小花\n\nhttps://github.com/nilboy/gaic_track3_pair_sim\n\n![](README.assets/image-20210619194942961.png)\n\n![image-20210619194951696](README.assets/image-20210619194951696.png)\n\n# 2. [none]\n\n![image-20210619200120075](README.assets/image-20210619200120075.png)\n\n![image-20210619200126641](README.assets/image-20210619200126641.png)\n\n![image-20210619200137247](README.assets/image-20210619200137247.png)\n\n# 3. 赛道3-白[MASK]\n\n![image-20210619204111156](README.assets/image-20210619204111156.png)\n\n![image-20210619204017855](README.assets/image-20210619204017855.png)\n\n![image-20210619204120283](README.assets/image-20210619204120283.png)\n\n![image-20210619204128548](README.assets/image-20210619204128548.png)\n\n# 4. 科讯嘉联灵珠团队\n\n![image-20210619205915821](README.assets/image-20210619205915821.png)\n\n![image-20210619210050654](README.assets/image-20210619210050654.png)\n\n# 5. LOL王者\n\n![image-20210619210251396](README.assets/image-20210619210251396.png)\n\n![image-20210619210301353](README.assets/image-20210619210301353.png)\n\n"
  },
  {
    "path": "code/.gitignore",
    "content": "bert-base-chinese/pytorch_model.bin\nnezha-cn-base/pytorch_model.bin\n.idea\n.DS_Store\n__pycache__\n"
  },
  {
    "path": "code/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
  {
    "path": "code/Dockerfile",
    "content": "# Base Images\n## 从天池基础镜像构建(from的base img 根据自己的需要更换，建议使用天池open list镜像链接：https://tianchi.aliyun.com/forum/postDetail?postId=67720)\n#FROM registry.cn-shanghai.aliyuncs.com/tcc-public/pytorch:1.6-cuda10.1-py3\nFROM registry.cn-shanghai.aliyuncs.com/xiaobu_match/match:cuda10.2base\n\n## 把当前文件夹里的文件构建到镜像的根目录下\nADD . /\n\n##安装依赖包,pip包请在requirements.txt添加\n#RUN apt-get update && apt-get install -y curl\n\n\n#RUN   pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple\n#pip install --no-cache-dir -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple transformers==4.2.0\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tqdm\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple flask\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple psutil\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple onnx\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple onnxruntime-gpu==1.7.0\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple sklearn\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple onnxruntime_tools\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple sympy\n#RUN    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple sentencepiece\n\n\n## 指定默认工作目录为根目录（需要把run.sh和生成的结果文件都放在该文件夹下，提交后才能运行）\nWORKDIR /\n\n\n## 镜像启动后统一执行 sh run.sh\nCMD [\"sh\", \"run.sh\"]\n"
  },
  {
    "path": "code/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.utils.checkpoint\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom transformers.activations import ACT2FN\nfrom transformers.file_utils import (\n    ModelOutput,\n    add_code_sample_docstrings,\n    add_start_docstrings,\n    add_start_docstrings_to_model_forward,\n    replace_return_docstrings,\n)\nfrom transformers.modeling_outputs import (\n    BaseModelOutputWithPastAndCrossAttentions,\n    BaseModelOutputWithPoolingAndCrossAttentions,\n    CausalLMOutputWithCrossAttentions,\n    MaskedLMOutput,\n    MultipleChoiceModelOutput,\n    NextSentencePredictorOutput,\n    QuestionAnsweringModelOutput,\n    SequenceClassifierOutput,\n    TokenClassifierOutput,\n)\nfrom transformers.modeling_utils import (\n    PreTrainedModel,\n    apply_chunking_to_forward,\n    find_pruneable_heads_and_indices,\n    prune_linear_layer,\n)\n\nfrom transformers.models.bert.configuration_bert import BertConfig\n\nimport logging\nlogger = logging.getLogger(__name__)\n\n_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n_CONFIG_FOR_DOC = \"BertConfig\"\n_TOKENIZER_FOR_DOC = \"BertTokenizer\"\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, 
scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=64):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    
my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.position_embedding_type = getattr(config, \"position_embedding_type\", \"absolute\")\n        if self.position_embedding_type == \"relative_key\" or self.position_embedding_type == \"relative_key_query\":\n            self.max_position_embeddings = config.max_position_embeddings\n            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)\n\n        self.is_decoder = config.is_decoder\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        is_cross_attention = encoder_hidden_states is not None\n\n        if is_cross_attention and past_key_value is not None:\n            # reuse k,v, cross_attentions\n            key_layer = past_key_value[0]\n            value_layer = past_key_value[1]\n            attention_mask = encoder_attention_mask\n        elif is_cross_attention:\n            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))\n            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))\n            attention_mask = encoder_attention_mask\n        elif past_key_value is not None:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)\n            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)\n        else:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n\n        if self.is_decoder:\n            # if 
cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.\n            # Further calls to cross_attention layer can then reuse all cross-attention\n            # key/value_states (first \"if\" case)\n            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of\n            # all previous decoder key/value_states. Further calls to uni-directional self-attention\n            # can concat previous decoder key/value_states to current projected key/value_states (third \"elif\" case)\n            # if encoder bi-directional self-attention `past_key_value` is always `None`\n            past_key_value = (key_layer, value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_kv.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in NeZhaModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_kv)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)\n\n        if 
self.is_decoder:\n            outputs = outputs + (past_key_value,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        heads, index = find_pruneable_heads_and_indices(\n            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads\n        )\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        self_outputs = self.self(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            encoder_hidden_states,\n            encoder_attention_mask,\n            past_key_value,\n            output_attentions,\n            relations_kv=relations_kv\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = 
self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        self.add_cross_attention = config.add_cross_attention\n        if self.add_cross_attention:\n            assert self.is_decoder, f\"{self} should be used as a decoder model if cross attention is added\"\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2\n        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None\n        self_attention_outputs = self.attention(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            output_attentions=output_attentions,\n            past_key_value=self_attn_past_key_value,\n            relations_kv=relations_kv\n        )\n        attention_output = self_attention_outputs[0]\n\n        # if decoder, the last output is tuple of self-attn cache\n        if self.is_decoder:\n            outputs = self_attention_outputs[1:-1]\n            present_key_value = self_attention_outputs[-1]\n        else:\n            outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        cross_attn_present_key_value = None\n        if self.is_decoder and encoder_hidden_states is not None:\n            assert hasattr(\n                self, \"crossattention\"\n            ), f\"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`\"\n\n            # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple\n            cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None\n            cross_attention_outputs = self.crossattention(\n                attention_output,\n                attention_mask,\n                head_mask,\n                encoder_hidden_states,\n                encoder_attention_mask,\n                cross_attn_past_key_value,\n                output_attentions,\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights\n\n            # add cross-attn cache to positions 3,4 of present_key_value tuple\n            cross_attn_present_key_value = cross_attention_outputs[-1]\n            present_key_value = present_key_value + cross_attn_present_key_value\n\n        layer_output = apply_chunking_to_forward(\n            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output\n        )\n        outputs = (layer_output,) + outputs\n\n        # if decoder, return the attn key/values as the last output\n        if self.is_decoder:\n            outputs = 
outputs + (present_key_value,)\n\n        return outputs\n\n    def feed_forward_chunk(self, attention_output):\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        return layer_output\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=int(config.hidden_size / config.num_attention_heads),\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=False,\n        output_hidden_states=False,\n        return_dict=False,\n    ):\n        to_seq_length=hidden_states.shape[1]\n        relations_kv = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        all_hidden_states = () if output_hidden_states else None\n        all_self_attentions = () if output_attentions else None\n        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None\n\n        next_decoder_cache = () if use_cache else None\n        for i, layer_module in enumerate(self.layer):\n            if output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_head_mask = head_mask[i] if head_mask is not None else None\n            past_key_value = past_key_values[i] if past_key_values is not None else None\n\n            if getattr(self.config, \"gradient_checkpointing\", False) and self.training:\n\n                if use_cache:\n                    logger.warn(\n                        \"`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. 
Setting \"\n                        \"`use_cache=False`...\"\n                    )\n                    use_cache = False\n\n                def create_custom_forward(module):\n                    def custom_forward(*inputs):\n                        return module(*inputs, past_key_value, output_attentions)\n\n                    return custom_forward\n\n                layer_outputs = torch.utils.checkpoint.checkpoint(\n                    create_custom_forward(layer_module),\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                )\n            else:\n                layer_outputs = layer_module(\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                    past_key_value,\n                    output_attentions,relations_kv=relations_kv\n                )\n\n            hidden_states = layer_outputs[0]\n            if use_cache:\n                next_decoder_cache += (layer_outputs[-1],)\n            if output_attentions:\n                all_self_attentions = all_self_attentions + (layer_outputs[1],)\n                if self.config.add_cross_attention:\n                    all_cross_attentions = all_cross_attentions + (layer_outputs[2],)\n\n        if output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        if not return_dict:\n            return tuple(\n                v\n                for v in [\n                    hidden_states,\n                    next_decoder_cache,\n                    all_hidden_states,\n                    all_self_attentions,\n                    all_cross_attentions,\n                ]\n                if v is not None\n            )\n        return BaseModelOutputWithPastAndCrossAttentions(\n            last_hidden_state=hidden_states,\n            past_key_values=next_decoder_cache,\n            hidden_states=all_hidden_states,\n            attentions=all_self_attentions,\n            cross_attentions=all_cross_attentions,\n        )\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n  
  def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\n@dataclass\nclass BertForPreTrainingOutput(ModelOutput):\n    \"\"\"\n    Output type of :class:`~transformers.BertForPreTraining`.\n\n    Args:\n        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction\n            (classification) loss.\n        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language 
modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation\n            before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,\n            sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: Optional[torch.FloatTensor] = None\n    prediction_logits: torch.FloatTensor = None\n    seq_relationship_logits: torch.FloatTensor = None\n    hidden_states: Optional[Tuple[torch.FloatTensor]] = None\n    attentions: Optional[Tuple[torch.FloatTensor]] = None\n\n\nBERT_START_DOCSTRING = r\"\"\"\n\n    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic\n    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,\n    pruning heads etc.)\n\n    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__\n    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to\n    general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the\n            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model\n            weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`~transformers.BertTokenizer`. See\n            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for\n            details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):\n            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Segment token indices to indicate first and second portions of the inputs. 
Indices are selected in ``[0,\n            1]``:\n\n            - 0 corresponds to a `sentence A` token,\n            - 1 corresponds to a `sentence B` token.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,\n            config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):\n            Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:\n\n            - 1 indicates the head is **not masked**,\n            - 0 indicates the head is **masked**.\n\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated\n            vectors than the model's internal embedding lookup matrix.\n        output_attentions (:obj:`bool`, `optional`):\n            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned\n            tensors for more detail.\n        output_hidden_states (:obj:`bool`, `optional`):\n            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for\n            more detail.\n        return_dict (:obj:`bool`, `optional`):\n            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of\n    cross-attention is added between the self-attention layers, following the architecture described in `Attention is\n    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,\n    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration\n    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`\n    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an\n    input to the forward pass.\n    \"\"\"\n\n    def __init__(self, config, add_pooling_layer=True):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n\n        self.pooler = BertPooler(config) if add_pooling_layer else None\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=BaseModelOutputWithPoolingAndCrossAttentions,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n        \"\"\"\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            batch_size, seq_length = input_shape\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size, seq_length = input_shape\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x 
seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n\n            token_type_ids=token_type_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next\n    sentence prediction (classification)` head.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        next_sentence_label=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):\n            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair\n            (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):\n            Used to hide legacy arguments that have been deprecated.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForPreTraining\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.prediction_logits\n            >>> seq_relationship_logits = outputs.seq_relationship_logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        total_loss = None\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n\n        if not return_dict:\n            output = (prediction_scores, seq_relationship_score) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return BertForPreTrainingOutput(\n            loss=total_loss,\n            prediction_logits=prediction_scores,\n            seq_relationship_logits=seq_relationship_score,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. 
\"\"\", BERT_START_DOCSTRING\n)\nclass BertLMHeadModel(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [ r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if not config.is_decoder:\n            logger.warning(\"If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\")\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in\n            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are\n            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')\n            >>> config = BertConfig.from_pretrained(\"bert-base-cased\")\n            >>> config.is_decoder = True\n            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        if labels is not None:\n            use_cache = False\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        lm_loss = None\n        if labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            labels = labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((lm_loss,) + output) if lm_loss is not None else output\n\n        return CausalLMOutputWithCrossAttentions(\n            loss=lm_loss,\n            logits=prediction_scores,\n            past_key_values=outputs.past_key_values,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            cross_attentions=outputs.cross_attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # cut decoder_input_ids if past is used\n        if past is not None:\n            input_ids = input_ids[:, -1:]\n\n        return {\"input_ids\": input_ids, \"attention_mask\": 
attention_mask, \"past_key_values\": past}\n\n    def _reorder_cache(self, past, beam_idx):\n        reordered_past = ()\n        for layer_past in past:\n            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)\n        return reordered_past\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `NeZhaForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MaskedLMOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        \"\"\"\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        masked_lm_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output\n\n        return MaskedLMOutput(\n            loss=masked_lm_loss,\n            logits=prediction_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        #  add a dummy token\n        assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n        attention_mask = torch.cat([attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1)\n        dummy_token = torch.full(\n            (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n        )\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n        **kwargs\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n            (see ``input_ids`` docstring). 
Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForNextSentencePrediction\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n            >>> prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n            >>> next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n            >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n\n            >>> outputs = model(**encoding, labels=torch.LongTensor([1]))\n            >>> logits = outputs.logits\n            >>> assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        if \"next_sentence_label\" in kwargs:\n            warnings.warn(\n                \"The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.\",\n                FutureWarning,\n            )\n            labels = kwargs.pop(\"next_sentence_label\")\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_scores = self.cls(pooled_output)\n\n        next_sentence_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1))\n\n        if not return_dict:\n            output = (seq_relationship_scores,) + outputs[2:]\n            return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output\n\n        return NextSentencePredictorOutput(\n            loss=next_sentence_loss,\n            logits=seq_relationship_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled\n    output) e.g. 
for GLUE tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=SequenceClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,\n            config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        loss = None\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequenceClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a\n    softmax) e.g. 
for RocStories/SWAG tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, num_choices, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MultipleChoiceModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,\n            num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See\n            :obj:`input_ids` above)\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        inputs_embeds = (\n            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))\n            if inputs_embeds is not None\n            else None\n        )\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n\n        if not return_dict:\n            output = (reshaped_logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MultipleChoiceModelOutput(\n            loss=loss,\n            logits=reshaped_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for\n    Named-Entity-Recognition (NER) tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=TokenClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -\n            1]``.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n       
 self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=QuestionAnsweringModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            \n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        total_loss = None\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n        if not return_dict:\n            output = (start_logits, end_logits) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return QuestionAnsweringModelOutput(\n  
          loss=total_loss,\n            start_logits=start_logits,\n            end_logits=end_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n"
  },
  {
    "path": "code/bert-base-chinese/config.json",
    "content": "{\n  \"architectures\": [\n    \"BertForMaskedLM\"\n  ],\n  \"attention_probs_dropout_prob\": 0.1,\n  \"directionality\": \"bidi\",\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,\n  \"hidden_size\": 768,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 3072,\n  \"layer_norm_eps\": 1e-12,\n  \"max_position_embeddings\": 512,\n  \"model_type\": \"bert\",\n  \"num_attention_heads\": 12,\n  \"num_hidden_layers\": 12,\n  \"pad_token_id\": 0,\n  \"pooler_fc_size\": 768,\n  \"pooler_num_attention_heads\": 12,\n  \"pooler_num_fc_layers\": 3,\n  \"pooler_size_per_head\": 128,\n  \"pooler_type\": \"first_token_transform\",\n  \"type_vocab_size\": 2,\n  \"vocab_size\": 21128\n}\n"
  },
  {
    "path": "code/bert-base-count3/finetuning/.ipynb_checkpoints/PyTorch_Bert-Squad_OnnxRuntime_GPU-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright (c) Microsoft Corporation. All rights reserved.  \\n\",\n    \"Licensed under the MIT License.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Inference PyTorch Bert Model with ONNX Runtime on GPU\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime and NVIDIA GPU. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.\\n\",\n    \"\\n\",\n    \"This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Prerequisites ##\\n\",\n    \"It requires your machine to have a GPU, and a python environment with [PyTorch](https://pytorch.org/) installed before running this notebook.\\n\",\n    \"\\n\",\n    \"#### GPU Environment Setup using AnaConda\\n\",\n    \"\\n\",\n    \"First, we install [AnaConda](https://www.anaconda.com/distribution/) in a target machine and open an AnaConda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 1.5.0 and OnnxRuntime 1.3.0.\\n\",\n    \"\\n\",\n    \"```console\\n\",\n    \"conda create -n gpu_env python=3.7\\n\",\n    \"conda activate gpu_env\\n\",\n    \"conda install pytorch torchvision cudatoolkit=10.1 -c pytorch\\n\",\n    \"conda install -c anaconda ipykernel\\n\",\n    \"conda install -c conda-forge ipywidgets\\n\",\n    \"python -m ipykernel install --user --name=gpu_env_py37\\n\",\n    \"jupyter notebook\\n\",\n    \"```\\n\",\n    \"Finally, launch Jupyter Notebook and you can choose gpu_env_py37 as kernel to run this notebook.\\n\",\n    \"\\n\",\n    \"Onnxruntime-gpu need specified version of CUDA and cuDNN. You can find the corresponding version in [requirements](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements). 
If the version is different from above cudatoolkit version, you have to install them separately, and add their bin directories to PATH environment variable (See [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARNING: Skipping onnxruntime-gpu as it is not installed.\\u001b[0m\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import sys\\n\",\n    \"!{sys.executable} -m pip uninstall --quiet --yes onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade transformers\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxconverter_common\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools\\n\",\n    \"!{sys.executable} -m pip install --quiet wget netron pandas\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Load Pretrained Bert model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We begin by downloading the SQuAD data file and store them in the specified location. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"cache_dir = \\\"./squad\\\"\\n\",\n    \"if not os.path.exists(cache_dir):\\n\",\n    \"    os.makedirs(cache_dir)\\n\",\n    \"\\n\",\n    \"predict_file_url = \\\"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\\\"\\n\",\n    \"predict_file = os.path.join(cache_dir, \\\"dev-v1.1.json\\\")\\n\",\n    \"if not os.path.exists(predict_file):\\n\",\n    \"    import wget\\n\",\n    \"    print(\\\"Start downloading predict file.\\\")\\n\",\n    \"    wget.download(predict_file_url, predict_file)\\n\",\n    \"    print(\\\"Predict file downloaded.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's first define some constant variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Whether allow overwriting existing ONNX model and download the latest script from GitHub\\n\",\n    \"enable_overwrite = True\\n\",\n    \"\\n\",\n    \"# Total samples to inference, so that we can get average latency\\n\",\n    \"total_samples = 1000\\n\",\n    \"\\n\",\n    \"# ONNX opset version\\n\",\n    \"opset_version=11\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Specify some model configuration variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# For fine-tuned large model, the model name is \\\"bert-large-uncased-whole-word-masking-finetuned-squad\\\". Here we use bert-base for demo.\\n\",\n    \"model_name_or_path = \\\"bert-base-cased\\\"\\n\",\n    \"max_seq_length = 128\\n\",\n    \"doc_stride = 128\\n\",\n    \"max_query_length = 64\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start to load model from pretrained. This step could take a few minutes. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 48/48 [00:04<00:00, 11.28it/s]\\n\",\n      \"convert squad examples to features: 100%|██████████| 1000/1000 [00:09<00:00, 102.15it/s]\\n\",\n      \"add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 161306.98it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# The following code is adapted from HuggingFace transformers\\n\",\n    \"# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\\n\",\n    \"\\n\",\n    \"from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"\\n\",\n    \"# Load pretrained model and tokenizer\\n\",\n    \"config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\\n\",\n    \"tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\\n\",\n    \"model = model_class.from_pretrained(model_name_or_path,\\n\",\n    \"                                    from_tf=False,\\n\",\n    \"                                    config=config,\\n\",\n    \"                                    cache_dir=cache_dir)\\n\",\n    \"# load some examples\\n\",\n    \"from transformers.data.processors.squad import SquadV1Processor\\n\",\n    \"\\n\",\n    \"processor = SquadV1Processor()\\n\",\n    \"examples = processor.get_dev_examples(None, filename=predict_file)\\n\",\n    \"\\n\",\n    \"from transformers import squad_convert_examples_to_features\\n\",\n    \"features, dataset = squad_convert_examples_to_features( \\n\",\n    \"            examples=examples[:total_samples], # convert enough examples for this notebook\\n\",\n    \"            tokenizer=tokenizer,\\n\",\n    \"            max_seq_length=max_seq_length,\\n\",\n    \"            doc_stride=doc_stride,\\n\",\n    \"            max_query_length=max_query_length,\\n\",\n    \"            is_training=False,\\n\",\n    \"            return_dataset='pt'\\n\",\n    \"        )\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. 
Export the loaded model ##\\n\",\n    \"Once the model is loaded, we can export the loaded PyTorch model to ONNX.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Model exported at  ./onnx/bert-base-cased-squad_opset11.onnx\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"output_dir = \\\"./onnx\\\"\\n\",\n    \"if not os.path.exists(output_dir):\\n\",\n    \"    os.makedirs(output_dir)   \\n\",\n    \"export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))\\n\",\n    \"\\n\",\n    \"import torch\\n\",\n    \"use_gpu = torch.cuda.is_available()\\n\",\n    \"device = torch.device(\\\"cuda\\\" if use_gpu else \\\"cpu\\\")\\n\",\n    \"\\n\",\n    \"# Get the first example data to run the model and export it to ONNX\\n\",\n    \"data = dataset[0]\\n\",\n    \"inputs = {\\n\",\n    \"    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"# Set model to inference mode, which is required before exporting the model because some operators behave differently in \\n\",\n    \"# inference and training mode.\\n\",\n    \"model.eval()\\n\",\n    \"model.to(device)\\n\",\n    \"\\n\",\n    \"if enable_overwrite or not os.path.exists(export_model_path):\\n\",\n    \"    with torch.no_grad():\\n\",\n    \"        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\\n\",\n    \"        torch.onnx.export(model,                                            # model being run\\n\",\n    \"                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\\n\",\n    \"                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\\n\",\n    \"                          opset_version=opset_version,                      # the ONNX version to export the model to\\n\",\n    \"                          do_constant_folding=True,                         # whether to execute constant folding for optimization\\n\",\n    \"                          input_names=['input_ids',                         # the model's input names\\n\",\n    \"                                       'input_mask', \\n\",\n    \"                                       'segment_ids'],\\n\",\n    \"                          output_names=['start', 'end'],                    # the model's output names\\n\",\n    \"                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\\n\",\n    \"                                        'input_mask' : symbolic_names,\\n\",\n    \"                                        'segment_ids' : symbolic_names,\\n\",\n    \"                                        'start' : symbolic_names,\\n\",\n    \"                                        'end' : symbolic_names})\\n\",\n    \"        print(\\\"Model exported at \\\", export_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. 
PyTorch Inference ##\\n\",\n    \"Use PyTorch to evaluate an example input for comparison purpose.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"PyTorch cuda Inference time = 16.57 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\\n\",\n    \"latency = []\\n\",\n    \"with torch.no_grad():\\n\",\n    \"    for i in range(total_samples):\\n\",\n    \"        data = dataset[i]\\n\",\n    \"        inputs = {\\n\",\n    \"            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"        }\\n\",\n    \"        start = time.time()\\n\",\n    \"        outputs = model(**inputs)\\n\",\n    \"        latency.append(time.time() - start)\\n\",\n    \"print(\\\"PyTorch {} Inference time = {} ms\\\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Inference ONNX Model with ONNX Runtime ##\\n\",\n    \"\\n\",\n    \"### CUDA and cuDNN Path\\n\",\n    \"onnxruntime-gpu has dependency on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn):\\n\",\n    \"\\n\",\n    \"* [onnxruntime-gpu v1.3.0](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"* [onnxruntime-gpu v1.2.0](https://github.com/microsoft/onnxruntime/releases/tag/v1.2.0) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"\\n\",\n    \"During installing PyTorch 1.5, we installed cudatoolkit 10.1.243 in this conda environment. That shall be good for onnxruntime-gpu 1.3.0 in Jupyter Notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Change to True when onnxruntime (like onnxruntime-gpu 1.0.0 ~ 1.1.2) cannot be imported.\\n\",\n    \"add_cuda_path = False\\n\",\n    \"\\n\",\n    \"if add_cuda_path:\\n\",\n    \"    # Add path of CUDA 10.0 and CUDNN 7.6 for onnxruntime-gpu 1.0.0 ~ 1.1.2\\n\",\n    \"    cuda_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    cudnn_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):\\n\",\n    \"        raise ValueError(\\\"Please specify correct path for CUDA and cuDNN. Otherwise onnxruntime cannot be imported.\\\")\\n\",\n    \"    else:\\n\",\n    \"        if cuda_dir == cudnn_dir:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + os.environ[\\\"PATH\\\"]\\n\",\n    \"        else:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + cudnn_dir + ';' + os.environ[\\\"PATH\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### OpenMP Environment Variable\\n\",\n    \"\\n\",\n    \"OpenMP environment variables are optional for GPU inference of standard Bert model. It has little performance impact on Bert model since most nodes are executed in GPU. 
\\n\",\n    \"\\n\",\n    \"You can find the best setting based on [Performance Test Tool](#Performance-Test-Tool) result in later part of this notebook.\\n\",\n    \"\\n\",\n    \"**Attention: Setting environment variables shall be done before importing onnxruntime**. Otherwise, they might not take effect.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Optional. You can change them according to Performance Test Tool result.\\n\",\n    \"#os.environ[\\\"OMP_NUM_THREADS\\\"] = '1'\\n\",\n    \"#os.environ[\\\"OMP_WAIT_POLICY\\\"] = 'PASSIVE'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are ready to inference the model with ONNX Runtime.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"OnnxRuntime gpu Inference time = 4.43 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import psutil\\n\",\n    \"import onnxruntime\\n\",\n    \"import numpy\\n\",\n    \"\\n\",\n    \"assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\\n\",\n    \"device_name = 'gpu'\\n\",\n    \"\\n\",\n    \"sess_options = onnxruntime.SessionOptions()\\n\",\n    \"\\n\",\n    \"# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\\n\",\n    \"# Note that this will increase session creation time so enable it for debugging only.\\n\",\n    \"sess_options.optimized_model_filepath = os.path.join(output_dir, \\\"optimized_model_{}.onnx\\\".format(device_name))\\n\",\n    \"\\n\",\n    \"# Please change the value according to best setting in Performance Test Tool result.\\n\",\n    \"sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)\\n\",\n    \"\\n\",\n    \"session = onnxruntime.InferenceSession(export_model_path, sess_options)\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # TODO: use IO Binding (see https://github.com/microsoft/onnxruntime/pull/4206) to improve performance.\\n\",\n    \"    ort_inputs = {\\n\",\n    \"        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    ort_outputs = session.run(None, ort_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"    \\n\",\n    \"print(\\\"OnnxRuntime {} Inference time = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare the output of PyTorch and ONNX Runtime. We can see some results are not close. It is because ONNX Runtime uses some approximation in CUDA optimization. 
Based on our evaluation on SQuAD data set, F1 score is on par for models before and after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Verifying correctness *****\\n\",\n      \"PyTorch and ONNX Runtime output 0 are close: True\\n\",\n      \"maximum_diff=9.499490261077881e-07 average_diff=1.4225952327251434e-07\\n\",\n      \"PyTorch and ONNX Runtime output 1 are close: True\\n\",\n      \"maximum_diff=6.92903995513916e-07 average_diff=1.2441887520253658e-07\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Verifying correctness *****\\\")\\n\",\n    \"for i in range(2):    \\n\",\n    \"    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))\\n\",\n    \"    diff = ort_outputs[i] - outputs[i].cpu().numpy()\\n\",\n    \"    max_diff = numpy.max(numpy.abs(diff))\\n\",\n    \"    avg_diff = numpy.average(numpy.abs(diff))\\n\",\n    \"    print(f'maximum_diff={max_diff} average_diff={avg_diff}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Inference with Actual Sequence Length\\n\",\n    \"Note that ONNX model is exported using dynamic length axis. It is recommended to use actual sequence input without padding instead of fixed length input for best performance. Let's see how it can be applied to this model.\\n\",\n    \"\\n\",\n    \"From an example input below, we can see zero padding at the end of each sequence.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'input_ids': tensor([[  101,  1293,  1242,  2557,  1127,  1226,  1104,  1103,  3613, 16429,\\n\",\n       \"           5235,   136,   102,  3613, 16429,  5988,   170,   107,  1353,  1671,\\n\",\n       \"           1992,  1342,   107,  5235,   117,  1107,  1134,  1473,  3683,  3538,\\n\",\n       \"           1125,   170,  1476,   118,  1248,  2595,  4086,  1714,  1104,  2965,\\n\",\n       \"          15897,  1104,  3613, 16429,   119,  1473,  3683,  3538,  3222,  1149,\\n\",\n       \"           2551,  1168, 23759,  1116,  1121,  1506,  1103, 10280,  2231,  1111,\\n\",\n       \"           1103,  1714, 16355,   119,   102,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0]],\\n\",\n       \"        device='cuda:0'),\\n\",\n       \" 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),\\n\",\n       \" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# An example input (we can see padding). From attention_mask, we can deduce the actual length.\\n\",\n    \"inputs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The original sequence length is 128. After removing paddings, the sequence length is reduced. Input with smaller sequence length need less computation, thus we can see there is improvement on inference latency. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Average length 101\\n\",\n      \"OnnxRuntime gpu Inference time with actual sequence length = 4.23 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import statistics\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"lengths = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.\\n\",\n    \"    actual_sequence_length = sum(data[1].numpy())\\n\",\n    \"    lengths.append(actual_sequence_length)\\n\",\n    \"    opt_inputs = {\\n\",\n    \"        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    opt_outputs = session.run(None, opt_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"print(\\\"Average length\\\", statistics.mean(lengths))\\n\",\n    \"print(\\\"OnnxRuntime {} Inference time with actual sequence length = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's compare the output and see whether the results are close.\\n\",\n    \"\\n\",\n    \"**Note**: Need end-to-end evaluation on performance and accuracy if you use this strategy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Comparing results with/without paddings *****\\n\",\n      \"Output 0 are close: True\\n\",\n      
\"Output 1 are close: True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Comparing results with/without paddings *****\\\")\\n\",\n    \"for i in range(2):\\n\",\n    \"    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Offline Optimization and Test Tools\\n\",\n    \"\\n\",\n    \"It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Transformer Optimizer\\n\",\n    \"\\n\",\n    \"Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\\n\",\n    \"* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \\n\",\n    \"* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\\n\",\n    \"* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\\n\",\n    \"\\n\",\n    \"We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\\n\",\n    \"\\n\",\n    \"In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\\n\",\n    \"\\n\",\n    \"It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\\n\",\n    \"\\n\",\n    \"Example Usage:\\n\",\n    \"```\\n\",\n    \"from onnxruntime_tools import optimizer\\n\",\n    \"optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\\n\",\n    \"optimized_model.save_model_to_file(optimized_model_path)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You can also use optimizer_cli like the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Float32 Model\\n\",\n    \"Let us optimize the ONNX model using the script. The first example will output model with float32 to store weights. 
This is the choice for most GPUs without Tensor Core.\\n\",\n    \"\\n\",\n    \"If your GPU (like V100 or T4) has Tensor Core, jump to [Float16 Model](#6.-Model-Optimization-with-Float16) section since that will give you better performance than Float32 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp32_model_path\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Optimized Graph\\n\",\n    \"We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.\\n\",\n    \"\\n\",\n    \"The graph is like the following:\\n\",\n    \"<img src='images/optimized_bert_gpu.png'>\\n\",\n    \"\\n\",\n    \"Sometime, optimized graph is slightly different. 
For example, FastGelu is replaced by BiasGelu for CPU inference; When the option --input_int32 is used, Cast nodes for inputs are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import netron\\n\",\n    \"\\n\",\n    \"# change it to True if want to view the optimized model in browser\\n\",\n    \"enable_netron = False\\n\",\n    \"if enable_netron:\\n\",\n    \"    # If you encounter error \\\"access a socket in a way forbidden by its access permissions\\\", install Netron as standalone application instead.\\n\",\n    \"    netron.start(optimized_fp32_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Performance Test Tool\\n\",\n    \"\\n\",\n    \"The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.\\n\",\n    \"\\n\",\n    \"Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.\\n\",\n    \"\\n\",\n    \"**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Attional Info](#7.-Additional-Info) for more info.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.92 ms, Throughput = 203.24 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.90 ms, Throughput = 203.88 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 5.07 ms, Throughput = 197.16 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.82 ms, Throughput = 207.33 QPS\\n\",\n      \"skip duplicated test: 
model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.93 ms, Throughput = 202.92 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.91 ms, Throughput = 203.55 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.88 ms, Throughput = 204.90 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's load the summary file and take a look. 
Note that blank value in OMP_NUM_THREADS or OMP_WAIT_POLICY means the environment variable does not exist.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.82</td>\\n\",\n       \"      <td>4.53</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>5.15</td>\\n\",\n       \"      <td>7.25</td>\\n\",\n       \"      <td>8.75</td>\\n\",\n       \"      <td>207.33</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.88</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.58</td>\\n\",\n       \"      <td>6.47</td>\\n\",\n       \"      <td>7.13</td>\\n\",\n       \"      <td>8.68</td>\\n\",\n       \"      <td>204.90</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.90</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>6.16</td>\\n\",\n       \"      <td>7.64</td>\\n\",\n       \"      <td>8.82</td>\\n\",\n       \"      <td>203.88</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>4.91</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.70</td>\\n\",\n       \"      <td>7.43</td>\\n\",\n       \"      <td>8.78</td>\\n\",\n      
 \"      <td>203.55</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4.92</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>4.60</td>\\n\",\n       \"      <td>6.50</td>\\n\",\n       \"      <td>7.82</td>\\n\",\n       \"      <td>8.90</td>\\n\",\n       \"      <td>203.24</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>4.93</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.57</td>\\n\",\n       \"      <td>8.80</td>\\n\",\n       \"      <td>202.92</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>5.07</td>\\n\",\n       \"      <td>4.56</td>\\n\",\n       \"      <td>4.61</td>\\n\",\n       \"      <td>7.19</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>9.01</td>\\n\",\n       \"      <td>197.16</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         4.82         4.53         4.57         5.15         7.25   \\n\",\n       \"1         4.88         4.54         4.58         6.47         7.13   \\n\",\n       \"2         4.90         4.54         4.57         6.16         7.64   \\n\",\n       \"3         4.91         4.55         4.59         6.70         7.43   \\n\",\n       \"4         4.92         4.57         4.60         6.50         7.82   \\n\",\n       \"5         4.93         4.55         4.59         6.66         7.57   \\n\",\n       \"6         5.07         4.56         4.61         7.19         8.11   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         8.75           207.33                     1              12   \\n\",\n       \"1         8.68           204.90                    12              12   \\n\",\n       \"2         8.82           203.88                     1              12   \\n\",\n       \"3         8.78           203.55                    12              12   \\n\",\n       \"4         8.90           203.24                     0                   \\n\",\n       \"5         8.80           202.92                    12               1   \\n\",\n       \"6         9.01           197.16                    12               1   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1         PASSIVE       None    True  \\n\",\n       \"2         PASSIVE       None    True  \\n\",\n       \"3     
     ACTIVE       None    True  \\n\",\n       \"4                       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6          ACTIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"From above result, we can see that latency is very close for different settings. The default setting (intra_op_num_threads=0, OMP_NUM_THREADS and OMP_WAIT_POLICY does not exist) performs the best. \\n\",\n    \"\\n\",\n    \"### Model Results Comparison Tool\\n\",\n    \"\\n\",\n    \"When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare the inference outputs of the original and optimized models. If outputs are all close, it is safe to use the optimized model.\\n\",\n    \"\\n\",\n    \"For GPU inference, the absolute or relative difference is larger than those numbers of CPU inference. Note that slight difference in output will not impact final result. We did end-to-end evaluation using SQuAD data set using a fine-tuned squad model, and F1 score is almost the same before/after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).\\r\\n\",\n      \"maximum absolute difference=1.9222497940063477e-06\\r\\n\",\n      \"maximum relative difference=0.05027933046221733\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!python -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Model Optimization with Float16\\n\",\n    \"\\n\",\n    \"The optimizer.py script have an option **--float16** to convert model to use float16 to store weights. After the conversion, it could be faster to run in GPU with tensor cores like V100 or T4.\\n\",\n    \"\\n\",\n    \"Let's run tools to measure the performance on V100. 
The results show significant performance improvement: latency is about 3.4 ms for float32 model, and 1.8 ms for float16 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp16_model_path --float16\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.90 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.12 ms, Throughput = 320.00 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.02 ms, Throughput = 
331.39 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 332.53 QPS\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 328.67 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.72 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 329.32 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      
<th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>5.08</td>\\n\",\n       \"      <td>7.16</td>\\n\",\n       \"      <td>332.53</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.88</td>\\n\",\n       \"      <td>4.52</td>\\n\",\n       \"      <td>7.05</td>\\n\",\n       \"      <td>331.90</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.78</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>5.01</td>\\n\",\n       \"      <td>7.02</td>\\n\",\n       \"      <td>331.72</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3.02</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.85</td>\\n\",\n       \"      <td>6.34</td>\\n\",\n       \"      <td>7.04</td>\\n\",\n       \"      <td>331.39</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.93</td>\\n\",\n       \"      <td>5.56</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>329.32</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>6.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>328.67</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>6</th>\\n\",\n       \"      <td>3.12</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.96</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.20</td>\\n\",\n       \"      <td>320.00</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.01         2.79         2.81         2.86         5.08   \\n\",\n       \"1         3.01         2.80         2.81         2.88         4.52   \\n\",\n       \"2         3.01         2.78         2.80         2.92         5.01   \\n\",\n       \"3         3.02         2.79         2.80         2.85         6.34   \\n\",\n       \"4         3.04         2.80         2.82         2.93         5.56   \\n\",\n       \"5         3.04         2.79         2.81         2.92         6.37   \\n\",\n       \"6         3.12         2.79         2.82         2.96         6.66   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         7.16           332.53                     1              12   \\n\",\n       \"1         7.05           331.90                     0                   \\n\",\n       \"2         7.02           331.72                    12              12   \\n\",\n       \"3         7.04           331.39                    12               1   \\n\",\n       \"4         7.08           329.32                    12              12   \\n\",\n       \"5         7.08           328.67                    12               1   \\n\",\n       \"6         7.20           320.00                     1              12   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1                       None    True  \\n\",\n       \"2          ACTIVE       None    True  \\n\",\n       \"3          ACTIVE       None    True  \\n\",\n       \"4         PASSIVE       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6         PASSIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Throughput Tuning\\n\",\n    \"\\n\",\n    \"Some application need best throughput under some constraint on latency. 
This can be done by testing performance of different batch sizes. The tool could help on this.\\n\",\n    \"\\n\",\n    \"Here is an example that check the performance of multiple batch sizes (1, 2, 4, 8, 16, 32 and 64) using default settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=32, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=32 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 16.17 ms, Throughput = 1979.41 QPS\\n\",\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.00 ms, Throughput = 333.83 QPS\\n\",\n      \"test setting TestSetting(batch_size=2, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=2 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.59 ms, Throughput = 557.32 QPS\\n\",\n      \"test setting TestSetting(batch_size=64, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=64 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=64,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 29.26 ms, Throughput = 2187.15 QPS\\n\",\n      \"test setting TestSetting(batch_size=4, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      
\"Generating 1000 samples for batch_size=4 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=4,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.32 ms, Throughput = 926.92 QPS\\n\",\n      \"test setting TestSetting(batch_size=8, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=8 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=8,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 6.32 ms, Throughput = 1266.63 QPS\\n\",\n      \"test setting TestSetting(batch_size=16, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=16 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=16,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 9.60 ms, Throughput = 1666.05 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 64 --sequence_length 128 --samples 1000 --test_times 1 --inclusive $THREAD_SETTING $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float16 model summary from ./onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      
<th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>batch_size</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.00</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>4.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>333.83</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.59</td>\\n\",\n       \"      <td>3.33</td>\\n\",\n       \"      <td>3.35</td>\\n\",\n       \"      <td>3.42</td>\\n\",\n       \"      <td>6.60</td>\\n\",\n       \"      <td>7.54</td>\\n\",\n       \"      <td>557.32</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.32</td>\\n\",\n       \"      <td>3.98</td>\\n\",\n       \"      <td>4.01</td>\\n\",\n       \"      <td>4.64</td>\\n\",\n       \"      <td>7.23</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>926.92</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>6.32</td>\\n\",\n       \"      <td>5.94</td>\\n\",\n       \"      <td>5.97</td>\\n\",\n       \"      <td>7.61</td>\\n\",\n       \"      <td>8.96</td>\\n\",\n       \"      <td>10.12</td>\\n\",\n       \"      <td>1266.63</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>9.60</td>\\n\",\n       \"      <td>9.22</td>\\n\",\n       \"      <td>9.25</td>\\n\",\n       \"      <td>11.32</td>\\n\",\n       \"      <td>12.33</td>\\n\",\n       \"      <td>13.34</td>\\n\",\n       \"      <td>1666.05</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>16.17</td>\\n\",\n       \"      <td>15.80</td>\\n\",\n       \"      <td>15.90</td>\\n\",\n       \"      <td>17.38</td>\\n\",\n       \"      <td>18.80</td>\\n\",\n       \"      <td>19.93</td>\\n\",\n       \"      <td>1979.41</td>\\n\",\n       \"      <td>32</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>29.26</td>\\n\",\n       \"      <td>28.89</td>\\n\",\n       \"      <td>29.01</td>\\n\",\n       \"      <td>30.63</td>\\n\",\n       \"      <td>32.53</td>\\n\",\n       \"      <td>33.28</td>\\n\",\n       \"      <td>2187.15</td>\\n\",\n       \"      <td>64</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.00         2.79         2.81         2.86         4.37   \\n\",\n       \"1         3.59         3.33         3.35         3.42         6.60   \\n\",\n       \"2         4.32         3.98         4.01         4.64         7.23   \\n\",\n       \"3         6.32         5.94         5.97         7.61         8.96   \\n\",\n       \"4         9.60         9.22         9.25        11.32        
12.33   \\n\",\n       \"5        16.17        15.80        15.90        17.38        18.80   \\n\",\n       \"6        29.26        28.89        29.01        30.63        32.53   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  batch_size  \\n\",\n       \"0         7.08           333.83           1  \\n\",\n       \"1         7.54           557.32           2  \\n\",\n       \"2         8.11           926.92           4  \\n\",\n       \"3        10.12          1266.63           8  \\n\",\n       \"4        13.34          1666.05          16  \\n\",\n       \"5        19.93          1979.41          32  \\n\",\n       \"6        33.28          2187.15          64  \"\n      ]\n     },\n     \"execution_count\": 26,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float16 model summary from\\\", latest_result_file)\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'test_cases', 'test_times', 'use_gpu', 'warmup', 'sequence_length']\\n\",\n    \"columns_to_remove.extend(['intra_op_num_threads', 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', 'contiguous'])\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Additional Info\\n\",\n    \"\\n\",\n    \"Note that running Jupyter Notebook has significant impact on performance result. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate performance numbers.\\n\",\n    \"\\n\",\n    \"We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it measure inference speed of OnnxRuntime.\\n\",\n    \"\\n\",\n    \"[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\\n\",\n    \"\\n\",\n    \"Here is the machine configuration that generated the above results. 
You might get slower or faster result according to your hardware.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\r\\n\",\n      \"  \\\"gpu\\\": {\\r\\n\",\n      \"    \\\"driver_version\\\": \\\"440.64.00\\\",\\r\\n\",\n      \"    \\\"devices\\\": [\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 14110883840,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      },\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 16932601856,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      }\\r\\n\",\n      \"    ]\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"cpu\\\": {\\r\\n\",\n      \"    \\\"brand\\\": \\\"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\\\",\\r\\n\",\n      \"    \\\"cores\\\": 12,\\r\\n\",\n      \"    \\\"logical_cores\\\": 12,\\r\\n\",\n      \"    \\\"hz\\\": \\\"2.5940 GHz\\\",\\r\\n\",\n      \"    \\\"l2_cache\\\": \\\"256 KB\\\",\\r\\n\",\n      \"    \\\"l3_cache\\\": \\\"35840 KB\\\",\\r\\n\",\n      \"    \\\"processor\\\": \\\"x86_64\\\"\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"memory\\\": {\\r\\n\",\n      \"    \\\"total\\\": 236645588992,\\r\\n\",\n      \"    \\\"available\\\": 222567559168\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"python\\\": \\\"3.7.7.final.0 (64 bit)\\\",\\r\\n\",\n      \"  \\\"os\\\": \\\"Linux-4.15.0-1089-azure-x86_64-with-debian-stretch-sid\\\",\\r\\n\",\n      \"  \\\"onnxruntime\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.3.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"pytorch\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.5.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"tensorflow\\\": null\\r\\n\",\n      \"}\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"PyCharm (ccks_ner-master)\",\n   \"language\": \"python\",\n   \"name\": \"pycharm-de4c0941\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "code/bert-base-count3/finetuning/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
  {
    "path": "code/bert-base-count3/finetuning/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/bert-base-count3/finetuning/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport logging\nimport torch\n\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_nezha import NeZhaConfig\nfrom transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward\nfrom transformers.modeling_utils import PreTrainedModel, prune_linear_layer\nfrom transformers.models.bert.modeling_bert import (\n    BertOutput,\n    BertPooler,\n    BertSelfOutput,\n    BertIntermediate,\n    BertOnlyMLMHead,\n    BertOnlyNSPHead,\n    BertPreTrainingHeads,\n    BERT_START_DOCSTRING,\n    BERT_INPUTS_DOCSTRING,\n)\n\nlogger = logging.getLogger(__name__)\n\n_CONFIG_FOR_DOC = \"NeZhaConfig\"\n_TOKENIZER_FOR_DOC = \"NeZhaTokenizer\"\n\nNEZHA_PRETRAINED_MODEL_ARCHIVE_LIST = []\nNEZHA_PRETRAINED_MODEL_ARCHIVE_MAP = {}\n\n\ndef load_tf_weights_in_nezha(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        # logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n                n in [\"adam_v\", \"adam_m\", \"lamb_m\", \"lamb_v\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\",\n                      \"global_step\", \"good_steps\", \"loss_scale\", 'bad_steps']\n                for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer 
= getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                    pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass NeZhaEmbeddings(nn.Module):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.use_relative_position = config.use_relative_position\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=127):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\n\nclass 
NeZhaSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=self.attention_head_size,\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n        relations_keys = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_keys.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = 
key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        relations_values = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_values)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass NeZhaAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = NeZhaSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = 
self.pruned_heads.union(heads)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass NeZhaLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = NeZhaAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = NeZhaAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([NeZhaLayer(config) for _ in range(config.num_hidden_layers)])\n\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass NeZhaPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n    config_class = NeZhaConfig\n    pretrained_model_archive_map = NEZHA_PRETRAINED_MODEL_ARCHIVE_MAP\n    load_tf_weights = load_tf_weights_in_nezha\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(NeZhaPreTrainedModel):\n    \"\"\"\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        self.embeddings = NeZhaEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n        self.pooler = BertPooler(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(\n            attention_mask, input_shape, self.device\n        )\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            
encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n                                                      1:\n                                                      ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForPreTraining(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n        # add hidden states and attention if they are here\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[2:]\n\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, seq_relationship_score, (hidden_states), 
(attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyMLMHead(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n            labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers import BertTokenizer, 
BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        masked_lm_labels = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass NeZhaForNextSentencePrediction(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        seq_relationship_scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.cls(pooled_output)\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n        
    loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForSequenceClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            position_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForMultipleChoice(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForTokenClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep 
active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForQuestionAnswering(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            position_ids=None,\n            start_positions=None,\n            end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the 
output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n"
  },
  {
    "path": "code/bert-base-count3/finetuning/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        if self.isDropout:\n            output = self.dropout(pooler_output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][:, 0]  # [CLS] vector of each layer\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        #self.bert_model = MODELS[config.model](config=self.bert_config)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = 
self.classifier(concat_out)\n        return logit\n\n\n"
  },
  {
    "path": "code/bert-base-count3/finetuning/models/gitkeep",
    "content": ""
  },
  {
    "path": "code/bert-base-count3/finetuning/multi_gpu_QA.py",
    "content": "from tqdm import tqdm, trange\nimport numpy as np\nimport pandas as pd\nimport logging\nimport torch\nimport random\nimport os\nfrom torch import nn, optim\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nfrom transformers.optimization import get_linear_schedule_with_warmup\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.metrics import mean_absolute_error, accuracy_score, f1_score, roc_auc_score\nfrom model import *\nfrom utils import *\nimport time\nimport logging\nlogging.basicConfig(level=logging.DEBUG, filename=\"train.log\",filemode='a')\n\n\nfrom NEZHA.modeling_nezha import *\n\nMODEL_CLASSES = {\n    'BertForClass': BertForClass,\n    'BertLastCls': BertLastCls,\n    'BertLastTwoCls': BertLastTwoCls,\n    'BertLastTwoClsPooler': BertLastTwoClsPooler,\n    'BertLastTwoEmbeddings': BertLastTwoEmbeddings,\n    'BertLastTwoEmbeddingsPooler': BertLastTwoEmbeddingsPooler,\n    'BertLastFourCls': BertLastFourCls,\n    'BertLastFourClsPooler': BertLastFourClsPooler,\n    'BertLastFourEmbeddings': BertLastFourEmbeddings,\n    'BertLastFourEmbeddingsPooler': BertLastFourEmbeddingsPooler,\n    'BertDynCls': BertDynCls,\n    'BertDynEmbeddings': BertDynEmbeddings,\n    'BertRNN': BertRNN,\n    'BertCNN': BertCNN,\n    'BertRCNN': BertRCNN,\n    'XLNet': XLNet,\n    'Electra': Electra,\n    'NEZHA': NEZHA,\n\n}\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"BertForClass\"\n        self.Stratification = False\n        self.model_path = '../pretrain/bert_model/'\n\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 32\n        self.epoch = 3\n        self.learn_rate = 4e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 32\n        self.k_fold = 10\n        self.seed = 42\n\n        self.device = torch.device('cuda')\n        # self.device = torch.device('cpu')\n\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n\nconfig = Config()\nos.environ['PYTHONHASHSEED']='0'#消除hash算法的随机性\nrandom.seed(config.seed)\nnp.random.seed(config.seed)\ntorch.manual_seed(config.seed)\ntorch.cuda.manual_seed_all(config.seed)\n\n\nfile_path = './log/'\n# 创建一个logger\nlogger = logging.getLogger('mylogger')\nlogger.setLevel(logging.DEBUG)\n\n\ntrain = pd.read_csv('/tcdata/gaiic_track3_round1_train_20210228.tsv',sep='\\t',header=None)\nsemi = pd.read_csv('/tcdata/gaiic_track3_round2_train_20210407.tsv',sep='\\t',header=None)\ntrain = pd.concat([train, semi], sort=False)\ntrain.columns=['q1','q2','label']\n\n\ntrain_query1 = train['q1'].values.astype(str)\ntrain_query2 = train['q2'].values.astype(str)\ntrain_label = train['label'].values.astype(int)\n\n\noof_train = np.zeros((len(train), config.num_class), dtype=np.float32)\n\n\n#kf = StratifiedKFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\nkf = KFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\n\nfor fold, (train_index, valid_index) in enumerate(kf.split(train_query1, train_label)):\n\n    print('\\n\\n------------fold:{}------------\\n'.format(fold))\n\n    '''\n    q1 = train_query1[train_index]\n    q2 = train_query2[train_index]\n    y = train_label[train_index]\n    '''\n    q1 = train_query1\n    q2 = train_query2\n    y = train_label\n\n\n    val_q1 = train_query1[valid_index]\n    val_q2 = train_query2[valid_index]\n    
val_y = train_label[valid_index]\n\n    train_D = data_generator([q1, q2, y], config, shuffle=True)\n    val_D = data_generator([val_q1, val_q2, val_y], config)\n\n    model = MODEL_CLASSES[config.model](config).to(config.device)\n\n    if torch.cuda.device_count() > 1:\n        print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n        model = torch.nn.DataParallel(model)\n\n\n    if config.pgd:\n        pgd = PGD(model)\n        K = 3\n\n    elif config.fgm:\n        fgm = FGM(model)\n\n    if config.focalloss:\n        loss_fn = FocalLoss(config.num_class)\n    else:\n        loss_fn = nn.CrossEntropyLoss()  # BCEWithLogitsLoss就是把Sigmoid-BCELoss合成一步\n\n\n    num_train_steps = int(len(train) / config.batch_size * config.epoch)\n    param_optimizer = list(model.named_parameters())\n\n    no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n\n    if config.Stratification:\n        bert_params = [x for x in param_optimizer if 'bert' in x[0]]\n        normal_params = [p for n, p in param_optimizer if 'bert' not in n]\n        optimizer_parameters = [\n            {'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n            {'params': normal_params, 'lr': config.normal_lr},\n        ]\n    else:\n        optimizer_parameters = [\n            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n        ]\n\n    optimizer = AdamW(optimizer_parameters, lr=config.learn_rate) # lr为全局学习率\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer,\n        num_warmup_steps=int(len(train) / config.batch_size / 2),\n        num_training_steps=num_train_steps\n    )\n\n    best_auc = 0\n    PATH = './models/bert_{}.pth'.format(fold)\n    save_model_path = './models/'\n    if not os.path.exists(save_model_path):\n        os.makedirs(save_model_path)\n\n    for e in range(config.epoch):\n        print('\\n------------epoch:{}------------'.format(e))\n        model.train()\n        acc = 0\n        train_len = 0\n        loss_num = 0\n        tq = tqdm(train_D,ncols=70,disable=True)\n        last=time.time()\n        for input_ids, input_masks, segment_ids, labels in tq:\n            label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n            y_pred = model(input_ids, input_masks, segment_ids)\n\n            loss = loss_fn(y_pred, label_t)\n            loss = loss.mean()\n            loss.backward()\n\n            if config.pgd:\n                pgd.backup_grad()\n                # 对抗训练\n                for t in range(K):\n                    pgd.attack(is_first_attack=(t == 0))  # 在embedding上添加对抗扰动, first attack时备份param.data\n                    if t != K - 1:\n                        model.zero_grad()\n                    else:\n                        pgd.restore_grad()\n                    y_pred = model(input_ids, input_masks, segment_ids)\n\n                    loss_adv = loss_fn(y_pred, label_t)\n                    loss_adv = loss_adv.mean()\n                    loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                pgd.restore()  # 恢复embedding参数\n\n            elif config.fgm:\n                # 对抗训练\n                fgm.attack()  # 在embedding上添加对抗扰动\n                y_pred = model(input_ids, input_masks, 
segment_ids)\n                loss_adv = loss_fn(y_pred, label_t)\n                loss_adv = loss_adv.mean()\n                loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                fgm.restore()  # 恢复embedding参数\n\n\n            # 梯度下降，更新参数\n            optimizer.step()\n            scheduler.step()  # Update learning rate schedule\n            model.zero_grad()\n\n            y_pred = np.argmax(y_pred.detach().to(\"cpu\").numpy(), axis=1)\n            acc += sum(y_pred == labels)\n            loss_num += loss.item()\n            train_len += len(labels)\n            tq.set_postfix(fold=fold, epoch=e, loss=loss_num / train_len, acc=acc / train_len)\n        print(f\"微调第{e}轮耗时：{time.time()-last}\")\n        model.eval()\n        with torch.no_grad():\n            y_p = []\n            y_l = []\n            train_logit = None\n            for input_ids, input_masks, segment_ids, labels in tqdm(val_D,disable=True):\n                label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n                y_pred = model(input_ids, input_masks, segment_ids)\n                y_pred = F.softmax(y_pred)\n                y_pred = y_pred.detach().to(\"cpu\").numpy()\n                if train_logit is None:\n                    train_logit = y_pred\n                else:\n                    train_logit = np.vstack((train_logit, y_pred))\n\n                y_p += list(y_pred[:,1])\n\n                y_pred = np.argmax(y_pred, axis=1)\n                y_l += list(y_pred)\n\n\n            f1 = f1_score(val_y, y_l, average=\"macro\")\n            auc_score = roc_auc_score(val_y, y_p)\n            print(\"best_auc:{}  auc_score:{}  f1:{}\\n\".format(best_auc, auc_score, f1))\n            if auc_score >= best_auc:\n                best_auc = auc_score\n                oof_train[valid_index] = np.array(train_logit)\n                #torch.save(model.module.state_dict() if hasattr(model, \"module\") else model.state_dict(), PATH)\n                torch.save(model.module if hasattr(model, \"module\") else model, PATH)\n\n    optimizer.zero_grad()\n\n    del model\n    torch.cuda.empty_cache()\n\n    break\n\n"
  },
  {
    "path": "code/bert-base-count3/finetuning/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\nclass data_generator:\n    def __init__(self, data, config, shuffle=False):\n        self.data = data\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n        self.steps = len(self.data[0]) // self.batch_size\n        if len(self.data[0]) % self.batch_size != 0:\n            self.steps += 1\n\n    def __len__(self):\n        return self.steps\n\n    def __iter__(self):\n        q1, q2, y = self.data\n        idxs = list(range(len(self.data[0])))\n        if self.shuffle:\n            np.random.shuffle(idxs)\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        for index, i in enumerate(idxs):\n\n            text = q1[i]\n            text_pair = q2[i]\n            '''\n            # text = self.tokenizer(text, text_pair, padding='max_length', truncation=True, max_length=self.max_length)\n            text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n            input_ids.append(text['input_ids'])\n            segment_ids.append(text['token_type_ids'])\n            input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n            yield input_ids, input_masks, segment_ids, labels\n            input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n            '''\n            tkRes = self.tokenizer(text, text_pair, max_length=self.max_length, truncation='longest_first',\n                                   return_attention_mask=False)\n            input_id = tkRes['input_ids']\n            segment_id = tkRes['token_type_ids']\n            assert len(segment_id) == len(input_id)\n            input_ids.append(input_id)\n            segment_ids.append(segment_id)\n            labels.append(y[i])\n\n            if len(input_ids) == self.batch_size or i == idxs[-1]:\n                input_ids = paddingList(input_ids, 0, returnTensor=True)  # 动态padding\n                segment_ids = paddingList(segment_ids, 0, returnTensor=True)\n           
     input_masks = (input_ids != 0)\n                yield input_ids, input_masks, segment_ids, labels\n                input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=0.3, alpha=0.1, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\n\nclass FGM():\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.25, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# 支持多分类和二分类\nclass FocalLoss(nn.Module):\n    \"\"\"\n    This is a implementation of Focal Loss with smooth label cross entropy supported which is proposed in\n    'Focal Loss for Dense Object Detection. 
(https://arxiv.org/abs/1708.02002)'\n    Focal_Loss= -1*alpha*(1-pt)^gamma*log(pt)\n    :param num_class:\n    :param alpha: (tensor) 3D or 4D the scalar factor for this criterion\n    :param gamma: (float,double) gamma > 0 reduces the relative loss\n    for well-classified examples (p>0.5) putting more\n    focus on hard misclassified example\n    :param smooth: (float,double) smooth value when cross entropy\n    :param balance_index: (int) balance class index,\n    should be specific when alpha is float\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n    def __init__(self, num_class, alpha=None, gamma=2,\n                smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Not support alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        # N = input.size(0)\n        # alpha = torch.ones(N, self.num_class)\n        # alpha = alpha * (1 - self.alpha)\n        # alpha = alpha.scatter_(1, target.long(), self.alpha)\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        alpha = alpha[idx]\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true,y_pred):\n    acc = sum(y_pred & y_true) / (sum(y_pred))\n    rec = sum(y_pred & y_true) / (sum(y_true))\n\n    return 2 * acc * rec /(acc + rec)"
  },
  {
    "path": "code/bert-base-count3/pretrain/NLP_Utils.py",
    "content": "import random\nimport json\nimport transformers as _\nfrom transformers1 import BertTokenizer\nimport torch\nfrom torch.utils.data import Dataset,DataLoader\nimport numpy as np\nfrom itertools import chain\n\ndef writeToJsonFile(path: str, obj):\n    with open(path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(obj, ensure_ascii=False,indent=0))\ndef readFromJsonFile(path: str):\n    with open(path, \"r\", encoding=\"utf-8\") as f:\n        return json.loads(f.read())\n\ndef loadData(path):\n    allData=[]\n    with open(path,\"r\") as f:\n        for i in f:\n            i=i.strip().split('\\t')\n            if len(i)==0:#防止空行\n                break\n            if len(i)==3:#训练集\n                a,b,label=i\n                a=a.split(' ')\n                b=b.split(' ')\n            else:#测试集，直接转为id形式\n                a,b,label=i[0],i[1],-1\n                a=a.split(' ')\n                b=b.split(' ')\n            allData.append([a,b,label])\n    return allData\n\ndef calNegPos(ls):#计算正负比例\n    posNum,negNum=0,0\n    for i in ls:\n        if i[2]==0:\n            negNum+=1\n        elif i[2]==1:\n            posNum+=1\n    posNum=1 if posNum==0 else posNum\n    return negNum,posNum,round(negNum/posNum,4)\n\nallData=loadData('/tcdata/gaiic_track3_round1_train_20210228.tsv')+loadData('/tcdata/gaiic_track3_round2_train_20210407.tsv')\ntestA_data = loadData('/tcdata/gaiic_track3_round1_testA_20210228.tsv')\ntestB_data = loadData('/tcdata/gaiic_track3_round1_testB_20210317.tsv')\nrandom.shuffle(allData)\n\ntrain_data=allData+testA_data+testB_data#全量\nvalid_data=allData[-20000:]\nprint(\"训练集样本数量：\", len(train_data))\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef truncate(a:list,b:list,maxLen):\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    return a,b\n\nclass MLM_Data(Dataset):\n    #传入句子对列表\n    def __init__(self,textLs:list,maxLen:int,tk:BertTokenizer):\n        super().__init__()\n        self.data=textLs\n        self.maxLen=maxLen\n        self.tk=tk\n        self.spNum=len(tk.all_special_tokens)\n        self.tkNum=tk.vocab_size\n\n    def __len__(self):\n        return len(self.data)\n\n    def random_mask(self,text_ids):\n        input_ids, output_ids = [], []\n        rands = np.random.random(len(text_ids))\n        idx=0\n        while idx<len(rands):\n            if rands[idx]<0.15:#需要mask\n                ngram=np.random.choice([1,2,3], p=[0.7,0.2,0.1])#若要mask，进行x_gram mask的概率\n                if ngram==3 and len(rands)<7:#太大的gram不要应用于过短文本\n                    ngram=2\n                if ngram==2 and len(rands)<4:\n                    ngram=1\n                L=idx+1\n                R=idx+ngram#最终需要mask的右边界（开）\n                while L<R and L<len(rands):\n                    rands[L]=np.random.random()*0.15#强制mask\n                    L+=1\n                idx=R\n                if idx<len(rands):\n                    rands[idx]=1#禁止mask片段的下一个token被mask，防止一大片连续mask\n 
           idx+=1\n\n        for r, i in zip(rands, text_ids):\n            if r < 0.15 * 0.8:\n                input_ids.append(self.tk.mask_token_id)\n                output_ids.append(i)#mask预测自己\n            elif r < 0.15 * 0.9:\n                input_ids.append(i)\n                output_ids.append(i)#自己预测自己\n            elif r < 0.15:\n                input_ids.append(np.random.randint(self.spNum,self.tkNum))\n                output_ids.append(i)#随机的一个词预测自己，随机词不会从特殊符号中选取，有小概率抽到自己\n            else:\n                input_ids.append(i)\n                output_ids.append(-100)#保持原样不预测\n\n        return input_ids, output_ids\n\n    #耗时操作在此进行，可用上多进程\n    def __getitem__(self, item):\n        text1,text2,_=self.data[item]#预处理，mask等操作\n        if random.random()>0.5:\n            text1,text2=text2,text1#交换位置\n        text1,text2=truncate(text1,text2,self.maxLen)\n        text1_ids,text2_ids = self.tk.convert_tokens_to_ids(text1),self.tk.convert_tokens_to_ids(text2)\n        text1_ids, out1_ids = self.random_mask(text1_ids)#添加mask预测\n        text2_ids, out2_ids = self.random_mask(text2_ids)\n        input_ids = [self.tk.cls_token_id] + text1_ids + [self.tk.sep_token_id] + text2_ids + [self.tk.sep_token_id]#拼接\n        token_type_ids=[0]*(len(text1_ids)+2)+[1]*(len(text2_ids)+1)\n        labels = [-100] + out1_ids + [-100] + out2_ids + [-100]\n        assert len(input_ids)==len(token_type_ids)==len(labels)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids,'labels':labels}\n\n    @classmethod\n    def collate(cls,batch):\n        input_ids=[i['input_ids'] for i in batch]\n        token_type_ids=[i['token_type_ids'] for i in batch]\n        labels=[i['labels'] for i in batch]\n        input_ids=paddingList(input_ids,0,returnTensor=True)\n        token_type_ids=paddingList(token_type_ids,0,returnTensor=True)\n        labels=paddingList(labels,-100,returnTensor=True)\n        attention_mask=(input_ids!=0)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids\n                ,'attention_mask':attention_mask,'labels':labels}\n\n\n\n\nunionList=lambda ls:list(chain(*ls))#按元素拼接\nsplitList=lambda x,bs:[x[i:i+bs] for i in range(0,len(x),bs)]#按bs切分\n\n\n#sortBsNum：原序列按多少个bs块为单位排序，可用来增强随机性\n#比如如果每次打乱后都全体一起排序，那每次都是一样的\ndef blockShuffle(data:list,bs:int,sortBsNum,key):\n    random.shuffle(data)#先打乱\n    tail=len(data)%bs#计算碎片长度\n    tail=[] if tail==0 else data[-tail:]\n    data=data[:len(data)-len(tail)]\n    assert len(data)%bs==0#剩下的一定能被bs整除\n    sortBsNum=len(data)//bs if sortBsNum is None else sortBsNum#为None就是整体排序\n    data=splitList(data,sortBsNum*bs)\n    data=[sorted(i,key=key,reverse=True) for i in data]#每个大块进行降排序\n    data=unionList(data)\n    data=splitList(data,bs)#最后，按bs分块\n    random.shuffle(data)#块间打乱\n    data=unionList(data)+tail\n    return data\nfrom torch.utils.data.dataloader import _SingleProcessDataLoaderIter,_MultiProcessingDataLoaderIter\n#每轮迭代重新分块shuffle数据的DataLoader\nclass blockShuffleDataLoader(DataLoader):\n    def __init__(self, dataset: Dataset,sortBsNum,key,**kwargs):\n        assert isinstance(dataset.data,list)#需要有list类型的data属性\n        super().__init__(dataset,**kwargs)#父类的参数传过去\n        self.sortBsNum=sortBsNum\n        self.key=key\n\n    def __iter__(self):\n        #分块shuffle\n        self.dataset.data=blockShuffle(self.dataset.data,self.batch_size,self.sortBsNum,self.key)\n        if self.num_workers == 0:\n            return _SingleProcessDataLoaderIter(self)\n        else:\n            return 
_MultiProcessingDataLoaderIter(self)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/__init__.py",
    "content": ""
  },
  {
    "path": "code/bert-base-count3/pretrain/bert_model/gitkeep",
    "content": ""
  },
  {
    "path": "code/bert-base-count3/pretrain/train_bert.py",
    "content": "# coding:utf-8\nimport numpy as np\nimport random\nimport os\nrandom.seed(0)\nnp.random.seed(0)#seed应该在main里尽早设置，以防万一\nos.environ['PYTHONHASHSEED'] =str(0)#消除hash算法的随机性\nfrom transformers import BertForMaskedLM#除nezha外模型用新版加载\nfrom transformers1 import Trainer, TrainingArguments,BertTokenizer,BertConfig\nfrom NLP_Utils import MLM_Data,train_data,blockShuffleDataLoader\n\nmaxlen=32\nbatch_size=128\nvocab_file_dir = './bert_model/vocab.txt'\ntokenizer = BertTokenizer.from_pretrained(vocab_file_dir)\n\nconfig = BertConfig(\n    vocab_size=len(tokenizer),\n    hidden_size=768,\n    num_hidden_layers=12,\n    num_attention_heads=12,\n    max_position_embeddings=512,\n)\n\n# 把层数改为8层\nmodel = BertForMaskedLM.from_pretrained('../../bert-base-chinese')\n\nmodel.resize_token_embeddings(len(tokenizer))\nprint(model)\ntrain_MLM_data=MLM_Data(train_data,maxlen,tokenizer)\n#自己定义dataloader，不要用huggingface的\ndl=blockShuffleDataLoader(train_MLM_data,None,key=lambda x:len(x[0])+len(x[1]),shuffle=False\n                          ,batch_size=batch_size,collate_fn=train_MLM_data.collate)\n\ntraining_args = TrainingArguments(\n    output_dir='./bert_output',\n    overwrite_output_dir=True,\n    num_train_epochs=400,\n    per_device_train_batch_size=batch_size,\n    save_steps=len(dl)*10000,#每10个epoch save一次\n    save_total_limit=3,\n    logging_steps=len(dl),#每个epoch log一次\n    seed=2021,\n    learning_rate=5e-5,\n    lr_end=1e-5,#学习率衰减的终点\n    weight_decay=0.01,\n    warmup_steps=int(450000*150/batch_size*0.03)\n)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataLoader=dl,\n    prediction_loss_only=True,\n)\n\nif __name__ == '__main__':\n    trainer.train()\n    trainer.save_model('./bert_model')\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\n__version__ = \"2.11.0\"\n\n# Work around to update TensorFlow's absl.logging threshold which alters the\n# default Python logging output behavior when present.\n# see: https://github.com/abseil/abseil-py/issues/99\n# and: https://github.com/tensorflow/tensorflow/issues/26691#issuecomment-500369493\ntry:\n    import absl.logging\nexcept ImportError:\n    pass\nelse:\n    absl.logging.set_verbosity(\"info\")\n    absl.logging.set_stderrthreshold(\"info\")\n    absl.logging._warn_preinit_stderr = False\n\nimport logging\n\n# Configurations\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig\nfrom .configuration_bart import BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_mmbt import MMBTConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\nfrom .data import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    is_sklearn_available,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n# Files and general utilities\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    PYTORCH_PRETRAINED_BERT_CACHE,\n    PYTORCH_TRANSFORMERS_CACHE,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    TRANSFORMERS_CACHE,\n    WEIGHTS_NAME,\n    add_end_docstrings,\n    add_start_docstrings,\n    cached_path,\n    is_tf_available,\n    is_torch_available,\n)\nfrom .hf_argparser import HfArgumentParser\n\n# Model Cards\nfrom .modelcard import ModelCard\n\n# TF 2.0 <=> PyTorch 
conversion utilities\nfrom .modeling_tf_pytorch_utils import (\n    convert_tf_weight_name_to_pt_weight_name,\n    load_pytorch_checkpoint_in_tf2_model,\n    load_pytorch_model_in_tf2_model,\n    load_pytorch_weights_in_tf2_model,\n    load_tf2_checkpoint_in_pytorch_model,\n    load_tf2_model_in_pytorch_model,\n    load_tf2_weights_in_pytorch_model,\n)\n\n# Pipelines\nfrom .pipelines import (\n    CsvPipelineDataFormat,\n    FeatureExtractionPipeline,\n    FillMaskPipeline,\n    JsonPipelineDataFormat,\n    NerPipeline,\n    PipedPipelineDataFormat,\n    Pipeline,\n    PipelineDataFormat,\n    QuestionAnsweringPipeline,\n    SummarizationPipeline,\n    TextClassificationPipeline,\n    TextGenerationPipeline,\n    TokenClassificationPipeline,\n    TranslationPipeline,\n    pipeline,\n)\n\n# Tokenizers\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer\nfrom .tokenization_bart import BartTokenizer, MBartTokenizer\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer\nfrom .trainer_utils import EvalPrediction\nfrom .training_args import TrainingArguments\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\nif is_sklearn_available():\n    from .data import glue_compute_metrics, xnli_compute_metrics\n\n\n# Modeling\nif is_torch_available():\n    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering, apply_chunking_to_forward\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForPreTraining,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelWithLMHead,\n        AutoModelForTokenClassification,\n        AutoModelForMultipleChoice,\n        MODEL_MAPPING,\n        MODEL_FOR_PRETRAINING_MAPPING,\n        MODEL_WITH_LM_HEAD_MAPPING,\n        MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n        MODEL_FOR_MULTIPLE_CHOICE_MAPPING,\n    )\n\n    from .modeling_bert import (\n        BertPreTrainedModel,\n        BertModel,\n        BertForPreTraining,\n        BertForMaskedLM,\n        BertForNextSentencePrediction,\n        BertForSequenceClassification,\n        BertForMultipleChoice,\n        
BertForTokenClassification,\n        BertForQuestionAnswering,\n        load_tf_weights_in_bert,\n        BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n        BertLayer,\n    )\n    from .modeling_openai import (\n        OpenAIGPTPreTrainedModel,\n        OpenAIGPTModel,\n        OpenAIGPTLMHeadModel,\n        OpenAIGPTDoubleHeadsModel,\n        load_tf_weights_in_openai_gpt,\n        OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_transfo_xl import (\n        TransfoXLPreTrainedModel,\n        TransfoXLModel,\n        TransfoXLLMHeadModel,\n        AdaptiveEmbedding,\n        load_tf_weights_in_transfo_xl,\n        TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_gpt2 import (\n        GPT2PreTrainedModel,\n        GPT2Model,\n        GPT2LMHeadModel,\n        GPT2DoubleHeadsModel,\n        load_tf_weights_in_gpt2,\n        GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_ctrl import CTRLPreTrainedModel, CTRLModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_LIST\n    from .modeling_xlnet import (\n        XLNetPreTrainedModel,\n        XLNetModel,\n        XLNetLMHeadModel,\n        XLNetForSequenceClassification,\n        XLNetForTokenClassification,\n        XLNetForMultipleChoice,\n        XLNetForQuestionAnsweringSimple,\n        XLNetForQuestionAnswering,\n        load_tf_weights_in_xlnet,\n        XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm import (\n        XLMPreTrainedModel,\n        XLMModel,\n        XLMWithLMHeadModel,\n        XLMForSequenceClassification,\n        XLMForTokenClassification,\n        XLMForQuestionAnswering,\n        XLMForQuestionAnsweringSimple,\n        XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_bart import (\n        BartForSequenceClassification,\n        BartModel,\n        BartForConditionalGeneration,\n        BART_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_marian import MarianMTModel\n    from .tokenization_marian import MarianTokenizer\n    from .modeling_roberta import (\n        RobertaForMaskedLM,\n        RobertaModel,\n        RobertaForSequenceClassification,\n        RobertaForMultipleChoice,\n        RobertaForTokenClassification,\n        RobertaForQuestionAnswering,\n        ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_distilbert import (\n        DistilBertPreTrainedModel,\n        DistilBertForMaskedLM,\n        DistilBertModel,\n        DistilBertForSequenceClassification,\n        DistilBertForQuestionAnswering,\n        DistilBertForTokenClassification,\n        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_camembert import (\n        CamembertForMaskedLM,\n        CamembertModel,\n        CamembertForSequenceClassification,\n        CamembertForMultipleChoice,\n        CamembertForTokenClassification,\n        CamembertForQuestionAnswering,\n        CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_encoder_decoder import EncoderDecoderModel\n    from .modeling_t5 import (\n        T5PreTrainedModel,\n        T5Model,\n        T5ForConditionalGeneration,\n        load_tf_weights_in_t5,\n        T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_albert import (\n        AlbertPreTrainedModel,\n        AlbertModel,\n        AlbertForPreTraining,\n        AlbertForMaskedLM,\n        AlbertForSequenceClassification,\n        AlbertForQuestionAnswering,\n        AlbertForTokenClassification,\n        load_tf_weights_in_albert,\n        
ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm_roberta import (\n        XLMRobertaForMaskedLM,\n        XLMRobertaModel,\n        XLMRobertaForMultipleChoice,\n        XLMRobertaForSequenceClassification,\n        XLMRobertaForTokenClassification,\n        XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_mmbt import ModalEmbeddings, MMBTModel, MMBTForClassification\n\n    from .modeling_flaubert import (\n        FlaubertModel,\n        FlaubertWithLMHeadModel,\n        FlaubertForSequenceClassification,\n        FlaubertForQuestionAnswering,\n        FlaubertForQuestionAnsweringSimple,\n        FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_electra import (\n        ElectraForPreTraining,\n        ElectraForMaskedLM,\n        ElectraForTokenClassification,\n        ElectraPreTrainedModel,\n        ElectraForSequenceClassification,\n        ElectraModel,\n        load_tf_weights_in_electra,\n        ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_reformer import (\n        ReformerAttention,\n        ReformerLayer,\n        ReformerModel,\n        ReformerModelWithLMHead,\n        REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_longformer import (\n        LongformerModel,\n        LongformerForMaskedLM,\n        LongformerForSequenceClassification,\n        LongformerForMultipleChoice,\n        LongformerForTokenClassification,\n        LongformerForQuestionAnswering,\n        LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization import (\n        AdamW,\n        get_constant_schedule,\n        get_constant_schedule_with_warmup,\n        get_cosine_schedule_with_warmup,\n        get_cosine_with_hard_restarts_schedule_with_warmup,\n        get_linear_schedule_with_warmup,\n    )\n\n    # Trainer\n    from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction\n    from .data.data_collator import DefaultDataCollator, DataCollator, DataCollatorForLanguageModeling\n    from .data.datasets import GlueDataset, TextDataset, LineByLineTextDataset, GlueDataTrainingArguments\n\n    # Benchmarks\n    from .benchmark import PyTorchBenchmark, PyTorchBenchmarkArguments\n\n# TensorFlow\nif is_tf_available():\n    from .modeling_tf_utils import (\n        TFPreTrainedModel,\n        TFSharedEmbeddings,\n        TFSequenceSummary,\n        shape_list,\n        tf_top_k_top_p_filtering,\n    )\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForPreTraining,\n        TFAutoModelForMultipleChoice,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelWithLMHead,\n        TFAutoModelForTokenClassification,\n        TF_MODEL_MAPPING,\n        TF_MODEL_FOR_PRETRAINING_MAPPING,\n        TF_MODEL_WITH_LM_HEAD_MAPPING,\n        TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n    )\n\n    from .modeling_tf_bert import (\n        TFBertPreTrainedModel,\n        TFBertMainLayer,\n        TFBertEmbeddings,\n        TFBertModel,\n        TFBertForPreTraining,\n        TFBertForMaskedLM,\n        TFBertForNextSentencePrediction,\n        TFBertForSequenceClassification,\n        TFBertForMultipleChoice,\n        TFBertForTokenClassification,\n        TFBertForQuestionAnswering,\n        TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_gpt2 import 
(\n        TFGPT2PreTrainedModel,\n        TFGPT2MainLayer,\n        TFGPT2Model,\n        TFGPT2LMHeadModel,\n        TFGPT2DoubleHeadsModel,\n        TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_openai import (\n        TFOpenAIGPTPreTrainedModel,\n        TFOpenAIGPTMainLayer,\n        TFOpenAIGPTModel,\n        TFOpenAIGPTLMHeadModel,\n        TFOpenAIGPTDoubleHeadsModel,\n        TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_transfo_xl import (\n        TFTransfoXLPreTrainedModel,\n        TFTransfoXLMainLayer,\n        TFTransfoXLModel,\n        TFTransfoXLLMHeadModel,\n        TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n        TFAdaptiveEmbedding,\n    )\n\n    from .modeling_tf_xlnet import (\n        TFXLNetPreTrainedModel,\n        TFXLNetMainLayer,\n        TFXLNetModel,\n        TFXLNetLMHeadModel,\n        TFXLNetForSequenceClassification,\n        TFXLNetForTokenClassification,\n        TFXLNetForQuestionAnsweringSimple,\n        TF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm import (\n        TFXLMPreTrainedModel,\n        TFXLMMainLayer,\n        TFXLMModel,\n        TFXLMWithLMHeadModel,\n        TFXLMForSequenceClassification,\n        TFXLMForQuestionAnsweringSimple,\n        TF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm_roberta import (\n        TFXLMRobertaForMaskedLM,\n        TFXLMRobertaModel,\n        TFXLMRobertaForSequenceClassification,\n        TFXLMRobertaForTokenClassification,\n        TF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_roberta import (\n        TFRobertaPreTrainedModel,\n        TFRobertaMainLayer,\n        TFRobertaModel,\n        TFRobertaForMaskedLM,\n        TFRobertaForSequenceClassification,\n        TFRobertaForTokenClassification,\n        TFRobertaForQuestionAnswering,\n        TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_camembert import (\n        TFCamembertModel,\n        TFCamembertForMaskedLM,\n        TFCamembertForSequenceClassification,\n        TFCamembertForTokenClassification,\n        TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_flaubert import (\n        TFFlaubertModel,\n        TFFlaubertWithLMHeadModel,\n        TFFlaubertForSequenceClassification,\n        TF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_distilbert import (\n        TFDistilBertPreTrainedModel,\n        TFDistilBertMainLayer,\n        TFDistilBertModel,\n        TFDistilBertForMaskedLM,\n        TFDistilBertForSequenceClassification,\n        TFDistilBertForTokenClassification,\n        TFDistilBertForQuestionAnswering,\n        TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_ctrl import (\n        TFCTRLPreTrainedModel,\n        TFCTRLModel,\n        TFCTRLLMHeadModel,\n        TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_albert import (\n        TFAlbertPreTrainedModel,\n        TFAlbertMainLayer,\n        TFAlbertModel,\n        TFAlbertForPreTraining,\n        TFAlbertForMaskedLM,\n        TFAlbertForMultipleChoice,\n        TFAlbertForSequenceClassification,\n        TFAlbertForQuestionAnswering,\n        TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_t5 import (\n        TFT5PreTrainedModel,\n        TFT5Model,\n        TFT5ForConditionalGeneration,\n        TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_electra 
import (\n        TFElectraPreTrainedModel,\n        TFElectraModel,\n        TFElectraForPreTraining,\n        TFElectraForMaskedLM,\n        TFElectraForTokenClassification,\n        TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization_tf import WarmUp, create_optimizer, AdamWeightDecay, GradientAccumulator\n\n    # Trainer\n    from .trainer_tf import TFTrainer\n\n\nif not is_tf_available() and not is_torch_available():\n    logger.warning(\n        \"Neither PyTorch nor TensorFlow >= 2.0 have been found.\"\n        \"Models won't be available and only tokenizers, configuration\"\n        \"and file/data utilities can be used.\"\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/__main__.py",
    "content": "# coding: utf8\ndef main():\n    import sys\n    if (len(sys.argv) < 4 or len(sys.argv) > 6) or sys.argv[1] not in [\"bert\", \"gpt\", \"transfo_xl\", \"gpt2\", \"xlnet\", \"xlm\"]:\n        print(\n        \"This command line utility let you convert original (author released) model checkpoint to pytorch.\\n\"\n        \"It should be used as one of: \\n\"\n        \">> transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \\n\"\n        \">> transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG], \\n\"\n        \">> transformers1 transfo_xl TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG] or \\n\"\n        \">> transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [GPT2_CONFIG] or \\n\"\n        \">> transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME] or \\n\"\n        \">> transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT\")\n    else:\n        if sys.argv[1] == \"bert\":\n            try:\n                from .convert_bert_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) != 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`\")\n            else:\n                PYTORCH_DUMP_OUTPUT = sys.argv.pop()\n                TF_CONFIG = sys.argv.pop()\n                TF_CHECKPOINT = sys.argv.pop()\n                convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"gpt\":\n            from .convert_openai_original_tf_checkpoint_to_pytorch import convert_openai_checkpoint_to_pytorch\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG]`\")\n            else:\n                OPENAI_GPT_CHECKPOINT_FOLDER_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    OPENAI_GPT_CONFIG = sys.argv[4]\n                else:\n                    OPENAI_GPT_CONFIG = \"\"\n                convert_openai_checkpoint_to_pytorch(OPENAI_GPT_CHECKPOINT_FOLDER_PATH,\n                                                    OPENAI_GPT_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"transfo_xl\":\n            try:\n                from .convert_transfo_xl_original_tf_checkpoint_to_pytorch import convert_transfo_xl_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 transfo_xl TF_CHECKPOINT/TF_DATASET_FILE PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                if 'ckpt' in sys.argv[2].lower():\n                    TF_CHECKPOINT = sys.argv[2]\n                    TF_DATASET_FILE = \"\"\n                else:\n                    TF_DATASET_FILE = sys.argv[2]\n                    TF_CHECKPOINT = \"\"\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_transfo_xl_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT, TF_DATASET_FILE)\n        elif sys.argv[1] == \"gpt2\":\n            try:\n                from .convert_gpt2_original_tf_checkpoint_to_pytorch import convert_gpt2_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_gpt2_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"xlnet\":\n            try:\n                from .convert_xlnet_original_tf_checkpoint_to_pytorch import convert_xlnet_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 5 or len(sys.argv) > 6:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                TF_CONFIG = sys.argv[3]\n                PYTORCH_DUMP_OUTPUT = sys.argv[4]\n                if len(sys.argv) == 6:\n                    FINETUNING_TASK = sys.argv[5]\n                else:\n                    FINETUNING_TASK = None\n\n                convert_xlnet_checkpoint_to_pytorch(TF_CHECKPOINT,\n                                                    TF_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT,\n                                                    FINETUNING_TASK)\n        elif sys.argv[1] == \"xlm\":\n            from .convert_xlm_original_pytorch_checkpoint_to_pytorch import convert_xlm_checkpoint_to_pytorch\n\n            if len(sys.argv) != 4:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT`\")\n            else:\n                XLM_CHECKPOINT_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n\n                convert_xlm_checkpoint_to_pytorch(XLM_CHECKPOINT_PATH, PYTORCH_DUMP_OUTPUT)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/activations.py",
    "content": "import logging\nimport math\n\nimport torch\nimport torch.nn.functional as F\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef swish(x):\n    return x * torch.sigmoid(x)\n\n\ndef _gelu_python(x):\n    \"\"\" Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        This is now written in C in torch.nn.functional\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))\n\n\ndef gelu_new(x):\n    \"\"\" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))\n\n\nif torch.__version__ < \"1.4.0\":\n    gelu = _gelu_python\nelse:\n    gelu = F.gelu\n\n\ndef gelu_fast(x):\n    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))\n\n\nACT2FN = {\n    \"relu\": F.relu,\n    \"swish\": swish,\n    \"gelu\": gelu,\n    \"tanh\": torch.tanh,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n}\n\n\ndef get_activation(activation_string):\n    if activation_string in ACT2FN:\n        return ACT2FN[activation_string]\n    else:\n        raise KeyError(\"function {} not found in ACT2FN mapping {}\".format(activation_string, list(ACT2FN.keys())))\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/another_try.py",
    "content": "from transformers import TFBertModel, BertTokenizer, BertConfig\nimport tensorflow as tf\n\nconfig = BertConfig.from_pretrained(\"bert-base-cased\", output_hidden_states=True)\nmodel = TFBertModel.from_pretrained(\"bert-base-cased\", config=config)\n\ntok = BertTokenizer.from_pretrained(\"bert-base-cased\")\ntext = tok.encode(\"Ain't this [MASK] best thing you've ever seen?\")\n\ninputs = tf.constant(text)\noutputs = model.predict(inputs)\n\nprint(outputs)"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/benchmark/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom ..file_utils import is_torch_available\n\n\nif is_torch_available():\n    from .benchmark_args import PyTorchBenchmarkArguments\n    from .benchmark import PyTorchBenchmark\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/benchmark/benchmark.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\n    Benchmarking the library on inference and training in PyTorch.\n\"\"\"\n\n\nimport inspect\nimport logging\nimport timeit\n\nfrom transformers import MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, PretrainedConfig, is_torch_available\n\nfrom .benchmark_utils import Benchmark, Memory, start_memory_tracing, stop_memory_tracing\n\n\nif is_torch_available():\n    import torch\n    from .benchmark_args import PyTorchBenchmarkArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PyTorchBenchmark(Benchmark):\n\n    args: PyTorchBenchmarkArguments\n    configs: PretrainedConfig\n    framework: str = \"PyTorch\"\n\n    @property\n    def framework_version(self):\n        return torch.__version__\n\n    def train(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_WITH_LM_HEAD_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.train()\n\n            input_ids = torch.randint(\n                model.config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n\n            def compute_loss_and_backprob():\n                # TODO: Not all models call labels argument labels => this hack using the function signature should be corrected once all models have a common name for labels\n                function_argument_names = inspect.getfullargspec(model.forward).args\n                if \"labels\" in function_argument_names:\n                    loss = model(input_ids, labels=input_ids)[0]\n                elif \"lm_labels\" in function_argument_names:\n                    loss = model(input_ids, lm_labels=input_ids)[0]\n                elif \"masked_lm_labels\" in function_argument_names:\n                    loss = model(input_ids, masked_lm_labels=input_ids)[0]\n                else:\n                    NotImplementedError(f\"{model_name} does not seem to allow training with labels\")\n\n                loss.backward()\n                model.zero_grad()\n\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    torch.cuda.reset_peak_memory_stats()\n\n                # calculate loss and do backpropagation\n                compute_loss_and_backprob()\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    memory = Memory(torch.cuda.max_memory_reserved())\n\n                
return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: compute_loss_and_backprob(), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n\n    def inference(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.eval()\n\n            input_ids = torch.randint(\n                config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        torch.cuda.reset_peak_memory_stats()\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        torch.cuda.reset_max_memory_cached()\n\n                model(input_ids)\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        memory = Memory(torch.cuda.max_memory_reserved())\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        memory = Memory(torch.cuda.max_memory_cached())\n\n                return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: model(input_ids), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/benchmark/benchmark_args.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom ..file_utils import cached_property, is_torch_available, torch_required\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    import torch\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass PyTorchBenchmarkArguments(BenchmarkArguments):\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Whether to run on available cuda devices\"})\n    torchscript: bool = field(default=False, metadata={\"help\": \"Trace the models using torchscript\"})\n    fp16: bool = field(default=False, metadata={\"help\": \"Use FP16 to accelerate inference.\"})\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        else:\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device_idx(self) -> int:\n        return torch.cuda.current_device()\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/benchmark/benchmark_args_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport dataclasses\nimport json\nfrom dataclasses import dataclass, field\nfrom time import time\nfrom typing import List\n\n\ndef list_field(default=None, metadata=None):\n    return field(default_factory=lambda: default, metadata=metadata)\n\n\n@dataclass\nclass BenchmarkArguments:\n    \"\"\"\n    BenchMarkArguments are arguments we use in our benchmark scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    models: List[str] = list_field(\n        default=[],\n        metadata={\n            \"help\": \"Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models\"\n        },\n    )\n\n    batch_sizes: List[int] = list_field(\n        default=[8], metadata={\"help\": \"List of batch sizes for which memory and time performance will be evaluated\"}\n    )\n\n    sequence_lengths: List[int] = list_field(\n        default=[8, 32, 128, 512],\n        metadata={\"help\": \"List of sequence lengths for which memory and time performance will be evaluated\"},\n    )\n\n    no_inference: bool = field(default=False, metadata={\"help\": \"Don't benchmark inference of model\"})\n    training: bool = field(default=False, metadata={\"help\": \"Benchmark training of model\"})\n    verbose: bool = field(default=False, metadata={\"help\": \"Verbose memory tracing\"})\n    no_speed: bool = field(default=False, metadata={\"help\": \"Don't perform speed measurments\"})\n    no_memory: bool = field(default=False, metadata={\"help\": \"Don't perform memory measurments\"})\n    trace_memory_line_by_line: bool = field(default=False, metadata={\"help\": \"Trace memory line by line\"})\n    save_to_csv: bool = field(default=False, metadata={\"help\": \"Save result to a CSV file\"})\n    log_print: bool = field(default=False, metadata={\"help\": \"Save all print statements in a log file\"})\n    no_env_print: bool = field(default=False, metadata={\"help\": \"Don't print environment information\"})\n    inference_time_csv_file: str = field(\n        default=f\"inference_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv.\"},\n    )\n    inference_memory_csv_file: str = field(\n        default=f\"inference_memory_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving memory results to csv.\"},\n    )\n    train_time_csv_file: str = field(\n        default=f\"train_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv for training.\"},\n    )\n    train_memory_csv_file: str = field(\n        default=f\"train_memory_{round(time())}.csv\",\n        metadata={\"help\": 
\"CSV filename used if saving memory results to csv for training.\"},\n    )\n    env_info_csv_file: str = field(\n        default=f\"env_info_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving environment information.\"},\n    )\n    log_filename: str = field(\n        default=f\"log_{round(time())}.csv\",\n        metadata={\"help\": \"Log filename used if print statements are saved in log.\"},\n    )\n    repeat: int = field(default=3, metadata={\"help\": \"Times an experiment will be run.\"})\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    @property\n    def model_names(self):\n        return self.models\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/benchmark/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport copy\nimport csv\nimport linecache\nimport logging\nimport os\nimport platform\nimport sys\nfrom abc import ABC, abstractmethod\nfrom collections import defaultdict, namedtuple\nfrom datetime import datetime\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom transformers import AutoConfig, PretrainedConfig\nfrom transformers import __version__ as version\n\nfrom ..file_utils import is_tf_available, is_torch_available\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\n\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\nBenchmarkOutput = namedtuple(\n    \"BenchmarkOutput\", [\"time_inference_result\", \"memory_inference_result\", \"time_train_result\", \"memory_train_result\"]\n)\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable str of the number of mega bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return str(bytes_to_mega_bytes(self.bytes))\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` 
namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    current: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. 
Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. \"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. 
\" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n\n            for i in devices:\n                handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - 
`memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        memory_curr_trace = []\n\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n\n        for ((frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem),) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n\n            memory_curr_trace.append(\n                MemoryState(\n                    frame=frame,\n                    cpu=Memory(next_cpu_mem),\n                    gpu=Memory(next_gpu_mem),\n                    cpu_gpu=Memory(next_gpu_mem + next_cpu_mem),\n                )\n            )\n\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n   
         cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        memory_curr_trace = sorted(memory_curr_trace, key=lambda x: x.cpu_gpu.bytes, reverse=True)\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n\n        total_memory = Memory(total_memory)\n\n        return MemorySummary(\n            sequential=memory_diff_trace, cumulative=cumulative_memory, current=memory_curr_trace, total=total_memory,\n        )\n\n    return None\n\n\ndef bytes_to_mega_bytes(memory_amount: int) -> int:\n    \"\"\" Utility to convert a number of bytes (int) into a number of mega bytes (int)\n    \"\"\"\n    return memory_amount >> 20\n\n\nclass Benchmark(ABC):\n    \"\"\"\n    Benchmarks is a simple but feature-complete benchmarking script\n    to compare memory and time performance of models in Transformers.\n    \"\"\"\n\n    args: BenchmarkArguments\n    configs: PretrainedConfig\n    framework: str\n\n    def __init__(self, args: BenchmarkArguments = None, configs: PretrainedConfig = None):\n        self.args = args\n\n        if configs is None:\n            self.config_dict = {\n                model_name: AutoConfig.from_pretrained(model_name) for model_name in self.args.model_names\n            }\n        else:\n            self.config_dict = {model_name: config for model_name, config in zip(self.args.model_names, configs)}\n\n        self._print_fn = None\n        self._framework_version = None\n        self._environment_info = None\n\n    @property\n    def print_fn(self):\n        if self._print_fn is None:\n            if self.args.log_print:\n                logging.basicConfig(\n                    level=logging.DEBUG,\n                    filename=self.args.log_filename,\n                    filemode=\"a+\",\n                    format=\"%(asctime)-15s %(levelname)-8s %(message)s\",\n                )\n\n                def print_and_log(*args):\n                    logging.info(*args)\n                    print(*args)\n\n                self._print_fn = print_and_log\n            else:\n                self._print_fn = print\n        return self._print_fn\n\n    @property\n    def is_gpu(self):\n        return self.args.n_gpu > 0\n\n    @property\n    @abstractmethod\n    def framework_version(self):\n        pass\n\n    @abstractmethod\n    def train(self, model_name, batch_size, sequence_length):\n        pass\n\n    @abstractmethod\n    def inference(self, model_name, batch_size, sequence_length):\n        pass\n\n    def run(self):\n        result_dict = {model_name: {} for model_name in self.args.model_names}\n        inference_result_time = copy.deepcopy(result_dict)\n        inference_result_memory = copy.deepcopy(result_dict)\n        train_result_time = copy.deepcopy(result_dict)\n        train_result_memory = copy.deepcopy(result_dict)\n\n        for c, 
model_name in enumerate(self.args.model_names):\n            self.print_fn(f\"{c + 1} / {len(self.args.model_names)}\")\n\n            model_dict = {\n                \"bs\": self.args.batch_sizes,\n                \"ss\": self.args.sequence_lengths,\n                \"result\": {i: {} for i in self.args.batch_sizes},\n            }\n            inference_result_time[model_name] = copy.deepcopy(model_dict)\n            inference_result_memory[model_name] = copy.deepcopy(model_dict)\n            train_result_time[model_name] = copy.deepcopy(model_dict)\n            train_result_memory[model_name] = copy.deepcopy(model_dict)\n\n            for batch_size in self.args.batch_sizes:\n                for sequence_length in self.args.sequence_lengths:\n                    if not self.args.no_inference:\n                        if not self.args.no_memory:\n                            memory = self.inference(model_name, batch_size, sequence_length, trace_memory=True)\n                            inference_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            inference_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n                    if self.args.training:\n                        if not self.args.no_memory:\n                            memory = self.train(model_name, batch_size, sequence_length, trace_memory=True)\n                            train_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.train(model_name, batch_size, sequence_length, trace_memory=False)\n                            train_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n        if not self.args.no_inference:\n            if not self.args.no_speed:\n                self.print_fn(\"======= INFERENCE - SPEED - RESULT =======\")\n                self.print_results(inference_result_time)\n                self.save_to_csv(inference_result_time, self.args.inference_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= INFERENCE - MEMORY - RESULT =======\")\n                self.print_results(inference_result_memory)\n                self.save_to_csv(inference_result_memory, self.args.inference_memory_csv_file)\n\n        if self.args.training:\n            if not self.args.no_speed:\n                self.print_fn(\"======= TRAIN - SPEED - RESULT =======\")\n                self.print_results(train_result_time)\n                self.save_to_csv(train_result_time, self.args.train_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= TRAIN - MEMORY - RESULT =======\")\n                self.print_results(train_result_memory)\n                self.save_to_csv(train_result_memory, self.args.train_memory_csv_file)\n\n        if not self.args.no_env_print:\n            self.print_fn(\"\\n======== ENVIRONMENT - INFORMATION ========\")\n            self.print_fn(\n                \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in self.environment_info.items()]) + \"\\n\"\n            )\n\n        if self.args.save_to_csv:\n            with open(self.args.env_info_csv_file, mode=\"w\", newline=\"\") as csv_file:\n                writer = 
csv.writer(csv_file)\n                for key, value in self.environment_info.items():\n                    writer.writerow([key, value])\n\n        return BenchmarkOutput(inference_result_time, inference_result_memory, train_result_time, train_result_memory)\n\n    @property\n    def environment_info(self):\n        if self._environment_info is None:\n            info = {}\n            info[\"transformers_version\"] = version\n            info[\"framework\"] = self.framework\n            info[\"framework_version\"] = self.framework_version\n            info[\"python_version\"] = platform.python_version()\n            info[\"system\"] = platform.system()\n            info[\"cpu\"] = platform.processor()\n            info[\"architecture\"] = platform.architecture()[0]\n            info[\"date\"] = datetime.date(datetime.now())\n            info[\"time\"] = datetime.time(datetime.now())\n\n            try:\n                import psutil\n            except (ImportError):\n                logger.warning(\n                    \"Psutil not installed, we won't log available CPU memory.\"\n                    \"Install psutil (pip install psutil) to log available CPU memory.\"\n                )\n                info[\"cpu_ram_mb\"] = \"N/A\"\n            else:\n                info[\"cpu_ram_mb\"] = bytes_to_mega_bytes(psutil.virtual_memory().total)\n\n            info[\"use_gpu\"] = self.is_gpu\n            if self.is_gpu:\n                info[\"num_gpus\"] = self.args.n_gpu\n                try:\n                    from py3nvml import py3nvml\n\n                    py3nvml.nvmlInit()\n                    handle = py3nvml.nvmlDeviceGetHandleByIndex(self.args.device_idx)\n                except ImportError:\n                    logger.warning(\n                        \"py3nvml not installed, we won't log GPU memory usage. \"\n                        \"Install py3nvml (pip install py3nvml) to log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                except (OSError, py3nvml.NVMLError):\n                    logger.warning(\n                        \"Error while initializing comunication with GPU. 
\" \"We won't log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                    py3nvml.nvmlShutdown()\n                else:\n                    info[\"gpu\"] = py3nvml.nvmlDeviceGetName(handle)\n                    info[\"gpu_ram_mb\"] = bytes_to_mega_bytes(py3nvml.nvmlDeviceGetMemoryInfo(handle).total)\n                    info[\"gpu_power_watts\"] = py3nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000\n                    info[\"gpu_performance_state\"] = py3nvml.nvmlDeviceGetPerformanceState(handle)\n                    py3nvml.nvmlShutdown()\n\n            self._environment_info = info\n        return self._environment_info\n\n    def print_results(self, result_dict):\n        for model_name in self.args.model_names:\n            self.print_fn(\"\\t\" + f\"======= MODEL CHECKPOINT: {model_name} =======\")\n            for batch_size in result_dict[model_name][\"bs\"]:\n                for sequence_length in result_dict[model_name][\"ss\"]:\n                    result = result_dict[model_name][\"result\"][batch_size][sequence_length]\n                    if isinstance(result, float):\n                        self.print_fn(\n                            f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{(round(1000 * result) / 1000)}s\"\n                        )\n                    else:\n                        self.print_fn(f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{result} MB\")\n\n    def print_memory_trace_statistics(self, summary: MemorySummary):\n        self.print_fn(\n            \"\\nLine by line memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.sequential\n            )\n        )\n        self.print_fn(\n            \"\\nLines with top memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[:6]\n            )\n        )\n        self.print_fn(\n            \"\\nLines with lowest memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[-6:]\n            )\n        )\n        self.print_fn(f\"\\nTotal memory increase: {summary.total}\")\n\n    def save_to_csv(self, result_dict, filename):\n        if not self.args.save_to_csv:\n            return\n        self.print_fn(\"Saving results to csv.\")\n        with open(filename, mode=\"w\") as csv_file:\n\n            assert len(self.args.model_names) > 0, \"At least 1 model should be defined, but got {}\".format(\n                self.model_names\n            )\n\n            fieldnames = [\"model\", \"batch_size\", \"sequence_length\"]\n            writer = csv.DictWriter(csv_file, fieldnames=fieldnames + [\"result\"])\n            writer.writeheader()\n\n            for model_name in self.args.model_names:\n                result_dict_model = result_dict[model_name][\"result\"]\n                for bs in result_dict_model:\n                    for ss in result_dict_model[bs]:\n               
         result_model = result_dict_model[bs][ss]\n                        writer.writerow(\n                            {\n                                \"model\": model_name,\n                                \"batch_size\": bs,\n                                \"sequence_length\": ss,\n                                \"result\": (\"{}\" if not isinstance(result_model, float) else \"{:.4f}\").format(\n                                    result_model\n                                ),\n                            }\n                        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport linecache\nimport logging\nimport os\nimport sys\nfrom collections import defaultdict\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom .file_utils import is_tf_available, is_torch_available\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable string of the number of bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return bytes_to_human_readable(self.bytes)\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest 
memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. 
\"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. \" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n            for i in devices:\n                
handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - `memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n        for (frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = 
next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n            cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n        total_memory = Memory(total_memory)\n        return MemorySummary(sequential=memory_diff_trace, cumulative=cumulative_memory, total=total_memory)\n\n    return None\n\n\ndef bytes_to_human_readable(memory_amount: int) -> str:\n    \"\"\" Utility to convert a number of bytes (int) in a human readable string (with units)\n    \"\"\"\n    for unit in [\"B\", \"KB\", \"MB\", \"GB\"]:\n        if memory_amount > -1024.0 and memory_amount < 1024.0:\n            return \"{:.3f}{}\".format(memory_amount, unit)\n        memory_amount /= 1024.0\n    return \"{:.3f}TB\".format(memory_amount)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/__init__.py",
    "content": "from abc import ABC, abstractmethod\nfrom argparse import ArgumentParser\n\n\nclass BaseTransformersCLICommand(ABC):\n    @staticmethod\n    @abstractmethod\n    def register_subcommand(parser: ArgumentParser):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def run(self):\n        raise NotImplementedError()\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/convert.py",
    "content": "from argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef convert_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to convert a model TF 1.0 checkpoint in a PyTorch checkpoint.\n    :return: ServeCommand\n    \"\"\"\n    return ConvertCommand(\n        args.model_type, args.tf_checkpoint, args.pytorch_dump_output, args.config, args.finetuning_task_name\n    )\n\n\nclass ConvertCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\n            \"convert\",\n            help=\"CLI tool to run convert model from original \"\n            \"author checkpoints to Transformers PyTorch checkpoints.\",\n        )\n        train_parser.add_argument(\"--model_type\", type=str, required=True, help=\"Model's type.\")\n        train_parser.add_argument(\n            \"--tf_checkpoint\", type=str, required=True, help=\"TensorFlow checkpoint path or folder.\"\n        )\n        train_parser.add_argument(\n            \"--pytorch_dump_output\", type=str, required=True, help=\"Path to the PyTorch savd model output.\"\n        )\n        train_parser.add_argument(\"--config\", type=str, default=\"\", help=\"Configuration file path or folder.\")\n        train_parser.add_argument(\n            \"--finetuning_task_name\",\n            type=str,\n            default=None,\n            help=\"Optional fine-tuning task name if the TF model was a finetuned model.\",\n        )\n        train_parser.set_defaults(func=convert_command_factory)\n\n    def __init__(\n        self,\n        model_type: str,\n        tf_checkpoint: str,\n        pytorch_dump_output: str,\n        config: str,\n        finetuning_task_name: str,\n        *args\n    ):\n        self._logger = getLogger(\"transformers1-cli/converting\")\n\n        self._logger.info(\"Loading model {}\".format(model_type))\n        self._model_type = model_type\n        self._tf_checkpoint = tf_checkpoint\n        self._pytorch_dump_output = pytorch_dump_output\n        self._config = config\n        self._finetuning_task_name = finetuning_task_name\n\n    def run(self):\n        if self._model_type == \"albert\":\n            try:\n                from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"bert\":\n            try:\n                from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"gpt\":\n            from transformers.convert_openai_original_tf_checkpoint_to_pytorch import (\n                convert_openai_checkpoint_to_pytorch,\n            )\n\n            convert_openai_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"transfo_xl\":\n            try:\n                from transformers.convert_transfo_xl_original_tf_checkpoint_to_pytorch import (\n                    convert_transfo_xl_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            if \"ckpt\" in self._tf_checkpoint.lower():\n                TF_CHECKPOINT = self._tf_checkpoint\n                TF_DATASET_FILE = \"\"\n            else:\n                TF_DATASET_FILE = self._tf_checkpoint\n                TF_CHECKPOINT = \"\"\n            convert_transfo_xl_checkpoint_to_pytorch(\n                TF_CHECKPOINT, self._config, self._pytorch_dump_output, TF_DATASET_FILE\n            )\n        elif self._model_type == \"gpt2\":\n            try:\n                from transformers.convert_gpt2_original_tf_checkpoint_to_pytorch import (\n                    convert_gpt2_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_gpt2_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"xlnet\":\n            try:\n                from transformers.convert_xlnet_original_tf_checkpoint_to_pytorch import (\n                    convert_xlnet_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_xlnet_checkpoint_to_pytorch(\n                self._tf_checkpoint, self._config, self._pytorch_dump_output, self._finetuning_task_name\n            )\n        elif self._model_type == \"xlm\":\n            from transformers.convert_xlm_original_pytorch_checkpoint_to_pytorch import (\n                convert_xlm_checkpoint_to_pytorch,\n            )\n\n            convert_xlm_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)\n        else:\n            raise ValueError(\"--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm]\")\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/download.py",
    "content": "from argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef download_command_factory(args):\n    return DownloadCommand(args.model, args.cache_dir, args.force)\n\n\nclass DownloadCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"download\")\n        download_parser.add_argument(\n            \"--cache-dir\", type=str, default=None, help=\"Path to location to store the models\"\n        )\n        download_parser.add_argument(\n            \"--force\", action=\"store_true\", help=\"Force the model to be download even if already in cache-dir\"\n        )\n        download_parser.add_argument(\"model\", type=str, help=\"Name of the model to download\")\n        download_parser.set_defaults(func=download_command_factory)\n\n    def __init__(self, model: str, cache: str, force: bool):\n        self._model = model\n        self._cache = cache\n        self._force = force\n\n    def run(self):\n        from transformers import AutoModel, AutoTokenizer\n\n        AutoModel.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n        AutoTokenizer.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/env.py",
    "content": "import platform\nfrom argparse import ArgumentParser\n\nfrom transformers import __version__ as version\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef info_command_factory(_):\n    return EnvironmentCommand()\n\n\nclass EnvironmentCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"env\")\n        download_parser.set_defaults(func=info_command_factory)\n\n    def run(self):\n        pt_version = \"not installed\"\n        pt_cuda_available = \"NA\"\n        if is_torch_available():\n            import torch\n\n            pt_version = torch.__version__\n            pt_cuda_available = torch.cuda.is_available()\n\n        tf_version = \"not installed\"\n        tf_cuda_available = \"NA\"\n        if is_tf_available():\n            import tensorflow as tf\n\n            tf_version = tf.__version__\n            try:\n                # deprecated in v2.1\n                tf_cuda_available = tf.test.is_gpu_available()\n            except AttributeError:\n                # returns list of devices, convert to bool\n                tf_cuda_available = bool(tf.config.list_physical_devices(\"GPU\"))\n\n        info = {\n            \"`transformers1` version\": version,\n            \"Platform\": platform.platform(),\n            \"Python version\": platform.python_version(),\n            \"PyTorch version (GPU?)\": \"{} ({})\".format(pt_version, pt_cuda_available),\n            \"Tensorflow version (GPU?)\": \"{} ({})\".format(tf_version, tf_cuda_available),\n            \"Using GPU in script?\": \"<fill in>\",\n            \"Using distributed or parallel set-up in script?\": \"<fill in>\",\n        }\n\n        print(\"\\nCopy-and-paste the text below in your GitHub issue and FILL OUT the two last points.\\n\")\n        print(self.format_dict(info))\n\n        return info\n\n    @staticmethod\n    def format_dict(d):\n        return \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in d.items()]) + \"\\n\"\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/run.py",
    "content": "import logging\nfrom argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, Pipeline, PipelineDataFormat, pipeline\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\ndef try_infer_format_from_ext(path: str):\n    if not path:\n        return \"pipe\"\n\n    for ext in PipelineDataFormat.SUPPORTED_FORMATS:\n        if path.endswith(ext):\n            return ext\n\n    raise Exception(\n        \"Unable to determine file format from file extension {}. \"\n        \"Please provide the format through --format {}\".format(path, PipelineDataFormat.SUPPORTED_FORMATS)\n    )\n\n\ndef run_command_factory(args):\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    format = try_infer_format_from_ext(args.input) if args.format == \"infer\" else args.format\n    reader = PipelineDataFormat.from_str(\n        format=format,\n        output_path=args.output,\n        input_path=args.input,\n        column=args.column if args.column else nlp.default_input_names,\n        overwrite=args.overwrite,\n    )\n    return RunCommand(nlp, reader)\n\n\nclass RunCommand(BaseTransformersCLICommand):\n    def __init__(self, nlp: Pipeline, reader: PipelineDataFormat):\n        self._nlp = nlp\n        self._reader = reader\n\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        run_parser = parser.add_parser(\"run\", help=\"Run a pipeline through the CLI\")\n        run_parser.add_argument(\"--task\", choices=SUPPORTED_TASKS.keys(), help=\"Task to run\")\n        run_parser.add_argument(\"--input\", type=str, help=\"Path to the file to use for inference\")\n        run_parser.add_argument(\"--output\", type=str, help=\"Path to the file that will be used post to write results.\")\n        run_parser.add_argument(\"--model\", type=str, help=\"Name or path to the model to instantiate.\")\n        run_parser.add_argument(\"--config\", type=str, help=\"Name or path to the model's config to instantiate.\")\n        run_parser.add_argument(\n            \"--tokenizer\", type=str, help=\"Name of the tokenizer to use. (default: same as the model name)\"\n        )\n        run_parser.add_argument(\n            \"--column\",\n            type=str,\n            help=\"Name of the column to use as input. 
(For multi columns input as QA use column1,columns2)\",\n        )\n        run_parser.add_argument(\n            \"--format\",\n            type=str,\n            default=\"infer\",\n            choices=PipelineDataFormat.SUPPORTED_FORMATS,\n            help=\"Input format to read from\",\n        )\n        run_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        run_parser.add_argument(\"--overwrite\", action=\"store_true\", help=\"Allow overwriting the output file.\")\n        run_parser.set_defaults(func=run_command_factory)\n\n    def run(self):\n        nlp, outputs = self._nlp, []\n\n        for entry in self._reader:\n            output = nlp(**entry) if self._reader.is_multi_columns else nlp(entry)\n            if isinstance(output, dict):\n                outputs.append(output)\n            else:\n                outputs += output\n\n        # Saving data\n        if self._nlp.binary_output:\n            binary_path = self._reader.save_binary(outputs)\n            logger.warning(\"Current pipeline requires output to be in binary format, saving at {}\".format(binary_path))\n        else:\n            self._reader.save(outputs)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/serving.py",
    "content": "import logging\nfrom argparse import ArgumentParser, Namespace\nfrom typing import Any, List, Optional\n\nfrom transformers import Pipeline\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, pipeline\n\n\ntry:\n    from uvicorn import run\n    from fastapi import FastAPI, HTTPException, Body\n    from fastapi.routing import APIRoute\n    from pydantic import BaseModel\n    from starlette.responses import JSONResponse\n\n    _serve_dependencies_installed = True\nexcept (ImportError, AttributeError):\n    BaseModel = object\n\n    def Body(*x, **y):\n        pass\n\n    _serve_dependencies_installed = False\n\n\nlogger = logging.getLogger(\"transformers1-cli/serving\")\n\n\ndef serve_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    return ServeCommand(nlp, args.host, args.port, args.workers)\n\n\nclass ServeModelInfoResult(BaseModel):\n    \"\"\"\n    Expose model information\n    \"\"\"\n\n    infos: dict\n\n\nclass ServeTokenizeResult(BaseModel):\n    \"\"\"\n    Tokenize result model\n    \"\"\"\n\n    tokens: List[str]\n    tokens_ids: Optional[List[int]]\n\n\nclass ServeDeTokenizeResult(BaseModel):\n    \"\"\"\n    DeTokenize result model\n    \"\"\"\n\n    text: str\n\n\nclass ServeForwardResult(BaseModel):\n    \"\"\"\n    Forward result model\n    \"\"\"\n\n    output: Any\n\n\nclass ServeCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        serve_parser = parser.add_parser(\n            \"serve\", help=\"CLI tool to run inference requests through REST and GraphQL endpoints.\"\n        )\n        serve_parser.add_argument(\n            \"--task\", type=str, choices=SUPPORTED_TASKS.keys(), help=\"The task to run the pipeline on\"\n        )\n        serve_parser.add_argument(\"--host\", type=str, default=\"localhost\", help=\"Interface the server will listen on.\")\n        serve_parser.add_argument(\"--port\", type=int, default=8888, help=\"Port the serving will listen to.\")\n        serve_parser.add_argument(\"--workers\", type=int, default=1, help=\"Number of http workers\")\n        serve_parser.add_argument(\"--model\", type=str, help=\"Model's name or path to stored model.\")\n        serve_parser.add_argument(\"--config\", type=str, help=\"Model's config name or path to stored model.\")\n        serve_parser.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer name to use.\")\n        serve_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        serve_parser.set_defaults(func=serve_command_factory)\n\n    def __init__(self, pipeline: Pipeline, host: str, port: int, workers: int):\n\n        self._pipeline = pipeline\n\n        self.host = host\n        self.port = port\n        self.workers = workers\n\n        if not 
_serve_dependencies_installed:\n            raise RuntimeError(\n                \"Using serve command requires FastAPI and unicorn. \"\n                'Please install transformers1 with [serving]: pip install \"transformers1[serving]\".'\n                \"Or install FastAPI and unicorn separately.\"\n            )\n        else:\n            logger.info(\"Serving model over {}:{}\".format(host, port))\n            self._app = FastAPI(\n                routes=[\n                    APIRoute(\n                        \"/\",\n                        self.model_info,\n                        response_model=ServeModelInfoResult,\n                        response_class=JSONResponse,\n                        methods=[\"GET\"],\n                    ),\n                    APIRoute(\n                        \"/tokenize\",\n                        self.tokenize,\n                        response_model=ServeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/detokenize\",\n                        self.detokenize,\n                        response_model=ServeDeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/forward\",\n                        self.forward,\n                        response_model=ServeForwardResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                ],\n                timeout=600,\n            )\n\n    def run(self):\n        run(self._app, host=self.host, port=self.port, workers=self.workers)\n\n    def model_info(self):\n        return ServeModelInfoResult(infos=vars(self._pipeline.model.config))\n\n    def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):\n        \"\"\"\n        Tokenize the provided input and eventually returns corresponding tokens id:\n        - **text_input**: String to tokenize\n        - **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer mapping.\n        \"\"\"\n        try:\n            tokens_txt = self._pipeline.tokenizer.tokenize(text_input)\n\n            if return_ids:\n                tokens_ids = self._pipeline.tokenizer.convert_tokens_to_ids(tokens_txt)\n                return ServeTokenizeResult(tokens=tokens_txt, tokens_ids=tokens_ids)\n            else:\n                return ServeTokenizeResult(tokens=tokens_txt)\n\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    def detokenize(\n        self,\n        tokens_ids: List[int] = Body(None, embed=True),\n        skip_special_tokens: bool = Body(False, embed=True),\n        cleanup_tokenization_spaces: bool = Body(True, embed=True),\n    ):\n        \"\"\"\n        Detokenize the provided tokens ids to readable text:\n        - **tokens_ids**: List of tokens ids\n        - **skip_special_tokens**: Flag indicating to not try to decode special tokens\n        - **cleanup_tokenization_spaces**: Flag indicating to remove all leading/trailing spaces and intermediate ones.\n        \"\"\"\n        try:\n            decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)\n            return 
ServeDeTokenizeResult(model=\"\", text=decoded_str)\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    async def forward(self, inputs=Body(None, embed=True)):\n        \"\"\"\n        **inputs**:\n        **attention_mask**:\n        **tokens_type_ids**:\n        \"\"\"\n\n        # Check we don't have empty string\n        if len(inputs) == 0:\n            return ServeForwardResult(output=[], attention=[])\n\n        try:\n            # Forward through the model\n            output = self._pipeline(inputs)\n            return ServeForwardResult(output=output)\n        except Exception as e:\n            raise HTTPException(500, {\"error\": str(e)})\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/train.py",
    "content": "import os\nfrom argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers import SingleSentenceClassificationProcessor as Processor\nfrom transformers import TextClassificationPipeline, is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\nif not is_tf_available() and not is_torch_available():\n    raise RuntimeError(\"At least one of PyTorch or TensorFlow 2.0+ should be installed to use CLI training\")\n\n# TF training parameters\nUSE_XLA = False\nUSE_AMP = False\n\n\ndef train_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    return TrainCommand(args)\n\n\nclass TrainCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\"train\", help=\"CLI tool to train a model on a task.\")\n\n        train_parser.add_argument(\n            \"--train_data\",\n            type=str,\n            required=True,\n            help=\"path to train (and optionally evaluation) dataset as a csv with \"\n            \"tab separated labels and sentences.\",\n        )\n        train_parser.add_argument(\n            \"--column_label\", type=int, default=0, help=\"Column of the dataset csv file with example labels.\"\n        )\n        train_parser.add_argument(\n            \"--column_text\", type=int, default=1, help=\"Column of the dataset csv file with example texts.\"\n        )\n        train_parser.add_argument(\n            \"--column_id\", type=int, default=2, help=\"Column of the dataset csv file with example ids.\"\n        )\n        train_parser.add_argument(\n            \"--skip_first_row\", action=\"store_true\", help=\"Skip the first row of the csv file (headers).\"\n        )\n\n        train_parser.add_argument(\"--validation_data\", type=str, default=\"\", help=\"path to validation dataset.\")\n        train_parser.add_argument(\n            \"--validation_split\",\n            type=float,\n            default=0.1,\n            help=\"if validation dataset is not provided, fraction of train dataset \" \"to use as validation dataset.\",\n        )\n\n        train_parser.add_argument(\"--output\", type=str, default=\"./\", help=\"path to saved the trained model.\")\n\n        train_parser.add_argument(\n            \"--task\", type=str, default=\"text_classification\", help=\"Task to train the model on.\"\n        )\n        train_parser.add_argument(\n            \"--model\", type=str, default=\"bert-base-uncased\", help=\"Model's name or path to stored model.\"\n        )\n        train_parser.add_argument(\"--train_batch_size\", type=int, default=32, help=\"Batch size for training.\")\n        train_parser.add_argument(\"--valid_batch_size\", type=int, default=64, help=\"Batch size for validation.\")\n        train_parser.add_argument(\"--learning_rate\", type=float, default=3e-5, help=\"Learning rate.\")\n        train_parser.add_argument(\"--adam_epsilon\", type=float, default=1e-08, help=\"Epsilon for Adam optimizer.\")\n        train_parser.set_defaults(func=train_command_factory)\n\n    def __init__(self, args: Namespace):\n        self.logger = 
getLogger(\"transformers1-cli/training\")\n\n        self.framework = \"tf\" if is_tf_available() else \"torch\"\n\n        os.makedirs(args.output, exist_ok=True)\n        assert os.path.isdir(args.output)\n        self.output = args.output\n\n        self.column_label = args.column_label\n        self.column_text = args.column_text\n        self.column_id = args.column_id\n\n        self.logger.info(\"Loading {} pipeline for {}\".format(args.task, args.model))\n        if args.task == \"text_classification\":\n            self.pipeline = TextClassificationPipeline.from_pretrained(args.model)\n        elif args.task == \"token_classification\":\n            raise NotImplementedError\n        elif args.task == \"question_answering\":\n            raise NotImplementedError\n\n        self.logger.info(\"Loading dataset from {}\".format(args.train_data))\n        self.train_dataset = Processor.create_from_csv(\n            args.train_data,\n            column_label=args.column_label,\n            column_text=args.column_text,\n            column_id=args.column_id,\n            skip_first_row=args.skip_first_row,\n        )\n        self.valid_dataset = None\n        if args.validation_data:\n            self.logger.info(\"Loading validation dataset from {}\".format(args.validation_data))\n            self.valid_dataset = Processor.create_from_csv(\n                args.validation_data,\n                column_label=args.column_label,\n                column_text=args.column_text,\n                column_id=args.column_id,\n                skip_first_row=args.skip_first_row,\n            )\n\n        self.validation_split = args.validation_split\n        self.train_batch_size = args.train_batch_size\n        self.valid_batch_size = args.valid_batch_size\n        self.learning_rate = args.learning_rate\n        self.adam_epsilon = args.adam_epsilon\n\n    def run(self):\n        if self.framework == \"tf\":\n            return self.run_tf()\n        return self.run_torch()\n\n    def run_torch(self):\n        raise NotImplementedError\n\n    def run_tf(self):\n        self.pipeline.fit(\n            self.train_dataset,\n            validation_data=self.valid_dataset,\n            validation_split=self.validation_split,\n            learning_rate=self.learning_rate,\n            adam_epsilon=self.adam_epsilon,\n            train_batch_size=self.train_batch_size,\n            valid_batch_size=self.valid_batch_size,\n        )\n\n        # Save trained pipeline\n        self.pipeline.save_pretrained(self.output)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/transformers_cli.py",
    "content": "#!/usr/bin/env python\nfrom argparse import ArgumentParser\n\nfrom transformers.commands.convert import ConvertCommand\nfrom transformers.commands.download import DownloadCommand\nfrom transformers.commands.env import EnvironmentCommand\nfrom transformers.commands.run import RunCommand\nfrom transformers.commands.serving import ServeCommand\nfrom transformers.commands.user import UserCommands\n\n\ndef main():\n    parser = ArgumentParser(\"Transformers CLI tool\", usage=\"transformers1-cli <command> [<args>]\")\n    commands_parser = parser.add_subparsers(help=\"transformers1-cli command helpers\")\n\n    # Register commands\n    ConvertCommand.register_subcommand(commands_parser)\n    DownloadCommand.register_subcommand(commands_parser)\n    EnvironmentCommand.register_subcommand(commands_parser)\n    RunCommand.register_subcommand(commands_parser)\n    ServeCommand.register_subcommand(commands_parser)\n    UserCommands.register_subcommand(commands_parser)\n\n    # Let's go\n    args = parser.parse_args()\n\n    if not hasattr(args, \"func\"):\n        parser.print_help()\n        exit(1)\n\n    # Run\n    service = args.func(args)\n    service.run()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/commands/user.py",
    "content": "import os\nimport sys\nfrom argparse import ArgumentParser\nfrom getpass import getpass\nfrom typing import List, Union\n\nfrom requests.exceptions import HTTPError\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.hf_api import HfApi, HfFolder\n\n\nUPLOAD_MAX_FILES = 15\n\n\nclass UserCommands(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        login_parser = parser.add_parser(\"login\", help=\"Log in using the same credentials as on huggingface.co\")\n        login_parser.set_defaults(func=lambda args: LoginCommand(args))\n        whoami_parser = parser.add_parser(\"whoami\", help=\"Find out which huggingface.co account you are logged in as.\")\n        whoami_parser.set_defaults(func=lambda args: WhoamiCommand(args))\n        logout_parser = parser.add_parser(\"logout\", help=\"Log out\")\n        logout_parser.set_defaults(func=lambda args: LogoutCommand(args))\n        # s3\n        s3_parser = parser.add_parser(\"s3\", help=\"{ls, rm} Commands to interact with the files you upload on S3.\")\n        s3_subparsers = s3_parser.add_subparsers(help=\"s3 related commands\")\n        ls_parser = s3_subparsers.add_parser(\"ls\")\n        ls_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        ls_parser.set_defaults(func=lambda args: ListObjsCommand(args))\n        rm_parser = s3_subparsers.add_parser(\"rm\")\n        rm_parser.add_argument(\"filename\", type=str, help=\"individual object filename to delete from S3.\")\n        rm_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        rm_parser.set_defaults(func=lambda args: DeleteObjCommand(args))\n        # upload\n        upload_parser = parser.add_parser(\"upload\", help=\"Upload a model to S3.\")\n        upload_parser.add_argument(\n            \"path\", type=str, help=\"Local path of the model folder or individual file to upload.\"\n        )\n        upload_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        upload_parser.add_argument(\n            \"--filename\", type=str, default=None, help=\"Optional: override individual object filename on S3.\"\n        )\n        upload_parser.set_defaults(func=lambda args: UploadCommand(args))\n\n\nclass ANSI:\n    \"\"\"\n    Helper for en.wikipedia.org/wiki/ANSI_escape_code\n    \"\"\"\n\n    _bold = \"\\u001b[1m\"\n    _red = \"\\u001b[31m\"\n    _reset = \"\\u001b[0m\"\n\n    @classmethod\n    def bold(cls, s):\n        return \"{}{}{}\".format(cls._bold, s, cls._reset)\n\n    @classmethod\n    def red(cls, s):\n        return \"{}{}{}\".format(cls._bold + cls._red, s, cls._reset)\n\n\nclass BaseUserCommand:\n    def __init__(self, args):\n        self.args = args\n        self._api = HfApi()\n\n\nclass LoginCommand(BaseUserCommand):\n    def run(self):\n        print(\n            \"\"\"\n        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|\n        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|\n        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|\n        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|\n        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|  
      _|    _|    _|_|_|  _|_|_|_|\n\n        \"\"\"\n        )\n        username = input(\"Username: \")\n        password = getpass()\n        try:\n            token = self._api.login(username, password)\n        except HTTPError as e:\n            # probably invalid credentials, display error message.\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        HfFolder.save_token(token)\n        print(\"Login successful\")\n        print(\"Your token:\", token, \"\\n\")\n        print(\"Your token has been saved to\", HfFolder.path_token)\n\n\nclass WhoamiCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        try:\n            user, orgs = self._api.whoami(token)\n            print(user)\n            if orgs:\n                print(ANSI.bold(\"orgs: \"), \",\".join(orgs))\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n\n\nclass LogoutCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        HfFolder.delete_token()\n        self._api.logout(token)\n        print(\"Successfully logged out.\")\n\n\nclass ListObjsCommand(BaseUserCommand):\n    def tabulate(self, rows: List[List[Union[str, int]]], headers: List[str]) -> str:\n        \"\"\"\n        Inspired by:\n        stackoverflow.com/a/8356620/593036\n        stackoverflow.com/questions/9535954/printing-lists-as-tabular-data\n        \"\"\"\n        col_widths = [max(len(str(x)) for x in col) for col in zip(*rows, headers)]\n        row_format = (\"{{:{}}} \" * len(headers)).format(*col_widths)\n        lines = []\n        lines.append(row_format.format(*headers))\n        lines.append(row_format.format(*[\"-\" * w for w in col_widths]))\n        for row in rows:\n            lines.append(row_format.format(*row))\n        return \"\\n\".join(lines)\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            objs = self._api.list_objs(token, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        if len(objs) == 0:\n            print(\"No shared file yet\")\n            exit()\n        rows = [[obj.filename, obj.LastModified, obj.ETag, obj.Size] for obj in objs]\n        print(self.tabulate(rows, headers=[\"Filename\", \"LastModified\", \"ETag\", \"Size\"]))\n\n\nclass DeleteObjCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            self._api.delete_obj(token, filename=self.args.filename, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        print(\"Done\")\n\n\nclass UploadCommand(BaseUserCommand):\n    def walk_dir(self, rel_path):\n        \"\"\"\n        Recursively list all files in a folder.\n        \"\"\"\n        entries: List[os.DirEntry] = list(os.scandir(rel_path))\n        files = [(os.path.join(os.getcwd(), f.path), f.path) for f in entries if f.is_file()]  # (filepath, filename)\n        for f in 
entries:\n            if f.is_dir():\n                files += self.walk_dir(f.path)\n        return files\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        local_path = os.path.abspath(self.args.path)\n        if os.path.isdir(local_path):\n            if self.args.filename is not None:\n                raise ValueError(\"Cannot specify a filename override when uploading a folder.\")\n            rel_path = os.path.basename(local_path)\n            files = self.walk_dir(rel_path)\n        elif os.path.isfile(local_path):\n            filename = self.args.filename if self.args.filename is not None else os.path.basename(local_path)\n            files = [(local_path, filename)]\n        else:\n            raise ValueError(\"Not a valid file or directory: {}\".format(local_path))\n\n        if sys.platform == \"win32\":\n            files = [(filepath, filename.replace(os.sep, \"/\")) for filepath, filename in files]\n\n        if len(files) > UPLOAD_MAX_FILES:\n            print(\n                \"About to upload {} files to S3. This is probably wrong. Please filter files before uploading.\".format(\n                    ANSI.bold(len(files))\n                )\n            )\n            exit(1)\n\n        user, _ = self._api.whoami(token)\n        namespace = self.args.organization if self.args.organization is not None else user\n\n        for filepath, filename in files:\n            print(\n                \"About to upload file {} to S3 under filename {} and namespace {}\".format(\n                    ANSI.bold(filepath), ANSI.bold(filename), ANSI.bold(namespace)\n                )\n            )\n\n        choice = input(\"Proceed? [Y/n] \").lower()\n        if not (choice == \"\" or choice == \"y\" or choice == \"yes\"):\n            print(\"Abort\")\n            exit()\n        print(ANSI.bold(\"Uploading... This might take a while if files are large\"))\n        for filepath, filename in files:\n            try:\n                access_url = self._api.presign_and_upload(\n                    token=token, filename=filename, filepath=filepath, organization=self.args.organization\n                )\n            except HTTPError as e:\n                print(e)\n                print(ANSI.red(e.response.text))\n                exit(1)\n            print(\"Your file now lives at:\")\n            print(access_url)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ALBERT model configuration \"\"\"\n\nfrom .configuration_utils import PretrainedConfig\n\n\nALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-config.json\",\n    \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-config.json\",\n    \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-config.json\",\n    \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-config.json\",\n    \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-config.json\",\n    \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-config.json\",\n    \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-config.json\",\n    \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-config.json\",\n}\n\n\nclass AlbertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers1 import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"albert\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Config class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(\n    (key, value)\n    for pretrained_map in [\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        BART_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ]\n    for key, value, in pretrained_map.items()\n)\n\n\nCONFIG_MAPPING = OrderedDict(\n    [\n        (\"t5\", T5Config,),\n        (\"distilbert\", 
DistilBertConfig,),\n        (\"albert\", AlbertConfig,),\n        (\"camembert\", CamembertConfig,),\n        (\"xlm-roberta\", XLMRobertaConfig,),\n        (\"marian\", MarianConfig,),\n        (\"bart\", BartConfig,),\n        (\"reformer\", ReformerConfig,),\n        (\"longformer\", LongformerConfig,),\n        (\"roberta\", RobertaConfig,),\n        (\"flaubert\", FlaubertConfig,),\n        (\"bert\", BertConfig,),\n        (\"openai-gpt\", OpenAIGPTConfig,),\n        (\"gpt2\", GPT2Config,),\n        (\"transfo-xl\", TransfoXLConfig,),\n        (\"xlnet\", XLNetConfig,),\n        (\"xlm\", XLMConfig,),\n        (\"ctrl\", CTRLConfig,),\n        (\"electra\", ElectraConfig,),\n        (\"encoder-decoder\", EncoderDecoderConfig,),\n    ]\n)\n\n\nclass AutoConfig:\n    r\"\"\"\n        :class:`~transformers1.AutoConfig` is a generic configuration class\n        that will be instantiated as one of the configuration classes of the library\n        when created with the :func:`~transformers1.AutoConfig.from_pretrained` class method.\n\n        The :func:`~transformers1.AutoConfig.from_pretrained` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string.\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoConfig is designed to be instantiated \"\n            \"using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def for_model(cls, model_type: str, *args, **kwargs):\n        if model_type in CONFIG_MAPPING:\n            config_class = CONFIG_MAPPING[model_type]\n            return config_class(*args, **kwargs)\n        raise ValueError(\n            \"Unrecognized model identifier: {}. 
Should contain one of {}\".format(\n                model_type, \", \".join(CONFIG_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiates one of the configuration classes of the library\n        from a pre-trained model configuration.\n\n        The configuration class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Config` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertConfig` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertConfig` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertConfig` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaConfig` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerConfig` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaConfig` (RoBERTa model)\n            - `reformer`: :class:`~transformers1.ReformerConfig` (Reformer model)\n            - `bert`: :class:`~transformers1.BertConfig` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTConfig` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Config` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLConfig` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetConfig` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMConfig` (XLM model)\n            - `ctrl` : :class:`~transformers1.CTRLConfig` (CTRL model)\n            - `flaubert` : :class:`~transformers1.FlaubertConfig` (Flaubert model)\n            - `electra` : :class:`~transformers1.ElectraConfig` (ELECTRA model)\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                Is either: \\\n                    - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.\n                    - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                    - a path to a `directory` containing a configuration file saved using the :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                    - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.\n\n            cache_dir (:obj:`string`, optional, defaults to `None`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download (:obj:`boolean`, optional, defaults to `False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download (:obj:`boolean`, optional, defaults to `False`):\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n\n            proxies (:obj:`Dict[str, str]`, optional, defaults to `None`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`.\n                The proxies are used on each request. See `the requests documentation <https://requests.readthedocs.io/en/master/user/advanced/#proxies>`__ for usage.\n\n            return_unused_kwargs (:obj:`boolean`, optional, defaults to `False`):\n                - If False, then this function returns just the final configuration object.\n                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.\n\n            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): key/value pairs with which to update the configuration object after loading.\n                - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n\n        Examples::\n\n            config = AutoConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')\n            config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)\n\n        if \"model_type\" in config_dict:\n            config_class = CONFIG_MAPPING[config_dict[\"model_type\"]]\n            return config_class.from_dict(config_dict, **kwargs)\n        else:\n            # Fallback: use pattern matching on the string.\n            for pattern, config_class in CONFIG_MAPPING.items():\n                if pattern in pretrained_model_name_or_path:\n                    return config_class.from_dict(config_dict, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized model in {}. \"\n            \"Should have a `model_type` key in its config.json, or contain one of the following strings \"\n            \"in its name: {}\".format(pretrained_model_name_or_path, \", \".join(CONFIG_MAPPING.keys()))\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BART configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"facebook/bart-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json\",\n    \"facebook/bart-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json\",\n    \"facebook/bart-large-cnn\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json\",\n    \"facebook/bart-large-xsum\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/config.json\",\n    \"facebook/mbart-large-en-ro\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/config.json\",\n}\n\n\nclass BartConfig(PretrainedConfig):\n    r\"\"\"\n        Configuration class for Bart. Parameters are renamed from the fairseq implementation\n    \"\"\"\n    model_type = \"bart\"\n\n    def __init__(\n        self,\n        activation_dropout=0.0,\n        activation_function=\"gelu\",\n        vocab_size=50265,\n        d_model=1024,\n        encoder_ffn_dim=4096,\n        encoder_layers=12,\n        encoder_attention_heads=16,\n        decoder_ffn_dim=4096,\n        decoder_layers=12,\n        decoder_attention_heads=16,\n        encoder_layerdrop=0.0,\n        decoder_layerdrop=0.0,\n        attention_dropout=0.0,\n        dropout=0.1,\n        max_position_embeddings=1024,\n        init_std=0.02,\n        classifier_dropout=0.0,\n        num_labels=3,\n        is_encoder_decoder=True,\n        pad_token_id=1,\n        bos_token_id=0,\n        eos_token_id=2,\n        normalize_before=False,\n        add_final_layer_norm=False,\n        scale_embedding=False,\n        normalize_embedding=True,\n        static_position_embeddings=False,\n        add_bias_logits=False,\n        **common_kwargs\n    ):\n        r\"\"\"\n            :class:`~transformers1.BartConfig` is the configuration class for `BartModel`.\n            Examples:\n                config = BartConfig.from_pretrained('bart-large')\n                model = BartModel(config)\n        \"\"\"\n        if \"hidden_size\" in common_kwargs:\n            raise ValueError(\"hidden size is called d_model\")\n        super().__init__(\n            num_labels=num_labels,\n            pad_token_id=pad_token_id,\n            bos_token_id=bos_token_id,\n            eos_token_id=eos_token_id,\n            is_encoder_decoder=is_encoder_decoder,\n            **common_kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim\n        self.encoder_ffn_dim = encoder_ffn_dim\n        self.encoder_layers = self.num_hidden_layers = encoder_layers\n        self.encoder_attention_heads = 
encoder_attention_heads\n        self.encoder_layerdrop = encoder_layerdrop\n        self.decoder_layerdrop = decoder_layerdrop\n        self.decoder_ffn_dim = decoder_ffn_dim\n        self.decoder_layers = decoder_layers\n        self.decoder_attention_heads = decoder_attention_heads\n        self.max_position_embeddings = max_position_embeddings\n        self.init_std = init_std  # Normal(0, this parameter)\n        self.activation_function = activation_function\n\n        # Params introduced for Mbart\n        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True\n        self.normalize_embedding = normalize_embedding  # True for mbart, False otherwise\n        self.normalize_before = normalize_before  # combo of fairseq's encoder_ and decoder_normalize_before\n        self.add_final_layer_norm = add_final_layer_norm\n\n        # Params introduced for Marian\n        self.add_bias_logits = add_bias_logits\n        self.static_position_embeddings = static_position_embeddings\n\n        # 3 Types of Dropout\n        self.attention_dropout = attention_dropout\n        self.activation_dropout = activation_dropout\n        self.dropout = dropout\n\n        # Classifier stuff\n        self.classif_dropout = classifier_dropout\n\n    @property\n    def num_attention_heads(self) -> int:\n        return self.encoder_attention_heads\n\n    @property\n    def hidden_size(self) -> int:\n        return self.d_model\n\n    def is_valid_mbart(self) -> bool:\n        \"\"\"Is the configuration aligned with the MBART paper.\"\"\"\n        if self.normalize_before and self.add_final_layer_norm and self.scale_embedding:\n            return True\n        if self.normalize_before or self.add_final_layer_norm or self.scale_embedding:\n            logger.info(\"This configuration is a mixture of MBART and BART settings\")\n        return False\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json\",\n    \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json\",\n    \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json\",\n    \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json\",\n    \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json\",\n    \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json\",\n    \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json\",\n    \"bert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json\",\n    \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json\",\n    \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json\",\n    \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json\",\n    \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json\",\n    \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/config.json\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json\",\n    \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/config.json\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 
\"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/config.json\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json\",\n    \"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n}\n\n\nclass BertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.BertModel`.\n        It is used to instantiate an BERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the BERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            hidden_size (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 3072):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.BertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n       
     layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import BertModel, BertConfig\n\n            # Initializing a BERT bert-base-uncased style configuration\n            configuration = BertConfig()\n\n            # Initializing a model from the bert-base-uncased style configuration\n            model = BertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"bert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        hidden_size=768,\n        num_hidden_layers=12,\n        num_attention_heads=12,\n        intermediate_size=3072,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" CamemBERT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-config.json\",\n    \"umberto-commoncrawl-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-commoncrawl-cased-v1/config.json\",\n    \"umberto-wikipedia-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-wikipedia-uncased-v1/config.json\",\n}\n\n\nclass CamembertConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"camembert\"\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Salesforce CTRL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\"ctrl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-config.json\"}\n\n\nclass CTRLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.CTRLModel`.\n        It is used to instantiate an CTRL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 246534):\n                Vocabulary size of the CTRL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 256):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 1280):\n                Dimensionality of the embeddings and hidden states.\n            dff (:obj:`int`, optional, defaults to 8192):\n                Dimensionality of the inner dimension of the FFN.\n            n_layer (:obj:`int`, optional, defaults to 48):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n\n        Example::\n\n            from transformers1 import CTRLModel, CTRLConfig\n\n            # Initializing a CTRL configuration\n            configuration = CTRLConfig()\n\n            # Initializing a model from the configuration\n            model = CTRLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"ctrl\"\n\n    def __init__(\n        self,\n        vocab_size=246534,\n        n_positions=256,\n        n_ctx=256,\n        n_embd=1280,\n        dff=8192,\n        n_layer=48,\n        n_head=16,\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.dff = dff\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = 
summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" DistilBERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nDISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json\",\n    \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json\",\n    \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json\",\n    \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-config.json\",\n    \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json\",\n    \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-config.json\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-finetuned-sst-2-english-config.json\",\n}\n\n\nclass DistilBertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.DistilBertModel`.\n        It is used to instantiate a DistilBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the DistilBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings.\n            n_layers (:obj:`int`, optional, defaults to 6):\n                Number of hidden layers in the Transformer encoder.\n            n_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dim (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_dim (:obj:`int`, optional, defaults to 3072):\n                The size of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            activation (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            qa_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilities used in the question answering model\n                :class:`~transformers1.DistilBertForQuestionAnswering`.\n            seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):\n                The dropout probabilities used in the sequence classification model\n                :class:`~transformers1.DistilBertForSequenceClassification`.\n\n        Example::\n\n            from transformers1 import DistilBertModel, DistilBertConfig\n\n            # Initializing a DistilBERT configuration\n            configuration = DistilBertConfig()\n\n            # Initializing a model from the configuration\n            model = DistilBertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"distilbert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        max_position_embeddings=512,\n        sinusoidal_pos_embds=False,\n        n_layers=6,\n        n_heads=12,\n        dim=768,\n        hidden_dim=4 * 768,\n        dropout=0.1,\n        attention_dropout=0.1,\n        activation=\"gelu\",\n        initializer_range=0.02,\n        qa_dropout=0.1,\n        seq_classif_dropout=0.2,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(**kwargs, pad_token_id=pad_token_id)\n        self.vocab_size = vocab_size\n        self.max_position_embeddings = max_position_embeddings\n        self.sinusoidal_pos_embds = sinusoidal_pos_embds\n        
self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dim = dim\n        self.hidden_dim = hidden_dim\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.activation = activation\n        self.initializer_range = initializer_range\n        self.qa_dropout = qa_dropout\n        self.seq_classif_dropout = seq_classif_dropout\n\n    @property\n    def hidden_size(self):\n        return self.dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_electra.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ELECTRA model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/config.json\",\n    \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/config.json\",\n    \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/config.json\",\n    \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/config.json\",\n    \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/config.json\",\n    \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/config.json\",\n}\n\n\nclass ElectraConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ElectraModel`.\n        It is used to instantiate an ELECTRA model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the ELECTRA model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ElectraModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 4):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.ElectraModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import ElectraModel, ElectraConfig\n\n            # Initializing a ELECTRA electra-base-uncased style configuration\n            configuration = ElectraConfig()\n\n            # Initializing a model from the electra-base-uncased style configuration\n            model = ElectraModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"electra\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        embedding_size=128,\n        hidden_size=256,\n        num_hidden_layers=12,\n        num_attention_heads=4,\n        intermediate_size=1024,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = 
num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport copy\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderConfig(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.\n\n        It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig`\n        and can be used to control the model outputs.\n        See the documentation for :class:`~transformers1.PretrainedConfig` for more information.\n\n        Args:\n            kwargs (`optional`):\n                Remaining dictionary of keyword arguments. Notably:\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the encoder config.\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the decoder config.\n\n        Example::\n\n            from transformers1 import BertConfig, EncoderDecoderConfig, EncoderDecoderModel\n\n            # Initializing a BERT bert-base-uncased style configuration\n            config_encoder = BertConfig()\n            config_decoder = BertConfig()\n\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)\n\n            # Initializing a Bert2Bert model from the bert-base-uncased style configurations\n            model = EncoderDecoderModel(config=config)\n\n            # Accessing the model configuration\n            config_encoder = model.config.encoder\n            config_decoder  = model.config.decoder\n    \"\"\"\n    model_type = \"encoder_decoder\"\n\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        assert (\n            \"encoder\" in kwargs and \"decoder\" in kwargs\n        ), \"Config has to be initialized with encoder and decoder config\"\n        encoder_config = kwargs.pop(\"encoder\")\n        encoder_model_type = encoder_config.pop(\"model_type\")\n        decoder_config = kwargs.pop(\"decoder\")\n        decoder_model_type = decoder_config.pop(\"model_type\")\n\n        from transformers import AutoConfig\n\n        self.encoder = AutoConfig.for_model(encoder_model_type, **encoder_config)\n        self.decoder = AutoConfig.for_model(decoder_model_type, **decoder_config)\n        self.is_encoder_decoder = True\n\n    @classmethod\n    def from_encoder_decoder_configs(\n        cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig\n    ) -> PretrainedConfig:\n        r\"\"\"\n        
Instantiate a :class:`~transformers1.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.\n\n        Returns:\n            :class:`EncoderDecoderConfig`: An instance of a configuration object\n        \"\"\"\n        return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict())\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary. Override the default `to_dict()` from `PretrainedConfig`.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        output[\"encoder\"] = self.encoder.to_dict()\n        output[\"decoder\"] = self.decoder.to_dict()\n        output[\"model_type\"] = self.__class__.model_type\n        return output\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Flaubert configuration, based on XLM. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm import XLMConfig\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/config.json\",\n    \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/config.json\",\n    \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/config.json\",\n    \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/config.json\",\n}\n\n\nclass FlaubertConfig(XLMConfig):\n    \"\"\"\n        Configuration class to store the configuration of a `FlaubertModel`.\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Whether to apply the layer normalization before or after the feed forward layer following the\n                attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)\n            layerdrop (:obj:`float`, `optional`, defaults to 0.0):\n                Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand\n                with Structured Dropout. ICLR 2020)\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the Flaubert model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.FlaubertModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`int`, optional, defaults to 50257):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n    \"\"\"\n\n    model_type = \"flaubert\"\n\n    def __init__(self, layerdrop=0.0, pre_norm=False, pad_token_id=2, bos_token_id=0, **kwargs):\n        \"\"\"Constructs FlaubertConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.layerdrop = layerdrop\n        self.pre_norm = pre_norm\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT-2 configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json\",\n    \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json\",\n    \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json\",\n    \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-config.json\",\n    \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json\",\n}\n\n\nclass GPT2Config(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.GPT2Model`.\n        It is used to instantiate an GPT-2 model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 50257):\n                Vocabulary size of the GPT-2 model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.GPT2Model`.\n            n_positions (:obj:`int`, optional, defaults to 1024):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            activation_function (:obj:`str`, optional, defaults to 'gelu'):\n                Activation function selected in the list [\"relu\", \"swish\", \"gelu\", \"tanh\", \"gelu_new\"].\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 16):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import GPT2Model, GPT2Config\n\n            # Initializing a GPT2 configuration\n            configuration = GPT2Config()\n\n            # Initializing a model from the configuration\n            model = GPT2Model(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"gpt2\"\n\n    def __init__(\n        self,\n        vocab_size=50257,\n        n_positions=1024,\n        n_ctx=1024,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        activation_function=\"gelu_new\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        bos_token_id=50256,\n        eos_token_id=50256,\n        **kwargs\n    ):\n        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.activation_function = activation_function\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n        self.bos_token_id = bos_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Longformer configuration \"\"\"\n\nimport logging\nfrom typing import List, Union\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"allenai/longformer-base-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/config.json\",\n    \"allenai/longformer-large-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/config.json\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-finetuned-triviaqa/config.json\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096-extra.pos.embd.only/config.json\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-extra.pos.embd.only/config.json\",\n}\n\n\nclass LongformerConfig(RobertaConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.LongformerModel`.\n        It is used to instantiate an Longformer model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length 4,096.\n\n        The :class:`~transformers1.LongformerConfig` class directly inherits :class:`~transformers1.RobertaConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Args:\n            attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):\n                Size of an attention window around each token. If :obj:`int`, use the same size for all layers.\n                To specify a different window size for each layer, use a :obj:`List[int]` where\n                ``len(attention_window) == num_hidden_layers``.\n\n        Example::\n\n            from transformers1 import LongformerConfig, LongformerModel\n\n            # Initializing a Longformer configuration\n            configuration = LongformerConfig()\n\n            # Initializing a model from the configuration\n            model = LongformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"longformer\"\n\n    def __init__(self, attention_window: Union[List[int], int] = 512, sep_token_id: int = 2, **kwargs):\n        super().__init__(**kwargs)\n        self.attention_window = attention_window\n        self.sep_token_id = sep_token_id\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 The OPUS-NMT Team, Marian team, and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Marian model configuration \"\"\"\n\nfrom .configuration_bart import BartConfig\n\n\nPRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"Helsinki-NLP/opus-mt-en-de\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/config.json\",\n}\n\n\nclass MarianConfig(BartConfig):\n    model_type = \"marian\"\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" MMBT configuration \"\"\"\n\n\nimport logging\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass MMBTConfig(object):\n    \"\"\"Configuration class to store the configuration of a `MMBT Model`.\n\n    Args:\n        config (:obj:`~transformers1.PreTrainedConfig`):\n            Config of the underlying Transformer models. Its values are\n            copied over to use a single config.\n        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):\n            Size of final Linear layer for classification.\n        modal_hidden_size (:obj:`int`, optional, defautls to 2048):\n            Embedding dimension of the non-text modality encoder.\n    \"\"\"\n\n    def __init__(self, config, num_labels=None, modal_hidden_size=2048):\n        self.__dict__ = config.__dict__\n        self.modal_hidden_size = modal_hidden_size\n        if num_labels:\n            self.num_labels = num_labels\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json\"\n}\n\n\nclass OpenAIGPTConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.OpenAIGPTModel`.\n        It is used to instantiate an GPT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 40478):\n                Vocabulary size of the GPT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            afn (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether special tokens should be predicted when the model is has a language modeling head.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import OpenAIGPTConfig, OpenAIGPTModel\n\n            # Initializing a GPT configuration\n            configuration = OpenAIGPTConfig()\n\n            # Initializing a model from the configuration\n            model = OpenAIGPTModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"openai-gpt\"\n\n    def __init__(\n        self,\n        vocab_size=40478,\n        n_positions=512,\n        n_ctx=512,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        afn=\"gelu\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        predict_special_tokens=True,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.afn = afn\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.predict_special_tokens = predict_special_tokens\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Reformer model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json\",\n    \"google/reformer-enwik8\": \"https://cdn.huggingface.co/google/reformer-enwik8/config.json\",\n}\n\n\nclass ReformerConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ReformerModel`.\n        It is used to instantiate an Reformer model according to the specified arguments, defining the model\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            attention_head_size (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the projected key, query and value vectors\n            attn_layers (:obj:`list(str)`, optional, defaults to [\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"]):\n                List of attention layer types in ascending order. It can be chosen between a\n                LSHSelfAttention layer (\"lsh\") and a LocalSelfAttention layer (\"local\").\n                For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .\n                For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .\n            axial_pos_embds (:obj:`bool`, optional, defaults to True):\n                If `True` use axial position embeddings. 
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__\n            axial_norm_std (:obj:`float`, optional, defaluts to 1.0):\n                The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.\n            axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):\n                The position dims of the axial position encodings.\n                During training the product of the position dims has to equal the sequence length.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):\n                The embedding dims of the axial position encodings.\n                The sum of the embedding dims has to equal the hidden size.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            chunk_size_lm_head (:obj:`int`, optional, defaults to 0):\n                The chunk size of the final language model feed forward head layer.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            chunk_size_feed_forward (:obj:`int`, optional, defaults to 0):\n                The chunk size of all feed forward layers in the residual attention blocks.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            eos_token_id (:obj:`int`, optional, defaults to 2):\n                The token id for the <EOS> token.\n            feed_forward_size (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the \"feed_forward\" (i.e., feed-forward) layer in the residual attention block.\n            hash_seed (:obj:`int`, optional, defaults to `None`):\n                Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposed. 
For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in local sensitive hashing scheme.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"relu\"):\n                The non-linear activation function (function or string) in the feed forward layer in the residual attention block.\n                If string, \"gelu\", \"relu\", \"swish\", \"gelu_new\" and \"gelu_fast\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the output hidden states of the residual attention blocks.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            is_decoder (:obj:`bool`, optional, defaults to False):\n                If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.\n                When using the Reformer for causal language modeling, `is_decoder` is set to `True`.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            local_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            local_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LocalSelfAttention layer to itself.\n            local_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.\n            local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LocalSelfAttention.\n            lsh_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LSHSelfAttention. 
Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LSHSelfAttention.\n            max_position_embeddings (:obj:`int`, optional, defaults to 4096):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):\n                Number of buckets, the key query vectors can be \"hashed into\" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.\n                The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.\n                The number of buckets (or the product the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.\n            num_hashes (:obj:`int`, optional, defaults to 1):\n                Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.\n                The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.\n            pad_token_id (:obj:`int`, optional, defaults to 0):\n                The token id for the <PAD> token.\n            vocab_size (:obj:`int`, optional, defaults to 320):\n                Vocabulary size of the Reformer model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ReformerModel`.\n\n        Example::\n\n            from transformers1 import ReformerModel, ReformerConfig\n\n            # Initializing a Reformer configuration\n            configuration = ReformerConfig()\n\n            # Initializing a Reformer model\n            model = ReformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"reformer\"\n\n    def __init__(\n        self,\n        attention_head_size=64,\n        attn_layers=[\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"],\n        axial_norm_std=1.0,\n        axial_pos_embds=True,\n        axial_pos_shape=[64, 64],\n        axial_pos_embds_dim=[64, 192],\n        chunk_size_lm_head=0,\n        chunk_size_feed_forward=0,\n        eos_token_id=2,\n        feed_forward_size=512,\n        hash_seed=None,\n        hidden_act=\"relu\",\n        hidden_dropout_prob=0.05,\n        hidden_size=256,\n        initializer_range=0.02,\n        is_decoder=False,\n        layer_norm_eps=1e-12,\n        local_num_chunks_before=1,\n        local_num_chunks_after=0,\n        local_attention_probs_dropout_prob=0.05,\n        local_attn_chunk_length=64,\n        lsh_attn_chunk_length=64,\n        lsh_attention_probs_dropout_prob=0.0,\n        lsh_num_chunks_before=1,\n        lsh_num_chunks_after=0,\n        max_position_embeddings=4096,\n        num_attention_heads=2,\n        num_buckets=None,\n        num_hashes=1,\n        pad_token_id=0,\n        vocab_size=320,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_decoder=is_decoder, **kwargs)\n\n        self.hash_seed = hash_seed\n        self.vocab_size = vocab_size\n        self.attention_head_size = attention_head_size\n        self.hidden_size = hidden_size\n        self.num_attention_heads = num_attention_heads\n        self.num_hashes = num_hashes\n        self.num_hidden_layers = len(attn_layers)\n        self.num_buckets = tuple(num_buckets) if isinstance(num_buckets, list) else num_buckets\n        self.lsh_attn_chunk_length = lsh_attn_chunk_length\n        self.local_attn_chunk_length = local_attn_chunk_length\n        self.lsh_num_chunks_after = lsh_num_chunks_after\n        self.lsh_num_chunks_before = lsh_num_chunks_before\n        self.local_num_chunks_after = local_num_chunks_after\n        self.local_num_chunks_before = local_num_chunks_before\n        self.hidden_act = hidden_act\n        self.feed_forward_size = feed_forward_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.lsh_attention_probs_dropout_prob = lsh_attention_probs_dropout_prob\n        self.local_attention_probs_dropout_prob = local_attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.axial_pos_embds = axial_pos_embds\n        self.axial_pos_shape = tuple(axial_pos_shape)\n        self.axial_pos_embds_dim = tuple(axial_pos_embds_dim)\n        self.axial_norm_std = axial_norm_std\n        self.chunk_size_lm_head = chunk_size_lm_head\n        self.chunk_size_feed_forward = chunk_size_feed_forward\n        self.attn_layers = attn_layers\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_bert import BertConfig\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json\",\n    \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json\",\n    \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json\",\n    \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json\",\n    \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json\",\n    \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json\",\n}\n\n\nclass RobertaConfig(BertConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.RobertaModel`.\n        It is used to instantiate an RoBERTa model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        The :class:`~transformers1.RobertaConfig` class directly inherits :class:`~transformers1.BertConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Example::\n\n            from transformers1 import RobertaConfig, RobertaModel\n\n            # Initializing a RoBERTa configuration\n            configuration = RobertaConfig()\n\n            # Initializing a model from the configuration\n            model = RobertaModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"roberta\"\n\n    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):\n        \"\"\"Constructs RobertaConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_t5.py",
    "content": "# coding=utf-8\n# Copyright 2010, The T5 Authors and HuggingFace Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" T5 model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nT5_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-config.json\",\n    \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-config.json\",\n    \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-config.json\",\n    \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-config.json\",\n    \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-config.json\",\n}\n\n\nclass T5Config(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.T5Config` is the configuration class to store the configuration of a\n        `T5Model`.\n\n\n        Arguments:\n            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.\n            d_model: Size of the encoder layers and the pooler layer. `d_model` can also accesed via the property `hidden_size`.\n            num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.\n            num_heads: Number of attention heads for each attention layer in\n                the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.\n            intermediate_size: The size of the \"intermediate\" (i.e., feed-forward)\n                layer in the Transformer encoder.\n            hidden_act: The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob: The dropout probabilitiy for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob: The dropout ratio for the attention\n                probabilities.\n            n_positions: The maximum sequence length that this model might\n                ever be used with. Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048). 
`n_positions` can also be accessed via the property `max_position_embeddings'.\n            type_vocab_size: The vocabulary size of the `token_type_ids` passed into\n                `T5Model`.\n            initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).\n            layer_norm_eps: The epsilon used by LayerNorm.\n    \"\"\"\n    model_type = \"t5\"\n\n    def __init__(\n        self,\n        vocab_size=32128,\n        n_positions=512,\n        d_model=512,\n        d_kv=64,\n        d_ff=2048,\n        num_layers=6,\n        num_heads=8,\n        relative_attention_num_buckets=32,\n        dropout_rate=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_factor=1.0,\n        is_encoder_decoder=True,\n        pad_token_id=0,\n        eos_token_id=1,\n        **kwargs\n    ):\n        super().__init__(\n            pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_encoder_decoder=is_encoder_decoder, **kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.n_positions = n_positions\n        self.d_model = d_model\n        self.d_kv = d_kv\n        self.d_ff = d_ff\n        self.num_layers = num_layers\n        self.num_heads = num_heads\n        self.relative_attention_num_buckets = relative_attention_num_buckets\n        self.dropout_rate = dropout_rate\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_factor = initializer_factor\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.num_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.num_layers\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Transformer XL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json\",\n}\n\n\nclass TransfoXLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.TransfoXLModel`.\n        It is used to instantiate a Transformer XL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 267735):\n                Vocabulary size of the Transformer XL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.TransfoXLModel`.\n            cutoffs (:obj:`List[int]`, optional, defaults to :obj:`[20000, 40000, 200000]`):\n                Cutoffs for the adaptive softmax\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the model's hidden states.\n            d_embed (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the embeddings\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_head (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the model's heads.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Inner dimension in FF\n            div_val (:obj:`int`, optional, defaults to 4):\n                Divident value for adapative input and softmax\n            pre_lnorm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Apply LayerNorm to the input instead of the output\n            n_layer (:obj:`int`, optional, defaults to 18):\n                Number of hidden layers in the Transformer encoder.\n            tgt_len (:obj:`int`, optional, defaults to 128):\n                Number of tokens to predict\n            ext_len (:obj:`int`, optional, defaults to 0):\n                Length of the extended context\n            mem_len (:obj:`int`, optional, defaults to 1600):\n                Length of the retained previous heads\n            clamp_len (:obj:`int`, optional, defaults to 1000):\n                use the same pos embeddings after clamp_len\n            same_length (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Use the same attn length for all tokens\n            proj_share_all_but_first (:obj:`boolean`, optional, defaults to :obj:`True`):\n                True to share all but first projs, False not to share.\n            attn_type (:obj:`int`, optional, defaults to 0):\n                Attention type. 
0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.\n            sample_softmax (:obj:`int`, optional, defaults to -1):\n                number of samples in sampled softmax\n            adaptive (:obj:`boolean`, optional, defaults to :obj:`True`):\n                use adaptive softmax\n            tie_weight (:obj:`boolean`, optional, defaults to :obj:`True`):\n                tie the word embedding and softmax weights\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            dropatt (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            init (:obj:`string`, optional, defaults to `normal`):\n                Parameter initializer to use\n            init_range (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by U(-init_range, init_range).\n            proj_init_std (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by N(0, init_std)\n            init_std (:obj:`float`, optional, defaults to 0.02):\n                Parameters initialized by N(0, init_std)\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n\n        Example::\n\n            from transformers1 import TransfoXLConfig, TransfoXLModel\n\n            # Initializing a Transformer XL configuration\n            configuration = TransfoXLConfig()\n\n            # Initializing a model from the configuration\n            model = TransfoXLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"transfo-xl\"\n\n    def __init__(\n        self,\n        vocab_size=267735,\n        cutoffs=[20000, 40000, 200000],\n        d_model=1024,\n        d_embed=1024,\n        n_head=16,\n        d_head=64,\n        d_inner=4096,\n        div_val=4,\n        pre_lnorm=False,\n        n_layer=18,\n        tgt_len=128,\n        ext_len=0,\n        mem_len=1600,\n        clamp_len=1000,\n        same_length=True,\n        proj_share_all_but_first=True,\n        attn_type=0,\n        sample_softmax=-1,\n        adaptive=True,\n        tie_weight=True,\n        dropout=0.1,\n        dropatt=0.0,\n        untie_r=True,\n        init=\"normal\",\n        init_range=0.01,\n        proj_init_std=0.01,\n        init_std=0.02,\n        layer_norm_epsilon=1e-5,\n        eos_token_id=0,\n        **kwargs\n    ):\n        super().__init__(eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.cutoffs = []\n        self.cutoffs.extend(cutoffs)\n        self.tie_weight = tie_weight\n        if proj_share_all_but_first:\n            self.tie_projs = [False] + [True] * len(self.cutoffs)\n        else:\n            self.tie_projs = [False] + [False] * len(self.cutoffs)\n        self.d_model = d_model\n        self.d_embed = d_embed\n        self.d_head = d_head\n        self.d_inner = d_inner\n        self.div_val = div_val\n        self.pre_lnorm = pre_lnorm\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.tgt_len = tgt_len\n        self.ext_len = ext_len\n        self.mem_len = mem_len\n        self.same_length 
= same_length\n        self.attn_type = attn_type\n        self.clamp_len = clamp_len\n        self.sample_softmax = sample_softmax\n        self.adaptive = adaptive\n        self.dropout = dropout\n        self.dropatt = dropatt\n        self.untie_r = untie_r\n        self.init = init\n        self.init_range = init_range\n        self.proj_init_std = proj_init_std\n        self.init_std = init_std\n        self.layer_norm_epsilon = layer_norm_epsilon\n\n    @property\n    def max_position_embeddings(self):\n        return self.tgt_len + self.ext_len + self.mem_len\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\nfrom typing import Dict, Tuple\n\nfrom .file_utils import CONFIG_NAME, cached_path, hf_bucket_url, is_remote_url\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PretrainedConfig(object):\n    r\"\"\" Base class for all configuration classes.\n        Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.\n\n        Note:\n            A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does **not** load the model weights.\n            It only affects the model's configuration.\n\n        Class attributes (overridden by derived classes):\n            - ``model_type``: a string that identifies the model type, that we serialize into the JSON file, and that we use to recreate the correct object in :class:`~transformers1.AutoConfig`.\n\n        Args:\n            finetuning_task (:obj:`string` or :obj:`None`, `optional`, defaults to :obj:`None`):\n                Name of the task used to fine-tune the model. 
This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.\n            num_labels (:obj:`int`, `optional`, defaults to `2`):\n                Number of classes to use when the model is a classification model (sequences/tokens)\n            output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Should the model returns attentions weights.\n            output_hidden_states (:obj:`string`, `optional`, defaults to :obj:`False`):\n                Should the model returns all hidden-states.\n            torchscript (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Is the model used with Torchscript (for PyTorch models).\n    \"\"\"\n    model_type: str = \"\"\n\n    def __init__(self, **kwargs):\n        # Attributes with defaults\n        self.output_attentions = kwargs.pop(\"output_attentions\", False)\n        self.output_hidden_states = kwargs.pop(\"output_hidden_states\", False)\n        self.use_cache = kwargs.pop(\"use_cache\", True)  # Not used by all models\n        self.torchscript = kwargs.pop(\"torchscript\", False)  # Only used by PyTorch models\n        self.use_bfloat16 = kwargs.pop(\"use_bfloat16\", False)\n        self.pruned_heads = kwargs.pop(\"pruned_heads\", {})\n\n        # Is decoder is used in encoder-decoder models to differentiate encoder from decoder\n        self.is_encoder_decoder = kwargs.pop(\"is_encoder_decoder\", False)\n        self.is_decoder = kwargs.pop(\"is_decoder\", False)\n\n        # Parameters for sequence generation\n        self.max_length = kwargs.pop(\"max_length\", 20)\n        self.min_length = kwargs.pop(\"min_length\", 0)\n        self.do_sample = kwargs.pop(\"do_sample\", False)\n        self.early_stopping = kwargs.pop(\"early_stopping\", False)\n        self.num_beams = kwargs.pop(\"num_beams\", 1)\n        self.temperature = kwargs.pop(\"temperature\", 1.0)\n        self.top_k = kwargs.pop(\"top_k\", 50)\n        self.top_p = kwargs.pop(\"top_p\", 1.0)\n        self.repetition_penalty = kwargs.pop(\"repetition_penalty\", 1.0)\n        self.length_penalty = kwargs.pop(\"length_penalty\", 1.0)\n        self.no_repeat_ngram_size = kwargs.pop(\"no_repeat_ngram_size\", 0)\n        self.bad_words_ids = kwargs.pop(\"bad_words_ids\", None)\n        self.num_return_sequences = kwargs.pop(\"num_return_sequences\", 1)\n\n        # Fine-tuning task arguments\n        self.architectures = kwargs.pop(\"architectures\", None)\n        self.finetuning_task = kwargs.pop(\"finetuning_task\", None)\n        self.id2label = kwargs.pop(\"id2label\", None)\n        self.label2id = kwargs.pop(\"label2id\", None)\n        if self.id2label is not None:\n            kwargs.pop(\"num_labels\", None)\n            self.id2label = dict((int(key), value) for key, value in self.id2label.items())\n            # Keys are always strings in JSON so convert ids to int here.\n        else:\n            self.num_labels = kwargs.pop(\"num_labels\", 2)\n\n        # Tokenizer arguments TODO: eventually tokenizer and models should share the same config\n        self.prefix = kwargs.pop(\"prefix\", None)\n        self.bos_token_id = kwargs.pop(\"bos_token_id\", None)\n        self.pad_token_id = kwargs.pop(\"pad_token_id\", None)\n        self.eos_token_id = kwargs.pop(\"eos_token_id\", None)\n        self.decoder_start_token_id = kwargs.pop(\"decoder_start_token_id\", None)\n\n        # task specific arguments\n        self.task_specific_params = kwargs.pop(\"task_specific_params\", None)\n\n        # 
TPU arguments\n        self.xla_device = kwargs.pop(\"xla_device\", None)\n\n        # Additional attributes without default values\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    @property\n    def num_labels(self):\n        return len(self.id2label)\n\n    @num_labels.setter\n    def num_labels(self, num_labels):\n        self.id2label = {i: \"LABEL_{}\".format(i) for i in range(num_labels)}\n        self.label2id = dict(zip(self.id2label.values(), self.id2label.keys()))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save a configuration object to the directory `save_directory`, so that it\n        can be re-loaded using the :func:`~transformers1.PretrainedConfig.from_pretrained` class method.\n\n        Args:\n            save_directory (:obj:`string`):\n                Directory where the configuration JSON file will be saved.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_config_file = os.path.join(save_directory, CONFIG_NAME)\n\n        self.to_json_file(output_config_file, use_diff=True)\n        logger.info(\"Configuration saved in {}\".format(output_config_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs) -> \"PretrainedConfig\":\n        r\"\"\"\n\n        Instantiate a :class:`~transformers1.PretrainedConfig` (or a derived class) from a pre-trained model configuration.\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                either:\n                  - a string with the `shortcut name` of a pre-trained model configuration to load from cache or\n                    download, e.g.: ``bert-base-uncased``.\n                  - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to\n                    our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                  - a path to a `directory` containing a configuration file saved using the\n                    :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                  - a path or url to a saved configuration JSON `file`, e.g.:\n                    ``./my_model_directory/configuration.json``.\n            cache_dir (:obj:`string`, `optional`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            kwargs (:obj:`Dict[str, any]`, `optional`):\n                The values in kwargs of any keys which are configuration attributes will be used to override the loaded\n                values. 
Behavior concerning key/value pairs whose keys are *not* configuration attributes is\n                controlled by the `return_unused_kwargs` keyword parameter.\n            force_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n            proxies (:obj:`Dict`, `optional`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.:\n                :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.`\n                The proxies are used on each request.\n            return_unused_kwargs: (`optional`) bool:\n                If False, then this function returns just the final configuration object.\n                If True, then this functions returns a :obj:`Tuple(config, unused_kwargs)` where `unused_kwargs` is a\n                dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part\n                of kwargs which has not been used to update `config` and is otherwise ignored.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        Examples::\n\n            # We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a\n            # derived class: BertConfig\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. 
config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')\n            config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)\n        return cls.from_dict(config_dict, **kwargs)\n\n    @classmethod\n    def get_config_dict(cls, pretrained_model_name_or_path: str, **kwargs) -> Tuple[Dict, Dict]:\n        \"\"\"\n        From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used\n        for instantiating a Config using `from_dict`.\n\n        Parameters:\n            pretrained_model_name_or_path (:obj:`string`):\n                The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.\n\n        Returns:\n            :obj:`Tuple[Dict, Dict]`: The dictionary that will be used to instantiate the configuration object.\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        if os.path.isdir(pretrained_model_name_or_path):\n            config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            config_file = pretrained_model_name_or_path\n        else:\n            config_file = hf_bucket_url(pretrained_model_name_or_path, filename=CONFIG_NAME, use_cdn=False)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_config_file = cached_path(\n                config_file,\n                cache_dir=cache_dir,\n                force_download=force_download,\n                proxies=proxies,\n                resume_download=resume_download,\n                local_files_only=local_files_only,\n            )\n            # Load config dict\n            if resolved_config_file is None:\n                raise EnvironmentError\n            config_dict = cls._dict_from_json_file(resolved_config_file)\n\n        except EnvironmentError:\n            msg = (\n                f\"Can't load config for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\\n\\n\"\n            )\n            raise EnvironmentError(msg)\n\n        except json.JSONDecodeError:\n            msg = (\n                \"Couldn't reach server at '{}' to download configuration file or \"\n                \"configuration file is not a valid JSON file. 
\"\n                \"Please check network or file content here: {}.\".format(config_file, resolved_config_file)\n            )\n            raise EnvironmentError(msg)\n\n        if resolved_config_file == config_file:\n            logger.info(\"loading configuration file {}\".format(config_file))\n        else:\n            logger.info(\"loading configuration file {} from cache at {}\".format(config_file, resolved_config_file))\n\n        return config_dict, kwargs\n\n    @classmethod\n    def from_dict(cls, config_dict: Dict, **kwargs) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from a Python dictionary of parameters.\n\n        Args:\n            config_dict (:obj:`Dict[str, any]`):\n                Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved\n                from a pre-trained checkpoint by leveraging the :func:`~transformers1.PretrainedConfig.get_config_dict`\n                method.\n            kwargs (:obj:`Dict[str, any]`):\n                Additional parameters from which to initialize the configuration object.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n        \"\"\"\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        config = cls(**config_dict)\n\n        if hasattr(config, \"pruned_heads\"):\n            config.pruned_heads = dict((int(key), value) for key, value in config.pruned_heads.items())\n\n        # Update config with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(config, key):\n                setattr(config, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model config %s\", str(config))\n        if return_unused_kwargs:\n            return config, kwargs\n        else:\n            return config\n\n    @classmethod\n    def from_json_file(cls, json_file: str) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from the path to a json file of parameters.\n\n        Args:\n            json_file (:obj:`string`):\n                Path to the JSON file containing the parameters.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        \"\"\"\n        config_dict = cls._dict_from_json_file(json_file)\n        return cls(**config_dict)\n\n    @classmethod\n    def _dict_from_json_file(cls, json_file: str):\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        return json.loads(text)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return \"{} {}\".format(self.__class__.__name__, self.to_json_string())\n\n    def to_diff_dict(self):\n        \"\"\"\n        Removes all attributes from config which correspond to the default\n        config attributes for better readability and serializes to a Python\n        dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        config_dict = self.to_dict()\n\n        # get the default config dict\n        default_config_dict = PretrainedConfig().to_dict()\n\n        serializable_config_dict = {}\n\n        # only serialize values that differ from the default config\n        for key, value in 
config_dict.items():\n            if key not in default_config_dict or value != default_config_dict[key]:\n                serializable_config_dict[key] = value\n\n        return serializable_config_dict\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        if hasattr(self.__class__, \"model_type\"):\n            output[\"model_type\"] = self.__class__.model_type\n        return output\n\n    def to_json_string(self, use_diff=True):\n        \"\"\"\n        Serializes this instance to a JSON string.\n\n        Args:\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON string.\n\n        Returns:\n            :obj:`string`: String containing all the attributes that make up this configuration instance in JSON format.\n        \"\"\"\n        if use_diff is True:\n            config_dict = self.to_diff_dict()\n        else:\n            config_dict = self.to_dict()\n        return json.dumps(config_dict, indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path, use_diff=True):\n        \"\"\"\n        Save this instance to a json file.\n\n        Args:\n            json_file_path (:obj:`string`):\n                Path to the JSON file in which this configuration instance's parameters will be saved.\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON file.\n        \"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(self.to_json_string(use_diff=use_diff))\n\n    def update(self, config_dict: Dict):\n        \"\"\"\n        Updates attributes of this class\n        with attributes from `config_dict`.\n\n        Args:\n            :obj:`Dict[str, any]`: Dictionary of attributes that shall be updated for this class.\n        \"\"\"\n        for key, value in config_dict.items():\n            setattr(self, key, value)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-config.json\",\n    \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-config.json\",\n    \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-config.json\",\n    \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-config.json\",\n    \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-config.json\",\n    \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-config.json\",\n    \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-config.json\",\n    \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-config.json\",\n    \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-config.json\",\n    \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-config.json\",\n}\n\n\nclass XLMConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the XLM model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLMModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size).\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n\n        Example::\n\n            from transformers1 import XLMConfig, XLMModel\n\n            # Initializing a XLM configuration\n            configuration = XLMConfig()\n\n            # Initializing a model from the configuration\n            model = XLMModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlm\"\n\n    def __init__(\n        self,\n        vocab_size=30145,\n        emb_dim=2048,\n        n_layers=12,\n        n_heads=16,\n        dropout=0.1,\n        attention_dropout=0.1,\n        gelu_activation=True,\n        sinusoidal_embeddings=False,\n        causal=False,\n        asm=False,\n        n_langs=1,\n        use_lang_emb=True,\n        max_position_embeddings=512,\n        embed_init_std=2048 ** -0.5,\n        layer_norm_eps=1e-12,\n        init_std=0.02,\n        bos_index=0,\n        eos_index=1,\n        pad_index=2,\n        unk_index=3,\n        mask_index=5,\n        is_encoder=True,\n        summary_type=\"first\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        mask_token_id=0,\n        lang_id=0,\n        pad_token_id=2,\n        bos_token_id=0,\n        **kwargs\n    ):\n        \"\"\"Constructs XLMConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.emb_dim = emb_dim\n        self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.gelu_activation = gelu_activation\n        self.sinusoidal_embeddings = sinusoidal_embeddings\n        self.causal = causal\n        self.asm = asm\n        self.n_langs = n_langs\n        self.use_lang_emb = use_lang_emb\n        self.layer_norm_eps = layer_norm_eps\n        self.bos_index = bos_index\n        self.eos_index = eos_index\n        self.pad_index = pad_index\n        self.unk_index = unk_index\n        self.mask_index = mask_index\n        self.is_encoder = is_encoder\n        self.max_position_embeddings = max_position_embeddings\n        self.embed_init_std = embed_init_std\n        self.init_std = init_std\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_proj_to_labels = summary_proj_to_labels\n        self.summary_first_dropout = summary_first_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n        
self.mask_token_id = mask_token_id\n        self.lang_id = lang_id\n\n        if \"n_words\" in kwargs:\n            self.n_words = kwargs[\"n_words\"]\n\n    @property\n    def n_words(self):  # For backward compatibility\n        return self.vocab_size\n\n    @n_words.setter\n    def n_words(self, value):  # For backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.emb_dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM-RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-config.json\",\n    \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-config.json\",\n}\n\n\nclass XLMRobertaConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"xlm-roberta\"\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/configuration_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLNet configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json\",\n    \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-config.json\",\n}\n\n\nclass XLNetConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLNetModel`.\n        It is used to instantiate an XLNet model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 32000):\n                Vocabulary size of the XLNet model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLNetModel`.\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 24):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            ff_activation (:obj:`string`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\" and \"swish\" are supported.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            attn_type (:obj:`string`, optional, defaults to \"bi\"):\n                The attention type used by the model. 
Set 'bi' for XLNet, 'uni' for Transformer-XL.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            mem_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens to cache. The key/value pairs that have already been pre-computed\n                in a previous forward pass won't be re-computed. See the\n                `quickstart <https://huggingface.co/transformers/quickstart.html#using-the-past>`__\n                for more information.\n            reuse_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens in the current batch to be cached and reused in the future.\n            bi_data (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use bidirectional input pipeline. Usually set to `True` during\n                pretraining and `False` during finetuning.\n            clamp_len (:obj:`int`, optional, defaults to -1):\n                Clamp all relative distances larger than clamp_len.\n                Setting this attribute to -1 means no clamping.\n            same_length (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use the same attention length for each token.\n            summary_type (:obj:`string`, optional, defaults to \"last\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Is one of the following options:\n                    - 'last' => take the last token hidden state (like XLNet)\n                    - 'first' => take the first token hidden state (like Bert)\n                    - 'mean' => take the mean of all tokens hidden states\n                    - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                    - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_last_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a dropout after the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n\n        Example::\n\n            from transformers1 import XLNetConfig, XLNetModel\n\n            # Initializing a XLNet configuration\n            configuration = XLNetConfig()\n\n            # Initializing a model from the configuration\n            model = XLNetModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlnet\"\n\n    def __init__(\n        self,\n        vocab_size=32000,\n        d_model=1024,\n        n_layer=24,\n        n_head=16,\n        d_inner=4096,\n        ff_activation=\"gelu\",\n        untie_r=True,\n        attn_type=\"bi\",\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        dropout=0.1,\n        mem_len=None,\n        reuse_len=None,\n        bi_data=False,\n        clamp_len=-1,\n        same_length=False,\n        summary_type=\"last\",\n        summary_use_proj=True,\n        summary_activation=\"tanh\",\n        summary_last_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        pad_token_id=5,\n        bos_token_id=1,\n        eos_token_id=2,\n        **kwargs\n    ):\n        \"\"\"Constructs XLNetConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.d_model = d_model\n        self.n_layer = n_layer\n        self.n_head = n_head\n        assert d_model % n_head == 0\n        self.d_head = d_model // n_head\n        self.ff_activation = ff_activation\n        self.d_inner = d_inner\n        self.untie_r = untie_r\n        self.attn_type = attn_type\n\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n\n        self.dropout = dropout\n        self.mem_len = mem_len\n        self.reuse_len = reuse_len\n        self.bi_data = bi_data\n        self.clamp_len = clamp_len\n        self.same_length = same_length\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_last_dropout = summary_last_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n\n        self.bos_token_id = bos_token_id\n        self.pad_token_id = pad_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return -1\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # 
Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_albert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ALBERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = AlbertConfig.from_json_file(albert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = AlbertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_albert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--albert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained ALBERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.albert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_bart_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BART checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nfrom pathlib import Path\n\nimport fairseq\nimport torch\nfrom packaging import version\n\nfrom transformers import (\n    BartConfig,\n    BartForConditionalGeneration,\n    BartForSequenceClassification,\n    BartModel,\n    BartTokenizer,\n)\nfrom transformers.modeling_bart import _make_linear_from_emb\n\n\nFAIRSEQ_MODELS = [\"bart.large\", \"bart.large.mnli\", \"bart.large.cnn\", \"bart_xsum/model.pt\"]\nextra_arch = {\"bart.large\": BartModel, \"bart.large.mnli\": BartForSequenceClassification}\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \" Hello world! cécé herlolip\"\n\nmnli_rename_keys = [\n    (\"model.classification_heads.mnli.dense.weight\", \"classification_head.dense.weight\"),\n    (\"model.classification_heads.mnli.dense.bias\", \"classification_head.dense.bias\"),\n    (\"model.classification_heads.mnli.out_proj.weight\", \"classification_head.out_proj.weight\"),\n    (\"model.classification_heads.mnli.out_proj.bias\", \"classification_head.out_proj.bias\"),\n]\n\n\ndef remove_ignore_keys_(state_dict):\n    ignore_keys = [\n        \"encoder.version\",\n        \"decoder.version\",\n        \"model.encoder.version\",\n        \"model.decoder.version\",\n        \"_float_tensor\",\n    ]\n    for k in ignore_keys:\n        state_dict.pop(k, None)\n\n\ndef rename_key(dct, old, new):\n    val = dct.pop(old)\n    dct[new] = val\n\n\ndef load_xsum_checkpoint(checkpoint_path):\n    \"\"\"Checkpoint path should end in model.pt\"\"\"\n    sd = torch.load(checkpoint_path, map_location=\"cpu\")\n    hub_interface = torch.hub.load(\"pytorch/fairseq\", \"bart.large.cnn\").eval()\n    hub_interface.model.load_state_dict(sd[\"model\"])\n    return hub_interface\n\n\ndef convert_checkpoint_from_disk(checkpoint_path, **config_kwargs):\n    state_dict = torch.load(checkpoint_path, map_location=\"cpu\")[\"model\"]\n    remove_ignore_keys_(state_dict)\n    vocab_size = state_dict[\"encoder.embed_tokens.weight\"].shape[0]\n    state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n    mbart_config = BartConfig(vocab_size=vocab_size, **config_kwargs)\n    model = BartForConditionalGeneration(mbart_config)\n    model.model.load_state_dict(state_dict)\n    if hasattr(model, \"lm_head\"):\n        model.lm_head = _make_linear_from_emb(model.model.shared)\n    return model\n\n\n@torch.no_grad()\ndef convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):\n    \"\"\"\n    Copy/paste/tweak model's weights to our BERT structure.\n    \"\"\"\n    if not os.path.exists(checkpoint_path):\n        bart = torch.hub.load(\"pytorch/fairseq\", checkpoint_path).eval()\n 
   else:\n        bart = load_xsum_checkpoint(checkpoint_path)\n\n    bart.model.upgrade_state_dict(bart.model.state_dict())\n    if hf_checkpoint_name is None:\n        hf_checkpoint_name = checkpoint_path.replace(\".\", \"-\")\n    config = BartConfig.from_pretrained(hf_checkpoint_name)\n    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)\n    tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors=\"pt\").unsqueeze(0)\n    assert torch.eq(tokens, tokens2).all()\n\n    if checkpoint_path == \"bart.large.mnli\":\n        state_dict = bart.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"model.shared.weight\"] = state_dict[\"model.decoder.embed_tokens.weight\"]\n        for src, dest in mnli_rename_keys:\n            rename_key(state_dict, src, dest)\n        model = BartForSequenceClassification(config).eval()\n        model.load_state_dict(state_dict)\n        fairseq_output = bart.predict(\"mnli\", tokens, return_logits=True)\n        new_model_outputs = model(tokens)[0]  # logits\n    else:  # no classification heads to worry about\n        state_dict = bart.model.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n        fairseq_output = bart.extract_features(tokens)\n        if hf_checkpoint_name == \"facebook/bart-large\":\n            model = BartModel(config).eval()\n            model.load_state_dict(state_dict)\n            new_model_outputs = model(tokens).model[0]\n        else:\n            model = BartForConditionalGeneration(config).eval()  # an existing summarization ckpt\n            model.model.load_state_dict(state_dict)\n            if hasattr(model, \"lm_head\"):\n                model.lm_head = _make_linear_from_emb(model.model.shared)\n            new_model_outputs = model.model(tokens)[0]\n\n    # Check results\n    assert fairseq_output.shape == new_model_outputs.shape\n    assert (fairseq_output == new_model_outputs).all().item()\n    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"fairseq_path\", type=str, help=\"bart.large, bart.large.cnn or a path to a model.pt on local filesystem.\"\n    )\n    parser.add_argument(\"pytorch_dump_folder_path\", default=None, type=str, help=\"Path to the output PyTorch model.\")\n    parser.add_argument(\n        \"--hf_config\", default=None, type=str, help=\"Which huggingface architecture to use: bart-large-xsum\"\n    )\n    args = parser.parse_args()\n    convert_bart_checkpoint(args.fairseq_path, args.pytorch_dump_folder_path, hf_checkpoint_name=args.hf_config)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_bert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = BertConfig.from_json_file(bert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = BertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_bert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--bert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.bert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_bert_pytorch_checkpoint_to_original_tf.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\"\"\"Convert Huggingface Pytorch checkpoint to Tensorflow checkpoint.\"\"\"\n\nimport argparse\nimport os\n\nimport numpy as np\nimport tensorflow as tf\nimport torch\n\nfrom transformers import BertModel\n\n\ndef convert_pytorch_checkpoint_to_tf(model: BertModel, ckpt_dir: str, model_name: str):\n\n    \"\"\"\n    :param model:BertModel Pytorch model instance to be converted\n    :param ckpt_dir: Tensorflow model directory\n    :param model_name: model name\n    :return:\n\n    Currently supported HF models:\n        Y BertModel\n        N BertForMaskedLM\n        N BertForPreTraining\n        N BertForMultipleChoice\n        N BertForNextSentencePrediction\n        N BertForSequenceClassification\n        N BertForQuestionAnswering\n    \"\"\"\n\n    tensors_to_transpose = (\"dense.weight\", \"attention.self.query\", \"attention.self.key\", \"attention.self.value\")\n\n    var_map = (\n        (\"layer.\", \"layer_\"),\n        (\"word_embeddings.weight\", \"word_embeddings\"),\n        (\"position_embeddings.weight\", \"position_embeddings\"),\n        (\"token_type_embeddings.weight\", \"token_type_embeddings\"),\n        (\".\", \"/\"),\n        (\"LayerNorm/weight\", \"LayerNorm/gamma\"),\n        (\"LayerNorm/bias\", \"LayerNorm/beta\"),\n        (\"weight\", \"kernel\"),\n    )\n\n    if not os.path.isdir(ckpt_dir):\n        os.makedirs(ckpt_dir)\n\n    state_dict = model.state_dict()\n\n    def to_tf_var_name(name: str):\n        for patt, repl in iter(var_map):\n            name = name.replace(patt, repl)\n        return \"bert/{}\".format(name)\n\n    def create_tf_var(tensor: np.ndarray, name: str, session: tf.Session):\n        tf_dtype = tf.dtypes.as_dtype(tensor.dtype)\n        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())\n        session.run(tf.variables_initializer([tf_var]))\n        session.run(tf_var)\n        return tf_var\n\n    tf.reset_default_graph()\n    with tf.Session() as session:\n        for var_name in state_dict:\n            tf_name = to_tf_var_name(var_name)\n            torch_tensor = state_dict[var_name].numpy()\n            if any([x in var_name for x in tensors_to_transpose]):\n                torch_tensor = torch_tensor.T\n            tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)\n            tf.keras.backend.set_value(tf_var, torch_tensor)\n            tf_weight = session.run(tf_var)\n            print(\"Successfully created {}: {}\".format(tf_name, np.allclose(tf_weight, torch_tensor)))\n\n        saver = tf.train.Saver(tf.trainable_variables())\n        saver.save(session, os.path.join(ckpt_dir, model_name.replace(\"-\", \"_\") + \".ckpt\"))\n\n\ndef main(raw_args=None):\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--model_name\", type=str, required=True, help=\"model name e.g. 
bert-base-uncased\")\n    parser.add_argument(\n        \"--cache_dir\", type=str, default=None, required=False, help=\"Directory containing pytorch model\"\n    )\n    parser.add_argument(\"--pytorch_model_path\", type=str, required=True, help=\"/path/to/<pytorch-model-name>.bin\")\n    parser.add_argument(\"--tf_cache_dir\", type=str, required=True, help=\"Directory in which to save tensorflow model\")\n    args = parser.parse_args(raw_args)\n\n    model = BertModel.from_pretrained(\n        pretrained_model_name_or_path=args.model_name,\n        state_dict=torch.load(args.pytorch_model_path),\n        cache_dir=args.cache_dir,\n    )\n\n    convert_pytorch_checkpoint_to_tf(model=model, ckpt_dir=args.tf_cache_dir, model_name=args.model_name)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_dialogpt_original_pytorch_checkpoint_to_pytorch.py",
    "content": "import argparse\nimport os\n\nimport torch\n\nfrom transformers.file_utils import WEIGHTS_NAME\n\n\nDIALOGPT_MODELS = [\"small\", \"medium\", \"large\"]\n\nOLD_KEY = \"lm_head.decoder.weight\"\nNEW_KEY = \"lm_head.weight\"\n\n\ndef convert_dialogpt_checkpoint(checkpoint_path: str, pytorch_dump_folder_path: str):\n    d = torch.load(checkpoint_path)\n    d[NEW_KEY] = d.pop(OLD_KEY)\n    os.makedirs(pytorch_dump_folder_path, exist_ok=True)\n    torch.save(d, os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--dialogpt_path\", default=\".\", type=str)\n    args = parser.parse_args()\n    for MODEL in DIALOGPT_MODELS:\n        checkpoint_path = os.path.join(args.dialogpt_path, f\"{MODEL}_ft.pkl\")\n        pytorch_dump_folder_path = f\"./DialoGPT-{MODEL}\"\n        convert_dialogpt_checkpoint(\n            checkpoint_path, pytorch_dump_folder_path,\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_electra_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ELECTRA checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining, load_tf_weights_in_electra\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path, discriminator_or_generator):\n    # Initialise PyTorch model\n    config = ElectraConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n\n    if discriminator_or_generator == \"discriminator\":\n        model = ElectraForPreTraining(config)\n    elif discriminator_or_generator == \"generator\":\n        model = ElectraForMaskedLM(config)\n    else:\n        raise ValueError(\"The discriminator_or_generator argument should be either 'discriminator' or 'generator'\")\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_electra(\n        model, config, tf_checkpoint_path, discriminator_or_generator=discriminator_or_generator\n    )\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--discriminator_or_generator\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Whether to export the generator or the discriminator. Should be a string, either 'discriminator' or \"\n        \"'generator'.\",\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path, args.discriminator_or_generator\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_gpt2_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, GPT2Config, GPT2Model, load_tf_weights_in_gpt2\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if gpt2_config_file == \"\":\n        config = GPT2Config()\n    else:\n        config = GPT2Config.from_json_file(gpt2_config_file)\n    model = GPT2Model(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--gpt2_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--gpt2_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_gpt2_checkpoint_to_pytorch(args.gpt2_checkpoint_path, args.gpt2_config_file, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_graph_to_onnx.py",
    "content": "from argparse import ArgumentParser\nfrom os import listdir, makedirs\nfrom os.path import abspath, dirname, exists\nfrom typing import Dict, List, Optional, Tuple\n\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.pipelines import Pipeline, pipeline\nfrom transformers.tokenization_utils import BatchEncoding\n\n\nclass OnnxConverterArgumentParser(ArgumentParser):\n    \"\"\"\n    Wraps all the script arguments supported to export transformers1 models to ONNX IR\n    \"\"\"\n\n    def __init__(self):\n        super(OnnxConverterArgumentParser, self).__init__(\"ONNX Converter\")\n\n        self.add_argument(\"--model\", type=str, required=True, help=\"Model's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--framework\", type=str, choices=[\"pt\", \"tf\"], help=\"Framework for loading the model\")\n        self.add_argument(\"--opset\", type=int, default=11, help=\"ONNX opset to use\")\n        self.add_argument(\"--check-loading\", action=\"store_true\", help=\"Check ONNX is able to load the model\")\n        self.add_argument(\"--use-external-format\", action=\"store_true\", help=\"Allow exporting model >= than 2Gb\")\n        self.add_argument(\"output\")\n\n\ndef ensure_valid_input(model, tokens, input_names):\n    \"\"\"\n    Ensure input are presented in the correct order, without any None\n    Args:\n        model: The model used to forward the input data\n        tokens: BatchEncoding holding the input data\n        input_names: The name of the inputs\n\n    Returns: Tuple\n\n    \"\"\"\n    model_args_name = model.forward.__code__.co_varnames\n\n    ordered_input_names = []\n    model_args = []\n    for arg_name in model_args_name[1:]:  # start at index 1 to skip \"self\" argument\n        if arg_name in input_names:\n            ordered_input_names.append(arg_name)\n            model_args.append(tokens[arg_name])\n        else:\n            break\n\n    return ordered_input_names, tuple(model_args)\n\n\ndef infer_shapes(nlp: Pipeline, framework: str) -> Tuple[List[str], List[str], Dict, BatchEncoding]:\n    def build_shape_dict(tensor, is_input: bool, seq_len: int):\n        if isinstance(tensor, (tuple, list)):\n            return [build_shape_dict(t, is_input, seq_len) for t in tensor]\n\n        else:\n            # Let's assume batch is the first axis with only 1 element (~~ might not be always true ...)\n            axes = {[axis for axis, numel in enumerate(tensor.shape) if numel == 1][0]: \"batch\"}\n            if is_input:\n                if len(tensor.shape) == 2:\n                    axes[1] = \"sequence\"\n                else:\n                    raise ValueError(\"Unable to infer tensor axes ({})\".format(len(tensor.shape)))\n            else:\n                seq_axes = [dim for dim, shape in enumerate(tensor.shape) if shape == seq_len]\n                axes.update({dim: \"sequence\" for dim in seq_axes})\n\n        return axes\n\n    tokens = nlp.tokenizer.encode_plus(\"This is a sample output\", return_tensors=framework)\n    seq_len = tokens.input_ids.shape[-1]\n    outputs = nlp.model(**tokens) if framework == \"pt\" else nlp.model(tokens)\n\n    if not isinstance(outputs, (list, tuple)):\n        outputs = (outputs,)\n\n    # Generate input names & axes\n    input_vars = list(tokens.keys())\n    input_dynamic_axes = {k: build_shape_dict(v, True, seq_len) for k, v in tokens.items()}\n\n    
# flatten potentially grouped outputs (past for gpt2, attentions)\n    outputs_flat = []\n    for output in outputs:\n        if isinstance(output, (tuple, list)):\n            outputs_flat.extend(output)\n        else:\n            outputs_flat.append(output)\n\n    # Generate output names & axes\n    output_names = [\"output_{}\".format(i) for i in range(len(outputs_flat))]\n    output_dynamic_axes = {k: build_shape_dict(v, False, seq_len) for k, v in zip(output_names, outputs_flat)}\n\n    # Create the aggregated axes representation\n    dynamic_axes = dict(input_dynamic_axes, **output_dynamic_axes)\n    return input_vars, output_names, dynamic_axes, tokens\n\n\ndef load_graph_from_args(framework: str, model: str, tokenizer: Optional[str] = None) -> Pipeline:\n    # If no tokenizer provided\n    if tokenizer is None:\n        tokenizer = model\n\n    print(\"Loading pipeline (model: {}, tokenizer: {})\".format(model, tokenizer))\n\n    # Allocate tokenizer and model\n    return pipeline(\"feature-extraction\", model=model, tokenizer=tokenizer, framework=framework)\n\n\ndef convert_pytorch(nlp: Pipeline, opset: int, output: str, use_external_format: bool):\n    if not is_torch_available():\n        raise Exception(\"Cannot convert because PyTorch is not installed. Please install torch first.\")\n\n    import torch\n    from torch.onnx import export\n\n    print(\"PyTorch: {}\".format(torch.__version__))\n\n    with torch.no_grad():\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"pt\")\n        ordered_input_names, model_args = ensure_valid_input(nlp.model, tokens, input_names)\n\n        export(\n            nlp.model,\n            model_args,\n            f=output,\n            input_names=ordered_input_names,\n            output_names=output_names,\n            dynamic_axes=dynamic_axes,\n            do_constant_folding=True,\n            use_external_data_format=use_external_format,\n            enable_onnx_checker=True,\n            opset_version=opset,\n        )\n\n\ndef convert_tensorflow(nlp: Pipeline, opset: int, output: str):\n    if not is_tf_available():\n        raise Exception(\n            \"Cannot convert {} because TF is not installed. Please install torch first.\".format(args.model)\n        )\n\n    print(\"/!\\\\ Please note TensorFlow doesn't support exporting model > 2Gb /!\\\\\")\n\n    try:\n        import tensorflow as tf\n        from keras2onnx import convert_keras, save_model, __version__ as k2ov\n\n        print(\"TensorFlow: {}, keras2onnx: {}\".format(tf.version.VERSION, k2ov))\n\n        # Build\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"tf\")\n\n        # Forward\n        nlp.model.predict(tokens.data)\n        onnx_model = convert_keras(nlp.model, nlp.model.name, target_opset=opset)\n        save_model(onnx_model, output)\n\n    except ImportError as e:\n        raise Exception(\n            \"Cannot import {} required to convert TF model to ONNX. 
Please install {} first.\".format(e.name, e.name)\n        )\n\n\ndef convert(\n    framework: str,\n    model: str,\n    output: str,\n    opset: int,\n    tokenizer: Optional[str] = None,\n    use_external_format: bool = False,\n):\n    print(\"ONNX opset version set to: {}\".format(opset))\n\n    # Load the pipeline\n    nlp = load_graph_from_args(framework, model, tokenizer)\n\n    parent = dirname(output)\n    if not exists(parent):\n        print(\"Creating folder {}\".format(parent))\n        makedirs(parent)\n    elif len(listdir(parent)) > 0:\n        raise Exception(\"Folder {} is not empty, aborting conversion\".format(parent))\n\n    # Export the graph\n    if framework == \"pt\":\n        convert_pytorch(nlp, opset, output, use_external_format)\n    else:\n        convert_tensorflow(nlp, opset, output)\n\n\ndef verify(path: str):\n    from onnxruntime import InferenceSession, SessionOptions\n    from onnxruntime.capi.onnxruntime_pybind11_state import RuntimeException\n\n    print(\"Checking ONNX model loading from: {}\".format(path))\n    try:\n        onnx_options = SessionOptions()\n        _ = InferenceSession(path, onnx_options, providers=[\"CPUExecutionProvider\"])\n        print(\"Model correctly loaded\")\n    except RuntimeException as re:\n        print(\"Error while loading the model: {}\".format(re))\n\n\nif __name__ == \"__main__\":\n    parser = OnnxConverterArgumentParser()\n    args = parser.parse_args()\n\n    # Make sure output is absolute path\n    args.output = abspath(args.output)\n\n    try:\n        # Convert\n        convert(args.framework, args.model, args.output, args.opset, args.tokenizer, args.use_external_format)\n\n        # And verify\n        if args.check_loading:\n            verify(args.output)\n    except Exception as e:\n        print(\"Error while converting the model: {}\".format(e))\n        exit(1)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_longformer_original_pytorch_lightning_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\n\nimport pytorch_lightning as pl\nimport torch\n\nfrom transformers.modeling_longformer import LongformerForQuestionAnswering, LongformerModel\n\n\nclass LightningModel(pl.LightningModule):\n    def __init__(self, model):\n        super().__init__()\n        self.model = model\n        self.num_labels = 2\n        self.qa_outputs = torch.nn.Linear(self.model.config.hidden_size, self.num_labels)\n\n    # implement only because lighning requires to do so\n    def forward(self):\n        pass\n\n\ndef convert_longformer_qa_checkpoint_to_pytorch(\n    longformer_model: str, longformer_question_answering_ckpt_path: str, pytorch_dump_folder_path: str\n):\n\n    # load longformer model from model identifier\n    longformer = LongformerModel.from_pretrained(longformer_model)\n    lightning_model = LightningModel(longformer)\n\n    ckpt = torch.load(longformer_question_answering_ckpt_path, map_location=torch.device(\"cpu\"))\n    lightning_model.load_state_dict(ckpt[\"state_dict\"])\n\n    # init longformer question answering model\n    longformer_for_qa = LongformerForQuestionAnswering.from_pretrained(longformer_model)\n\n    # transfer weights\n    longformer_for_qa.longformer.load_state_dict(lightning_model.model.state_dict())\n    longformer_for_qa.qa_outputs.load_state_dict(lightning_model.qa_outputs.state_dict())\n    longformer_for_qa.eval()\n\n    # save model\n    longformer_for_qa.save_pretrained(pytorch_dump_folder_path)\n\n    print(\"Conversion succesful. Model saved under {}\".format(pytorch_dump_folder_path))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--longformer_model\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"model identifier of longformer. Should be either `longformer-base-4096` or `longformer-large-4096`.\",\n    )\n    parser.add_argument(\n        \"--longformer_question_answering_ckpt_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path the official PyTorch Lighning Checkpoint.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_longformer_qa_checkpoint_to_pytorch(\n        args.longformer_model, args.longformer_question_answering_ckpt_path, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_marian_to_pytorch.py",
    "content": "import argparse\nimport json\nimport os\nimport shutil\nimport warnings\nfrom pathlib import Path\nfrom typing import Dict, List, Union\nfrom zipfile import ZipFile\n\nimport numpy as np\nimport torch\nfrom tqdm import tqdm\n\nfrom transformers import MarianConfig, MarianMTModel, MarianTokenizer\nfrom transformers.hf_api import HfApi\n\n\ndef remove_prefix(text: str, prefix: str):\n    if text.startswith(prefix):\n        return text[len(prefix) :]\n    return text  # or whatever\n\n\ndef convert_encoder_layer(opus_dict, layer_prefix: str, converter: dict):\n    sd = {}\n    for k in opus_dict:\n        if not k.startswith(layer_prefix):\n            continue\n        stripped = remove_prefix(k, layer_prefix)\n        v = opus_dict[k].T  # besides embeddings, everything must be transposed.\n        sd[converter[stripped]] = torch.tensor(v).squeeze()\n    return sd\n\n\ndef load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is_decoder=False):\n    for i, layer in enumerate(layer_lst):\n        layer_tag = f\"decoder_l{i + 1}_\" if is_decoder else f\"encoder_l{i + 1}_\"\n        sd = convert_encoder_layer(opus_state, layer_tag, converter)\n        layer.load_state_dict(sd, strict=True)\n\n\ndef find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:\n    \"\"\"Find models that can accept src_lang as input and return tgt_lang as output.\"\"\"\n    prefix = \"Helsinki-NLP/opus-mt-\"\n    api = HfApi()\n    model_list = api.model_list()\n    model_ids = [x.modelId for x in model_list if x.modelId.startswith(\"Helsinki-NLP\")]\n    src_and_targ = [\n        remove_prefix(m, prefix).lower().split(\"-\") for m in model_ids if \"+\" not in m\n    ]  # + cant be loaded.\n    matching = [f\"{prefix}{a}-{b}\" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]\n    return matching\n\n\ndef add_emb_entries(wemb, final_bias, n_special_tokens=1):\n    vsize, d_model = wemb.shape\n    embs_to_add = np.zeros((n_special_tokens, d_model))\n    new_embs = np.concatenate([wemb, embs_to_add])\n    bias_to_add = np.zeros((n_special_tokens, 1))\n    new_bias = np.concatenate((final_bias, bias_to_add), axis=1)\n    return new_embs, new_bias\n\n\ndef _cast_yaml_str(v):\n    bool_dct = {\"true\": True, \"false\": False}\n    if not isinstance(v, str):\n        return v\n    elif v in bool_dct:\n        return bool_dct[v]\n    try:\n        return int(v)\n    except (TypeError, ValueError):\n        return v\n\n\ndef cast_marian_config(raw_cfg: Dict[str, str]) -> Dict:\n    return {k: _cast_yaml_str(v) for k, v in raw_cfg.items()}\n\n\nCONFIG_KEY = \"special:model.yml\"\n\n\ndef load_config_from_state_dict(opus_dict):\n    import yaml\n\n    cfg_str = \"\".join([chr(x) for x in opus_dict[CONFIG_KEY]])\n    yaml_cfg = yaml.load(cfg_str[:-1], Loader=yaml.BaseLoader)\n    return cast_marian_config(yaml_cfg)\n\n\ndef find_model_file(dest_dir):  # this one better\n    model_files = list(Path(dest_dir).glob(\"*.npz\"))\n    assert len(model_files) == 1, model_files\n    model_file = model_files[0]\n    return model_file\n\n\n# Group Names Logic: change long opus model names to something shorter, like opus-mt-en-ROMANCE\nROM_GROUP = \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\"\nGROUPS = [\n    (\"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\", \"ZH\"),\n    
(ROM_GROUP, \"ROMANCE\"),\n    (\"de+nl+fy+af+da+fo+is+no+nb+nn+sv\", \"NORTH_EU\"),\n    (\"da+fo+is+no+nb+nn+sv\", \"SCANDINAVIA\"),\n    (\"se+sma+smj+smn+sms\", \"SAMI\"),\n    (\"nb_NO+nb+nn_NO+nn+nog+no_nb+no\", \"NORWAY\"),\n    (\"ga+cy+br+gd+kw+gv\", \"CELTIC\"),  # https://en.wikipedia.org/wiki/Insular_Celtic_languages\n]\nGROUP_TO_OPUS_NAME = {\n    \"opus-mt-ZH-de\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de\",\n    \"opus-mt-ZH-fi\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-fi\",\n    \"opus-mt-ZH-sv\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-sv\",\n    \"opus-mt-SCANDINAVIA-SCANDINAVIA\": \"da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-NORTH_EU-NORTH_EU\": \"de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-de-ZH\": \"de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-en_el_es_fi-en_el_es_fi\": \"en+el+es+fi-en+el+es+fi\",\n    \"opus-mt-en-ROMANCE\": \"en-fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\",\n    \"opus-mt-en-CELTIC\": \"en-ga+cy+br+gd+kw+gv\",\n    \"opus-mt-es-NORWAY\": \"es-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-fi_nb_no_nn_ru_sv_en-SAMI\": \"fi+nb+no+nn+ru+sv+en-se+sma+smj+smn+sms\",\n    \"opus-mt-fi-ZH\": \"fi-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-fi-NORWAY\": \"fi-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-ROMANCE-en\": \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la-en\",\n    \"opus-mt-CELTIC-en\": \"ga+cy+br+gd+kw+gv-en\",\n    \"opus-mt-sv-ZH\": \"sv-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-sv-NORWAY\": \"sv-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n}\nOPUS_GITHUB_URL = \"https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/\"\nORG_NAME = \"Helsinki-NLP/\"\n\n\ndef convert_opus_name_to_hf_name(x):\n    for substr, grp_name in GROUPS:\n        x = x.replace(substr, grp_name)\n    return x.replace(\"+\", \"_\")\n\n\ndef convert_hf_name_to_opus_name(hf_model_name):\n    \"\"\"Relies on the assumption that there are no language codes like pt_br in models that are not in GROUP_TO_OPUS_NAME.\"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    if hf_model_name in GROUP_TO_OPUS_NAME:\n        opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]\n    else:\n        opus_w_prefix = hf_model_name.replace(\"_\", \"+\")\n    return remove_prefix(opus_w_prefix, \"opus-mt-\")\n\n\ndef write_model_card(\n    hf_model_name: str,\n    repo_path=\"OPUS-MT-train/models/\",\n    dry_run=False,\n    model_card_dir=Path(\"marian_converted/model_cards/Helsinki-NLP/\"),\n) -> str:\n    \"\"\"Copy the most recent model's readme section from opus, and add metadata.\n    upload command: s3cmd sync --recursive model_card_dir s3://models.huggingface.co/bert/Helsinki-NLP/\n    \"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    opus_name: str = convert_hf_name_to_opus_name(hf_model_name)\n    opus_src, opus_tgt = [x.split(\"+\") for x in opus_name.split(\"-\")]\n    readme_url = 
OPUS_GITHUB_URL + f\"{opus_name}/README.md\"\n    s, t = \",\".join(opus_src), \",\".join(opus_tgt)\n    extra_markdown = f\"### {hf_model_name}\\n\\n* source languages: {s}\\n* target languages: {t}\\n*  OPUS readme: [{opus_name}]({readme_url})\\n\"\n    # combine with opus markdown\n    opus_readme_path = Path(f\"{repo_path}{opus_name}/README.md\")\n    assert opus_readme_path.exists(), opus_readme_path\n    content = opus_readme_path.open().read()\n    content = content.split(\"\\n# \")[-1]  # Get the lowest level 1 header in the README -- the most recent model.\n    content = \"*\".join(content.split(\"*\")[1:])\n    content = extra_markdown + \"\\n* \" + content.replace(\"download\", \"download original weights\")\n    if dry_run:\n        return content\n    # Save string to model_cards/hf_model_name/readme.md\n    model_card_dir.mkdir(exist_ok=True)\n    sub_dir = model_card_dir / hf_model_name\n    sub_dir.mkdir(exist_ok=True)\n    dest = sub_dir / \"README.md\"\n    dest.open(\"w\").write(content)\n    return content\n\n\ndef get_clean_model_id_mapping(multiling_model_ids):\n    return {x: convert_opus_name_to_hf_name(x) for x in multiling_model_ids}\n\n\ndef make_registry(repo_path=\"Opus-MT-train/models\"):\n    if not (Path(repo_path) / \"fr-en\" / \"README.md\").exists():\n        raise ValueError(\n            f\"repo_path:{repo_path} does not exist: \"\n            \"You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling.\"\n        )\n    results = {}\n    for p in Path(repo_path).ls():\n        n_dash = p.name.count(\"-\")\n        if n_dash == 0:\n            continue\n        else:\n            lns = list(open(p / \"README.md\").readlines())\n            results[p.name] = _parse_readme(lns)\n    return [(k, v[\"pre-processing\"], v[\"download\"], v[\"download\"][:-4] + \".test.txt\") for k, v in results.items()]\n\n\ndef convert_all_sentencepiece_models(model_list=None, repo_path=None):\n    \"\"\"Requires 300GB\"\"\"\n    save_dir = Path(\"marian_ckpt\")\n    dest_dir = Path(\"marian_converted\")\n    dest_dir.mkdir(exist_ok=True)\n    if model_list is None:\n        model_list: list = make_registry(repo_path=repo_path)\n    for k, prepro, download, test_set_url in tqdm(model_list):\n        if \"SentencePiece\" not in prepro:  # dont convert BPE models.\n            continue\n        if not os.path.exists(save_dir / k / \"pytorch_model.bin\"):\n            download_and_unzip(download, save_dir / k)\n        pair_name = convert_opus_name_to_hf_name(k)\n        convert(save_dir / k, dest_dir / f\"opus-mt-{pair_name}\")\n\n\ndef lmap(f, x) -> List:\n    return list(map(f, x))\n\n\ndef fetch_test_set(test_set_url):\n    import wget\n\n    fname = wget.download(test_set_url, \"opus_test.txt\")\n    lns = Path(fname).open().readlines()\n    src = lmap(str.strip, lns[::4])\n    gold = lmap(str.strip, lns[1::4])\n    mar_model = lmap(str.strip, lns[2::4])\n    assert len(gold) == len(mar_model) == len(src)\n    os.remove(fname)\n    return src, mar_model, gold\n\n\ndef convert_whole_dir(path=Path(\"marian_ckpt/\")):\n    for subdir in tqdm(list(path.ls())):\n        dest_dir = f\"marian_converted/{subdir.name}\"\n        if (dest_dir / \"pytorch_model.bin\").exists():\n            continue\n        convert(source_dir, dest_dir)\n\n\ndef _parse_readme(lns):\n    \"\"\"Get link and metadata from opus model card equivalent.\"\"\"\n    subres = {}\n    for ln in [x.strip() for x in lns]:\n        if not ln.startswith(\"*\"):\n            continue\n    
    ln = ln[1:].strip()\n\n        for k in [\"download\", \"dataset\", \"models\", \"model\", \"pre-processing\"]:\n            if ln.startswith(k):\n                break\n        else:\n            continue\n        if k in [\"dataset\", \"model\", \"pre-processing\"]:\n            splat = ln.split(\":\")\n            _, v = splat\n            subres[k] = v\n        elif k == \"download\":\n            v = ln.split(\"(\")[-1][:-1]\n            subres[k] = v\n    return subres\n\n\ndef save_tokenizer_config(dest_dir: Path):\n    dname = dest_dir.name.split(\"-\")\n    dct = dict(target_lang=dname[-1], source_lang=\"-\".join(dname[:-1]))\n    save_json(dct, dest_dir / \"tokenizer_config.json\")\n\n\ndef add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):\n    start = max(vocab.values()) + 1\n    added = 0\n    for tok in special_tokens:\n        if tok in vocab:\n            continue\n        vocab[tok] = start + added\n        added += 1\n    return added\n\n\ndef find_vocab_file(model_dir):\n    return list(model_dir.glob(\"*vocab.yml\"))[0]\n\n\ndef add_special_tokens_to_vocab(model_dir: Path) -> None:\n    vocab = load_yaml(find_vocab_file(model_dir))\n    vocab = {k: int(v) for k, v in vocab.items()}\n    num_added = add_to_vocab_(vocab, [\"<pad>\"])\n    print(f\"added {num_added} tokens to vocab\")\n    save_json(vocab, model_dir / \"vocab.json\")\n    save_tokenizer_config(model_dir)\n\n\ndef save_tokenizer(self, save_directory):\n    dest = Path(save_directory)\n    src_path = Path(self.init_kwargs[\"source_spm\"])\n\n    for dest_name in {\"source.spm\", \"target.spm\", \"tokenizer_config.json\"}:\n        shutil.copyfile(src_path.parent / dest_name, dest / dest_name)\n    save_json(self.encoder, dest / \"vocab.json\")\n\n\ndef check_equal(marian_cfg, k1, k2):\n    v1, v2 = marian_cfg[k1], marian_cfg[k2]\n    assert v1 == v2, f\"hparams {k1},{k2} differ: {v1} != {v2}\"\n\n\ndef check_marian_cfg_assumptions(marian_cfg):\n    assumed_settings = {\n        \"tied-embeddings-all\": True,\n        \"layer-normalization\": False,\n        \"right-left\": False,\n        \"transformer-ffn-depth\": 2,\n        \"transformer-aan-depth\": 2,\n        \"transformer-no-projection\": False,\n        \"transformer-postprocess-emb\": \"d\",\n        \"transformer-postprocess\": \"dan\",  # Dropout, add, normalize\n        \"transformer-preprocess\": \"\",\n        \"type\": \"transformer\",\n        \"ulr-dim-emb\": 0,\n        \"dec-cell-base-depth\": 2,\n        \"dec-cell-high-depth\": 1,\n        \"transformer-aan-nogate\": False,\n    }\n    for k, v in assumed_settings.items():\n        actual = marian_cfg[k]\n        assert actual == v, f\"Unexpected config value for {k} expected {v} got {actual}\"\n    check_equal(marian_cfg, \"transformer-ffn-activation\", \"transformer-aan-activation\")\n    check_equal(marian_cfg, \"transformer-ffn-depth\", \"transformer-aan-depth\")\n    check_equal(marian_cfg, \"transformer-dim-ffn\", \"transformer-dim-aan\")\n\n\nBIAS_KEY = \"decoder_ff_logit_out_b\"\nBART_CONVERTER = {  # for each encoder and decoder layer\n    \"self_Wq\": \"self_attn.q_proj.weight\",\n    \"self_Wk\": \"self_attn.k_proj.weight\",\n    \"self_Wv\": \"self_attn.v_proj.weight\",\n    \"self_Wo\": \"self_attn.out_proj.weight\",\n    \"self_bq\": \"self_attn.q_proj.bias\",\n    \"self_bk\": \"self_attn.k_proj.bias\",\n    \"self_bv\": \"self_attn.v_proj.bias\",\n    \"self_bo\": \"self_attn.out_proj.bias\",\n    \"self_Wo_ln_scale\": \"self_attn_layer_norm.weight\",\n  
  \"self_Wo_ln_bias\": \"self_attn_layer_norm.bias\",\n    \"ffn_W1\": \"fc1.weight\",\n    \"ffn_b1\": \"fc1.bias\",\n    \"ffn_W2\": \"fc2.weight\",\n    \"ffn_b2\": \"fc2.bias\",\n    \"ffn_ffn_ln_scale\": \"final_layer_norm.weight\",\n    \"ffn_ffn_ln_bias\": \"final_layer_norm.bias\",\n    # Decoder Cross Attention\n    \"context_Wk\": \"encoder_attn.k_proj.weight\",\n    \"context_Wo\": \"encoder_attn.out_proj.weight\",\n    \"context_Wq\": \"encoder_attn.q_proj.weight\",\n    \"context_Wv\": \"encoder_attn.v_proj.weight\",\n    \"context_bk\": \"encoder_attn.k_proj.bias\",\n    \"context_bo\": \"encoder_attn.out_proj.bias\",\n    \"context_bq\": \"encoder_attn.q_proj.bias\",\n    \"context_bv\": \"encoder_attn.v_proj.bias\",\n    \"context_Wo_ln_scale\": \"encoder_attn_layer_norm.weight\",\n    \"context_Wo_ln_bias\": \"encoder_attn_layer_norm.bias\",\n}\n\n\nclass OpusState:\n    def __init__(self, source_dir):\n        npz_path = find_model_file(source_dir)\n        self.state_dict = np.load(npz_path)\n        cfg = load_config_from_state_dict(self.state_dict)\n        assert cfg[\"dim-vocabs\"][0] == cfg[\"dim-vocabs\"][1]\n        assert \"Wpos\" not in self.state_dict\n        self.state_dict = dict(self.state_dict)\n        self.wemb, self.final_bias = add_emb_entries(self.state_dict[\"Wemb\"], self.state_dict[BIAS_KEY], 1)\n        self.pad_token_id = self.wemb.shape[0] - 1\n        cfg[\"vocab_size\"] = self.pad_token_id + 1\n        # self.state_dict['Wemb'].sha\n        self.state_keys = list(self.state_dict.keys())\n        if \"Wtype\" in self.state_dict:\n            raise ValueError(\"found Wtype key\")\n        self._check_layer_entries()\n        self.source_dir = source_dir\n        self.cfg = cfg\n        hidden_size, intermediate_shape = self.state_dict[\"encoder_l1_ffn_W1\"].shape\n        assert hidden_size == cfg[\"dim-emb\"] == 512\n\n        # Process decoder.yml\n        decoder_yml = cast_marian_config(load_yaml(source_dir / \"decoder.yml\"))\n        check_marian_cfg_assumptions(cfg)\n        self.hf_config = MarianConfig(\n            vocab_size=cfg[\"vocab_size\"],\n            decoder_layers=cfg[\"dec-depth\"],\n            encoder_layers=cfg[\"enc-depth\"],\n            decoder_attention_heads=cfg[\"transformer-heads\"],\n            encoder_attention_heads=cfg[\"transformer-heads\"],\n            decoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            encoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            d_model=cfg[\"dim-emb\"],\n            activation_function=cfg[\"transformer-aan-activation\"],\n            pad_token_id=self.pad_token_id,\n            eos_token_id=0,\n            bos_token_id=0,\n            max_position_embeddings=cfg[\"dim-emb\"],\n            scale_embedding=True,\n            normalize_embedding=\"n\" in cfg[\"transformer-preprocess\"],\n            static_position_embeddings=not cfg[\"transformer-train-position-embeddings\"],\n            dropout=0.1,  # see opus-mt-train repo/transformer-dropout param.\n            # default: add_final_layer_norm=False,\n            num_beams=decoder_yml[\"beam-size\"],\n            decoder_start_token_id=self.pad_token_id,\n            bad_words_ids=[[self.pad_token_id]],\n            max_length=512,\n        )\n\n    def _check_layer_entries(self):\n        self.encoder_l1 = self.sub_keys(\"encoder_l1\")\n        self.decoder_l1 = self.sub_keys(\"decoder_l1\")\n        self.decoder_l2 = self.sub_keys(\"decoder_l2\")\n        if len(self.encoder_l1) != 16:\n            
warnings.warn(f\"Expected 16 keys for each encoder layer, got {len(self.encoder_l1)}\")\n        if len(self.decoder_l1) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n        if len(self.decoder_l2) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n\n    @property\n    def extra_keys(self):\n        extra = []\n        for k in self.state_keys:\n            if (\n                k.startswith(\"encoder_l\")\n                or k.startswith(\"decoder_l\")\n                or k in [CONFIG_KEY, \"Wemb\", \"Wpos\", \"decoder_ff_logit_out_b\"]\n            ):\n                continue\n            else:\n                extra.append(k)\n        return extra\n\n    def sub_keys(self, layer_prefix):\n        return [remove_prefix(k, layer_prefix) for k in self.state_dict if k.startswith(layer_prefix)]\n\n    def load_marian_model(self) -> MarianMTModel:\n        state_dict, cfg = self.state_dict, self.hf_config\n\n        assert cfg.static_position_embeddings\n        model = MarianMTModel(cfg)\n\n        assert \"hidden_size\" not in cfg.to_dict()\n        load_layers_(\n            model.model.encoder.layers, state_dict, BART_CONVERTER,\n        )\n        load_layers_(model.model.decoder.layers, state_dict, BART_CONVERTER, is_decoder=True)\n\n        # handle tensors not associated with layers\n        wemb_tensor = torch.nn.Parameter(torch.FloatTensor(self.wemb))\n        bias_tensor = torch.nn.Parameter(torch.FloatTensor(self.final_bias))\n        model.model.shared.weight = wemb_tensor\n        model.model.encoder.embed_tokens = model.model.decoder.embed_tokens = model.model.shared\n\n        model.final_logits_bias = bias_tensor\n\n        if \"Wpos\" in state_dict:\n            print(\"Unexpected: got Wpos\")\n            wpos_tensor = torch.tensor(state_dict[\"Wpos\"])\n            model.model.encoder.embed_positions.weight = wpos_tensor\n            model.model.decoder.embed_positions.weight = wpos_tensor\n\n        if cfg.normalize_embedding:\n            assert \"encoder_emb_ln_scale_pre\" in state_dict\n            raise NotImplementedError(\"Need to convert layernorm_embedding\")\n\n        assert not self.extra_keys, f\"Failed to convert {self.extra_keys}\"\n        assert model.model.shared.padding_idx == self.pad_token_id\n        return model\n\n\ndef download_and_unzip(url, dest_dir):\n    try:\n        import wget\n    except ImportError:\n        raise ImportError(\"you must pip install wget\")\n\n    filename = wget.download(url)\n    unzip(filename, dest_dir)\n    os.remove(filename)\n\n\ndef convert(source_dir: Path, dest_dir):\n    dest_dir = Path(dest_dir)\n    dest_dir.mkdir(exist_ok=True)\n\n    add_special_tokens_to_vocab(source_dir)\n    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))\n    save_tokenizer(tokenizer, dest_dir)\n\n    opus_state = OpusState(source_dir)\n    assert opus_state.cfg[\"vocab_size\"] == len(tokenizer.encoder)\n    # save_json(opus_state.cfg, dest_dir / \"marian_original_config.json\")\n    # ^^ Save human readable marian config for debugging\n\n    model = opus_state.load_marian_model()\n    model.save_pretrained(dest_dir)\n    model.from_pretrained(dest_dir)  # sanity check\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\"--src\", type=str, help=\"path to marian model dir\", default=\"en-de\")\n    parser.add_argument(\"--dest\", 
type=str, default=None, help=\"Path to the output PyTorch model.\")\n    args = parser.parse_args()\n\n    source_dir = Path(args.src)\n    assert source_dir.exists()\n    dest_dir = f\"converted-{source_dir.name}\" if args.dest is None else args.dest\n\n\ndef load_yaml(path):\n    import yaml\n\n    with open(path) as f:\n        return yaml.load(f, Loader=yaml.BaseLoader)\n\n\ndef save_json(content: Union[Dict, List], path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(content, f)\n\n\ndef unzip(zip_path: str, dest_dir: str) -> None:\n    with ZipFile(zip_path, \"r\") as zipObj:\n        zipObj.extractall(dest_dir)\n\n\nif __name__ == \"__main__\":\n    # convert() depends on load_yaml/save_json/unzip, which are only defined\n    # above this point, so the actual conversion is triggered here rather than\n    # inside the argument-parsing block (calling it there raises a NameError).\n    convert(source_dir, dest_dir)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_openai_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, OpenAIGPTConfig, OpenAIGPTModel, load_tf_weights_in_openai_gpt\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if openai_config_file == \"\":\n        config = OpenAIGPTConfig()\n    else:\n        config = OpenAIGPTConfig.from_json_file(openai_config_file)\n    model = OpenAIGPTModel(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--openai_checkpoint_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the TensorFlow checkpoint path.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--openai_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_openai_checkpoint_to_pytorch(\n        args.openai_checkpoint_folder_path, args.openai_config_file, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_pytorch_checkpoint_to_tf2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Convert pytorch checkpoints to TensorFlow \"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nfrom transformers import (\n    ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    WEIGHTS_NAME,\n    XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    AlbertConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TFAlbertForPreTraining,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFCamembertForMaskedLM,\n    TFCTRLLMHeadModel,\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFElectraForPreTraining,\n    TFFlaubertWithLMHeadModel,\n    TFGPT2LMHeadModel,\n    TFOpenAIGPTLMHeadModel,\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFT5ForConditionalGeneration,\n    TFTransfoXLLMHeadModel,\n    TFXLMRobertaForMaskedLM,\n    TFXLMWithLMHeadModel,\n    TFXLNetLMHeadModel,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n    cached_path,\n    hf_bucket_url,\n    is_torch_available,\n    load_pytorch_checkpoint_in_tf2_model,\n)\n\n\nif is_torch_available():\n    import torch\n    import numpy as np\n    from transformers import (\n        BertForPreTraining,\n        BertForQuestionAnswering,\n        BertForSequenceClassification,\n        GPT2LMHeadModel,\n        XLNetLMHeadModel,\n        XLMWithLMHeadModel,\n        XLMRobertaForMaskedLM,\n        TransfoXLLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        RobertaForMaskedLM,\n        RobertaForSequenceClassification,\n        CamembertForMaskedLM,\n        FlaubertWithLMHeadModel,\n        DistilBertForMaskedLM,\n        DistilBertForQuestionAnswering,\n        CTRLLMHeadModel,\n        AlbertForPreTraining,\n        T5ForConditionalGeneration,\n        ElectraForPreTraining,\n    )\n\n\nlogging.basicConfig(level=logging.INFO)\n\nMODEL_CLASSES = {\n    \"bert\": (BertConfig, TFBertForPreTraining, BertForPreTraining, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    
\"bert-large-cased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"bert-base-cased-finetuned-mrpc\": (\n        BertConfig,\n        TFBertForSequenceClassification,\n        BertForSequenceClassification,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"gpt2\": (GPT2Config, TFGPT2LMHeadModel, GPT2LMHeadModel, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlnet\": (XLNetConfig, TFXLNetLMHeadModel, XLNetLMHeadModel, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm\": (XLMConfig, TFXLMWithLMHeadModel, XLMWithLMHeadModel, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm-roberta\": (\n        XLMRobertaConfig,\n        TFXLMRobertaForMaskedLM,\n        XLMRobertaForMaskedLM,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"transfo-xl\": (\n        TransfoXLConfig,\n        TFTransfoXLLMHeadModel,\n        TransfoXLLMHeadModel,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"openai-gpt\": (\n        OpenAIGPTConfig,\n        TFOpenAIGPTLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"roberta\": (RobertaConfig, TFRobertaForMaskedLM, RobertaForMaskedLM, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"roberta-large-mnli\": (\n        RobertaConfig,\n        TFRobertaForSequenceClassification,\n        RobertaForSequenceClassification,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"camembert\": (\n        CamembertConfig,\n        TFCamembertForMaskedLM,\n        CamembertForMaskedLM,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"flaubert\": (\n        FlaubertConfig,\n        TFFlaubertWithLMHeadModel,\n        FlaubertWithLMHeadModel,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert\": (\n        DistilBertConfig,\n        TFDistilBertForMaskedLM,\n        DistilBertForMaskedLM,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert-base-distilled-squad\": (\n        DistilBertConfig,\n        TFDistilBertForQuestionAnswering,\n        DistilBertForQuestionAnswering,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"ctrl\": (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"albert\": (AlbertConfig, TFAlbertForPreTraining, AlbertForPreTraining, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"t5\": (T5Config, TFT5ForConditionalGeneration, T5ForConditionalGeneration, T5_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"electra\": (ElectraConfig, TFElectraForPreTraining, ElectraForPreTraining, ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n}\n\n\ndef convert_pt_checkpoint_to_tf(\n    model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True\n):\n    if model_type not in MODEL_CLASSES:\n        raise ValueError(\"Unrecognized model type, should be one of {}.\".format(list(MODEL_CLASSES.keys())))\n\n    config_class, model_class, pt_model_class, aws_config_map = MODEL_CLASSES[model_type]\n\n    # Initialise TF model\n    if config_file in aws_config_map:\n        config_file = cached_path(aws_config_map[config_file], force_download=not use_cached_models)\n    config = config_class.from_json_file(config_file)\n    config.output_hidden_states = True\n    config.output_attentions = True\n    print(\"Building TensorFlow model from configuration: {}\".format(str(config)))\n    
tf_model = model_class(config)\n\n    # Load weights from tf checkpoint\n    if pytorch_checkpoint_path in aws_config_map.keys():\n        pytorch_checkpoint_url = hf_bucket_url(pytorch_checkpoint_path, filename=WEIGHTS_NAME)\n        pytorch_checkpoint_path = cached_path(pytorch_checkpoint_url, force_download=not use_cached_models)\n    # Load PyTorch checkpoint in tf2 model:\n    tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)\n\n    if compare_with_pt_model:\n        tfo = tf_model(tf_model.dummy_inputs, training=False)  # build the network\n\n        state_dict = torch.load(pytorch_checkpoint_path, map_location=\"cpu\")\n        pt_model = pt_model_class.from_pretrained(\n            pretrained_model_name_or_path=None, config=config, state_dict=state_dict\n        )\n\n        with torch.no_grad():\n            pto = pt_model(**pt_model.dummy_inputs)\n\n        np_pt = pto[0].numpy()\n        np_tf = tfo[0].numpy()\n        diff = np.amax(np.abs(np_pt - np_tf))\n        print(\"Max absolute difference between models outputs {}\".format(diff))\n        assert diff <= 2e-2, \"Error, model absolute difference is >2e-2: {}\".format(diff)\n\n    # Save pytorch-model\n    print(\"Save TensorFlow model to {}\".format(tf_dump_path))\n    tf_model.save_weights(tf_dump_path, save_format=\"h5\")\n\n\ndef convert_all_pt_checkpoints_to_tf(\n    args_model_type,\n    tf_dump_path,\n    model_shortcut_names_or_path=None,\n    config_shortcut_names_or_path=None,\n    compare_with_pt_model=False,\n    use_cached_models=False,\n    remove_cached_files=False,\n    only_convert_finetuned_models=False,\n):\n    assert os.path.isdir(args.tf_dump_path), \"--tf_dump_path should be a directory\"\n\n    if args_model_type is None:\n        model_types = list(MODEL_CLASSES.keys())\n    else:\n        model_types = [args_model_type]\n\n    for j, model_type in enumerate(model_types, start=1):\n        print(\"=\" * 100)\n        print(\" Converting model type {}/{}: {}\".format(j, len(model_types), model_type))\n        print(\"=\" * 100)\n        if model_type not in MODEL_CLASSES:\n            raise ValueError(\n                \"Unrecognized model type {}, should be one of {}.\".format(model_type, list(MODEL_CLASSES.keys()))\n            )\n\n        config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]\n\n        if model_shortcut_names_or_path is None:\n            model_shortcut_names_or_path = list(aws_model_maps.keys())\n        if config_shortcut_names_or_path is None:\n            config_shortcut_names_or_path = model_shortcut_names_or_path\n\n        for i, (model_shortcut_name, config_shortcut_name) in enumerate(\n            zip(model_shortcut_names_or_path, config_shortcut_names_or_path), start=1\n        ):\n            print(\"-\" * 100)\n            if \"-squad\" in model_shortcut_name or \"-mrpc\" in model_shortcut_name or \"-mnli\" in model_shortcut_name:\n                if not only_convert_finetuned_models:\n                    print(\"    Skipping finetuned checkpoint {}\".format(model_shortcut_name))\n                    continue\n                model_type = model_shortcut_name\n            elif only_convert_finetuned_models:\n                print(\"    Skipping not finetuned checkpoint {}\".format(model_shortcut_name))\n                continue\n            print(\n                \"    Converting checkpoint {}/{}: {} - model_type {}\".format(\n                    i, len(aws_config_map), 
model_shortcut_name, model_type\n                )\n            )\n            print(\"-\" * 100)\n\n            if config_shortcut_name in aws_config_map:\n                config_file = cached_path(aws_config_map[config_shortcut_name], force_download=not use_cached_models)\n            else:\n                config_file = cached_path(config_shortcut_name, force_download=not use_cached_models)\n\n            if model_shortcut_name in aws_model_maps:\n                model_file = cached_path(aws_model_maps[model_shortcut_name], force_download=not use_cached_models)\n            else:\n                model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)\n\n            if os.path.isfile(model_shortcut_name):\n                model_shortcut_name = \"converted_model\"\n\n            convert_pt_checkpoint_to_tf(\n                model_type=model_type,\n                pytorch_checkpoint_path=model_file,\n                config_file=config_file,\n                tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + \"-tf_model.h5\"),\n                compare_with_pt_model=compare_with_pt_model,\n            )\n            if remove_cached_files:\n                os.remove(config_file)\n                os.remove(model_file)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_dump_path\", default=None, type=str, required=True, help=\"Path to the output Tensorflow dump file.\"\n    )\n    parser.add_argument(\n        \"--model_type\",\n        default=None,\n        type=str,\n        help=\"Model type selected in the list of {}. If not given, will download and convert all the models from AWS.\".format(\n            list(MODEL_CLASSES.keys())\n        ),\n    )\n    parser.add_argument(\n        \"--pytorch_checkpoint_path\",\n        default=None,\n        type=str,\n        help=\"Path to the PyTorch checkpoint path or shortcut name to download from AWS. \"\n        \"If not given, will download and convert all the checkpoints from AWS.\",\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture. 
If not given and \"\n        \"--pytorch_checkpoint_path is not given or is a shortcut name\"\n        \"use the configuration associated to the shortcut name on the AWS\",\n    )\n    parser.add_argument(\n        \"--compare_with_pt_model\", action=\"store_true\", help=\"Compare Tensorflow and PyTorch model predictions.\"\n    )\n    parser.add_argument(\n        \"--use_cached_models\",\n        action=\"store_true\",\n        help=\"Use cached models if possible instead of updating to latest checkpoint versions.\",\n    )\n    parser.add_argument(\n        \"--remove_cached_files\",\n        action=\"store_true\",\n        help=\"Remove pytorch models after conversion (save memory when converting in batches).\",\n    )\n    parser.add_argument(\"--only_convert_finetuned_models\", action=\"store_true\", help=\"Only convert finetuned models.\")\n    args = parser.parse_args()\n\n    # if args.pytorch_checkpoint_path is not None:\n    #     convert_pt_checkpoint_to_tf(args.model_type.lower(),\n    #                                 args.pytorch_checkpoint_path,\n    #                                 args.config_file if args.config_file is not None else args.pytorch_checkpoint_path,\n    #                                 args.tf_dump_path,\n    #                                 compare_with_pt_model=args.compare_with_pt_model,\n    #                                 use_cached_models=args.use_cached_models)\n    # else:\n    convert_all_pt_checkpoints_to_tf(\n        args.model_type.lower() if args.model_type is not None else None,\n        args.tf_dump_path,\n        model_shortcut_names_or_path=[args.pytorch_checkpoint_path]\n        if args.pytorch_checkpoint_path is not None\n        else None,\n        config_shortcut_names_or_path=[args.config_file] if args.config_file is not None else None,\n        compare_with_pt_model=args.compare_with_pt_model,\n        use_cached_models=args.use_cached_models,\n        remove_cached_files=args.remove_cached_files,\n        only_convert_finetuned_models=args.only_convert_finetuned_models,\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_reformer_trax_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Reformer checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pickle\n\nimport numpy as np\nimport torch\n\nfrom transformers import ReformerConfig, ReformerModelWithLMHead\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef set_param(torch_layer, weight, bias=None):\n    # set parameter of one layer\n    assert torch_layer.weight.shape == weight.shape, \"{} layer.weight does not match\".format(torch_layer)\n    torch_layer.weight = torch.nn.Parameter(weight)\n    if bias is not None:\n        assert torch_layer.bias.shape == bias.shape, \"{} layer.bias does not match\".format(torch_layer)\n        torch_layer.bias = torch.nn.Parameter(bias)\n\n\ndef set_layer_weights_in_torch_lsh(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query_key = np.asarray(weights[0])\n    np_value = np.asarray(weights[1])\n    np_dense = np.asarray(weights[2])\n\n    set_param(\n        torch_layer.self_attention.query_key,\n        torch.tensor(np_query_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_layer_weights_in_torch_local(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query = np.asarray(weights[0])\n    np_key = np.asarray(weights[1])\n    np_value = np.asarray(weights[2])\n    np_dense = np.asarray(weights[3])\n\n    set_param(\n        torch_layer.self_attention.query, torch.tensor(np_query).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.key, torch.tensor(np_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_block_weights_in_torch(weights, torch_block, hidden_size):\n    # layernorm 1\n    layer_norm_1 = weights[0][0][0]\n    layer_norm_1_weight = np.asarray(layer_norm_1[0])\n    layer_norm_1_bias = np.asarray(layer_norm_1[1])\n    set_param(\n        torch_block.attention.layer_norm, torch.tensor(layer_norm_1_weight), torch.tensor(layer_norm_1_bias),\n    )\n\n    # lsh weights + output\n    attn_weights = weights[0][1]\n    if len(attn_weights) < 4:\n        set_layer_weights_in_torch_lsh(attn_weights, torch_block.attention, hidden_size)\n    else:\n        set_layer_weights_in_torch_local(attn_weights, torch_block.attention, hidden_size)\n\n    # intermediate weighs\n    intermediate_weights = 
weights[2][0][1][2]\n\n    # Chunked Feed Forward\n    if len(intermediate_weights) == 4:\n        intermediate_weights = intermediate_weights[2]\n\n    # layernorm 2\n    layer_norm_2_weight = np.asarray(intermediate_weights[0][0])\n    layer_norm_2_bias = np.asarray(intermediate_weights[0][1])\n    set_param(\n        torch_block.feed_forward.layer_norm, torch.tensor(layer_norm_2_weight), torch.tensor(layer_norm_2_bias),\n    )\n\n    # intermediate dense\n    inter_dense_weight = np.asarray(intermediate_weights[1][0])\n    inter_dense_bias = np.asarray(intermediate_weights[1][1])\n    set_param(\n        torch_block.feed_forward.dense.dense,\n        torch.tensor(inter_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(inter_dense_bias),\n    )\n\n    # intermediate out\n    out_dense_weight = np.asarray(intermediate_weights[4][0])\n    out_dense_bias = np.asarray(intermediate_weights[4][1])\n    set_param(\n        torch_block.feed_forward.output.dense,\n        torch.tensor(out_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(out_dense_bias),\n    )\n\n\ndef set_model_weights_in_torch(weights, torch_model, hidden_size):\n    # reformer model\n    torch_model_reformer = torch_model.reformer\n\n    # word embeds\n    word_embeddings = np.asarray(weights[1])\n    set_param(\n        torch_model_reformer.embeddings.word_embeddings, torch.tensor(word_embeddings),\n    )\n\n    if isinstance(weights[3], tuple):\n        position_embeddings = torch_model_reformer.embeddings.position_embeddings\n        for emb_idx in range(len(position_embeddings.weights)):\n            emb_weights = np.asarray(weights[3][emb_idx][0])\n            assert position_embeddings.weights[emb_idx].shape == emb_weights.shape, \"{} emb does not match\".format(\n                position_embeddings[emb_idx]\n            )\n            position_embeddings.weights[emb_idx] = torch.nn.Parameter(torch.tensor(emb_weights))\n\n    trax_layer_weights = weights[5]\n    assert len(torch_model_reformer.encoder.layers) * 4 == len(\n        trax_layer_weights\n    ), \"HF and trax model do not have the same number of layers\"\n    for layer_idx, layer in enumerate(torch_model_reformer.encoder.layers):\n        block_weights = trax_layer_weights[4 * layer_idx : 4 * (layer_idx + 1)]\n        set_block_weights_in_torch(block_weights, layer, hidden_size)\n\n    # output layer norm\n    layer_norm_out_weight = np.asarray(weights[7][0])\n    layer_norm_out_bias = np.asarray(weights[7][1])\n    set_param(\n        torch_model_reformer.encoder.layer_norm,\n        torch.tensor(layer_norm_out_weight),\n        torch.tensor(layer_norm_out_bias),\n    )\n\n    # output embeddings\n    output_embed_weights = np.asarray(weights[9][0])\n    output_embed_bias = np.asarray(weights[9][1])\n    set_param(\n        torch_model.lm_head.decoder,\n        torch.tensor(output_embed_weights).transpose(0, 1).contiguous(),\n        torch.tensor(output_embed_bias),\n    )\n\n\ndef convert_trax_checkpoint_to_pytorch(trax_model_pkl_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = ReformerConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = ReformerModelWithLMHead(config)\n\n    with open(trax_model_pkl_path, \"rb\") as f:\n        model_weights = pickle.load(f)[\"weights\"]\n\n    set_model_weights_in_torch(model_weights, model, config.hidden_size)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to 
{}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--trax_model_pkl_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained Reformer model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_trax_checkpoint_to_pytorch(args.trax_model_pkl_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_roberta_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pathlib\n\nimport fairseq\nimport torch\nfrom fairseq.models.roberta import RobertaModel as FairseqRobertaModel\nfrom fairseq.modules import TransformerSentenceEncoderLayer\nfrom packaging import version\n\nfrom transformers.modeling_bert import BertIntermediate, BertLayer, BertOutput, BertSelfAttention, BertSelfOutput\nfrom transformers.modeling_roberta import RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification\n\n\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \"Hello world! cécé herlolip\"\n\n\ndef convert_roberta_checkpoint_to_pytorch(\n    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool\n):\n    \"\"\"\n    Copy/paste/tweak roberta's weights to our BERT structure.\n    \"\"\"\n    roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)\n    roberta.eval()  # disable dropout\n    roberta_sent_encoder = roberta.model.decoder.sentence_encoder\n    config = RobertaConfig(\n        vocab_size=roberta_sent_encoder.embed_tokens.num_embeddings,\n        hidden_size=roberta.args.encoder_embed_dim,\n        num_hidden_layers=roberta.args.encoder_layers,\n        num_attention_heads=roberta.args.encoder_attention_heads,\n        intermediate_size=roberta.args.encoder_ffn_embed_dim,\n        max_position_embeddings=514,\n        type_vocab_size=1,\n        layer_norm_eps=1e-5,  # PyTorch default used in fairseq\n    )\n    if classification_head:\n        config.num_labels = roberta.args.num_classes\n    print(\"Our BERT config:\", config)\n\n    model = RobertaForSequenceClassification(config) if classification_head else RobertaForMaskedLM(config)\n    model.eval()\n\n    # Now let's copy all the weights.\n    # Embeddings\n    model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight\n    model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight\n    model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(\n        model.roberta.embeddings.token_type_embeddings.weight\n    )  # just zero them out b/c RoBERTa doesn't use them.\n    model.roberta.embeddings.LayerNorm.weight = roberta_sent_encoder.emb_layer_norm.weight\n    model.roberta.embeddings.LayerNorm.bias = roberta_sent_encoder.emb_layer_norm.bias\n\n    for i in range(config.num_hidden_layers):\n        # Encoder: start of layer\n        layer: BertLayer = model.roberta.encoder.layer[i]\n        roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i]\n\n        # self attention\n        self_attn: BertSelfAttention = layer.attention.self\n        assert (\n            
roberta_layer.self_attn.k_proj.weight.data.shape\n            == roberta_layer.self_attn.q_proj.weight.data.shape\n            == roberta_layer.self_attn.v_proj.weight.data.shape\n            == torch.Size((config.hidden_size, config.hidden_size))\n        )\n\n        self_attn.query.weight.data = roberta_layer.self_attn.q_proj.weight\n        self_attn.query.bias.data = roberta_layer.self_attn.q_proj.bias\n        self_attn.key.weight.data = roberta_layer.self_attn.k_proj.weight\n        self_attn.key.bias.data = roberta_layer.self_attn.k_proj.bias\n        self_attn.value.weight.data = roberta_layer.self_attn.v_proj.weight\n        self_attn.value.bias.data = roberta_layer.self_attn.v_proj.bias\n\n        # self-attention output\n        self_output: BertSelfOutput = layer.attention.output\n        assert self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape\n        self_output.dense.weight = roberta_layer.self_attn.out_proj.weight\n        self_output.dense.bias = roberta_layer.self_attn.out_proj.bias\n        self_output.LayerNorm.weight = roberta_layer.self_attn_layer_norm.weight\n        self_output.LayerNorm.bias = roberta_layer.self_attn_layer_norm.bias\n\n        # intermediate\n        intermediate: BertIntermediate = layer.intermediate\n        assert intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape\n        intermediate.dense.weight = roberta_layer.fc1.weight\n        intermediate.dense.bias = roberta_layer.fc1.bias\n\n        # output\n        bert_output: BertOutput = layer.output\n        assert bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape\n        bert_output.dense.weight = roberta_layer.fc2.weight\n        bert_output.dense.bias = roberta_layer.fc2.bias\n        bert_output.LayerNorm.weight = roberta_layer.final_layer_norm.weight\n        bert_output.LayerNorm.bias = roberta_layer.final_layer_norm.bias\n        # end of layer\n\n    if classification_head:\n        model.classifier.dense.weight = roberta.model.classification_heads[\"mnli\"].dense.weight\n        model.classifier.dense.bias = roberta.model.classification_heads[\"mnli\"].dense.bias\n        model.classifier.out_proj.weight = roberta.model.classification_heads[\"mnli\"].out_proj.weight\n        model.classifier.out_proj.bias = roberta.model.classification_heads[\"mnli\"].out_proj.bias\n    else:\n        # LM Head\n        model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight\n        model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias\n        model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight\n        model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias\n        model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight\n        model.lm_head.decoder.bias = roberta.model.decoder.lm_head.bias\n\n    # Let's check that we get the same results.\n    input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0)  # batch of size 1\n\n    our_output = model(input_ids)[0]\n    if classification_head:\n        their_output = roberta.model.classification_heads[\"mnli\"](roberta.extract_features(input_ids))\n    else:\n        their_output = roberta.model(input_ids)[0]\n    print(our_output.shape, their_output.shape)\n    max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()\n    print(f\"max_absolute_diff = {max_absolute_diff}\")  # ~ 1e-7\n    success = torch.allclose(our_output, their_output, atol=1e-3)\n    print(\"Do both 
models output the same tensors?\", \"🔥\" if success else \"💩\")\n    if not success:\n        raise Exception(\"Something went wRoNg\")\n\n    pathlib.Path(pytorch_dump_folder_path).mkdir(parents=True, exist_ok=True)\n    print(f\"Saving model to {pytorch_dump_folder_path}\")\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--roberta_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--classification_head\", action=\"store_true\", help=\"Whether to convert a final classification head.\"\n    )\n    args = parser.parse_args()\n    convert_roberta_checkpoint_to_pytorch(\n        args.roberta_checkpoint_path, args.pytorch_dump_folder_path, args.classification_head\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_t5_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The T5 authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert T5 checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import T5Config, T5Model, load_tf_weights_in_t5\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = T5Config.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = T5Model(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_t5(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained T5 model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Transformer XL checkpoint and datasets.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nimport pickle\nimport sys\n\nimport torch\n\nimport transformers.tokenization_transfo_xl as data_utils\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    TransfoXLConfig,\n    TransfoXLLMHeadModel,\n    load_tf_weights_in_transfo_xl,\n)\nfrom transformers.tokenization_transfo_xl import CORPUS_NAME, VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n# We do this to be able to load python 2 datasets pickles\n# See e.g. https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory/2121918#2121918\ndata_utils.Vocab = data_utils.TransfoXLTokenizer\ndata_utils.Corpus = data_utils.TransfoXLCorpus\nsys.modules[\"data_utils\"] = data_utils\nsys.modules[\"vocabulary\"] = data_utils\n\n\ndef convert_transfo_xl_checkpoint_to_pytorch(\n    tf_checkpoint_path, transfo_xl_config_file, pytorch_dump_folder_path, transfo_xl_dataset_file\n):\n    if transfo_xl_dataset_file:\n        # Convert a pre-processed corpus (see original TensorFlow repo)\n        with open(transfo_xl_dataset_file, \"rb\") as fp:\n            corpus = pickle.load(fp, encoding=\"latin1\")\n        # Save vocabulary and dataset cache as Dictionaries (should be better than pickles for the long-term)\n        pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"pretrained_vocab_file\"]\n        print(\"Save vocabulary to {}\".format(pytorch_vocab_dump_path))\n        corpus_vocab_dict = corpus.vocab.__dict__\n        torch.save(corpus_vocab_dict, pytorch_vocab_dump_path)\n\n        corpus_dict_no_vocab = corpus.__dict__\n        corpus_dict_no_vocab.pop(\"vocab\", None)\n        pytorch_dataset_dump_path = pytorch_dump_folder_path + \"/\" + CORPUS_NAME\n        print(\"Save dataset to {}\".format(pytorch_dataset_dump_path))\n        torch.save(corpus_dict_no_vocab, pytorch_dataset_dump_path)\n\n    if tf_checkpoint_path:\n        # Convert a pre-trained TensorFlow model\n        config_path = os.path.abspath(transfo_xl_config_file)\n        tf_path = os.path.abspath(tf_checkpoint_path)\n\n        print(\"Converting Transformer XL checkpoint from {} with config at {}\".format(tf_path, config_path))\n        # Initialise PyTorch model\n        if transfo_xl_config_file == \"\":\n            config = TransfoXLConfig()\n        else:\n            config = TransfoXLConfig.from_json_file(transfo_xl_config_file)\n        print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n        model = TransfoXLLMHeadModel(config)\n\n        model = load_tf_weights_in_transfo_xl(model, config, tf_path)\n        # Save pytorch-model\n        pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n        pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n 
       print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n        torch.save(model.state_dict(), pytorch_weights_dump_path)\n        print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n        with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n            f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--tf_checkpoint_path\",\n        default=\"\",\n        type=str,\n        help=\"An optional path to a TensorFlow checkpoint path to be converted.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_dataset_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional dataset file to be converted in a vocabulary.\",\n    )\n    args = parser.parse_args()\n    convert_transfo_xl_checkpoint_to_pytorch(\n        args.tf_checkpoint_path,\n        args.transfo_xl_config_file,\n        args.pytorch_dump_folder_path,\n        args.transfo_xl_dataset_file,\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_xlm_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport json\nimport logging\n\nimport numpy\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME\nfrom transformers.tokenization_xlm import VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlm_checkpoint_to_pytorch(xlm_checkpoint_path, pytorch_dump_folder_path):\n    # Load checkpoint\n    chkpt = torch.load(xlm_checkpoint_path, map_location=\"cpu\")\n\n    state_dict = chkpt[\"model\"]\n\n    # We have the base model one level deeper than the original XLM repository\n    two_levels_state_dict = {}\n    for k, v in state_dict.items():\n        if \"pred_layer\" in k:\n            two_levels_state_dict[k] = v\n        else:\n            two_levels_state_dict[\"transformer.\" + k] = v\n\n    config = chkpt[\"params\"]\n    config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.FloatTensor, numpy.ndarray)))\n\n    vocab = chkpt[\"dico_word2id\"]\n    vocab = dict((s + \"</w>\" if s.find(\"@@\") == -1 and i > 13 else s.replace(\"@@\", \"\"), i) for s, i in vocab.items())\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"vocab_file\"]\n\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(two_levels_state_dict, pytorch_weights_dump_path)\n\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(config, indent=2) + \"\\n\")\n\n    print(\"Save vocab file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_vocab_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(vocab, indent=2) + \"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--xlm_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_xlm_checkpoint_to_pytorch(args.xlm_checkpoint_path, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/convert_xlnet_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nimport torch\n\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    XLNetConfig,\n    XLNetForQuestionAnswering,\n    XLNetForSequenceClassification,\n    XLNetLMHeadModel,\n    load_tf_weights_in_xlnet,\n)\n\n\nGLUE_TASKS_NUM_LABELS = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlnet_checkpoint_to_pytorch(\n    tf_checkpoint_path, bert_config_file, pytorch_dump_folder_path, finetuning_task=None\n):\n    # Initialise PyTorch model\n    config = XLNetConfig.from_json_file(bert_config_file)\n\n    finetuning_task = finetuning_task.lower() if finetuning_task is not None else \"\"\n    if finetuning_task in GLUE_TASKS_NUM_LABELS:\n        print(\"Building PyTorch XLNetForSequenceClassification model from configuration: {}\".format(str(config)))\n        config.finetuning_task = finetuning_task\n        config.num_labels = GLUE_TASKS_NUM_LABELS[finetuning_task]\n        model = XLNetForSequenceClassification(config)\n    elif \"squad\" in finetuning_task:\n        config.finetuning_task = finetuning_task\n        model = XLNetForQuestionAnswering(config)\n    else:\n        model = XLNetLMHeadModel(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_xlnet(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n    pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n    print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--xlnet_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained XLNet model. 
\\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--finetuning_task\",\n        default=None,\n        type=str,\n        help=\"Name of a task on which the XLNet TensorFlow model was fine-tuned\",\n    )\n    args = parser.parse_args()\n    print(args)\n\n    convert_xlnet_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.xlnet_config_file, args.pytorch_dump_folder_path, args.finetuning_task\n    )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .metrics import is_sklearn_available\nfrom .processors import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n\nif is_sklearn_available():\n    from .metrics import glue_compute_metrics, xnli_compute_metrics\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/data_collator.py",
    "content": "from abc import ABC, abstractmethod\nfrom dataclasses import dataclass\nfrom typing import Any, Dict, List, NewType, Tuple\n\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nimport random\nimport numpy as np\nfrom ..tokenization_utils import PreTrainedTokenizer\n\n\nclass DataCollator(ABC):\n    \"\"\"\n    A `DataCollator` is responsible for batching\n    and pre-processing samples of data as requested by the training loop.\n    \"\"\"\n\n    @abstractmethod\n    def collate_batch(self) -> Dict[str, torch.Tensor]:\n        \"\"\"\n        Take a list of samples from a Dataset and collate them into a batch.\n\n        Returns:\n            A dictionary of tensors\n        \"\"\"\n        pass\n\n\nInputDataClass = NewType(\"InputDataClass\", Any)\n\n\n@dataclass\nclass DefaultDataCollator(DataCollator):\n    \"\"\"\n    Very simple data collator that:\n    - simply collates batches of dict-like objects\n    - Performs special handling for potential keys named:\n        - `label`: handles a single value (int or float) per object\n        - `label_ids`: handles a list of values per object\n    - does not do any additional preprocessing\n\n    i.e., Property names of the input object will be used as corresponding inputs to the model.\n    See glue and ner for example of how it's useful.\n    \"\"\"\n\n    def collate_batch(self, features: List[InputDataClass]) -> Dict[str, torch.Tensor]:\n        # In this method we'll make the assumption that all `features` in the batch\n        # have the same attributes.\n        # So we will look at the first element as a proxy for what attributes exist\n        # on the whole batch.\n        first = features[0]\n\n        # Special handling for labels.\n        # Ensure that tensor is created with the correct type\n        # (it should be automatically the case, but let's make sure of it.)\n        if hasattr(first, \"label\") and first.label is not None:\n            if type(first.label) is int:\n                labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        elif hasattr(first, \"label_ids\") and first.label_ids is not None:\n            if type(first.label_ids[0]) is int:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        else:\n            batch = {}\n\n        # Handling of all other possible attributes.\n        # Again, we will use the first element to figure out which key/values are not None for this model.\n        for k, v in vars(first).items():\n            if k not in (\"label\", \"label_ids\") and v is not None and not isinstance(v, str):\n                batch[k] = torch.tensor([getattr(f, k) for f in features], dtype=torch.long)\n        return batch\n\n\n@dataclass\nclass DataCollatorForLanguageModeling(DataCollator):\n    \"\"\"\n    Data collator used for language modeling.\n    - collates batches of tensors, honoring their tokenizer's pad_token\n    - preprocesses batches for masked language modeling\n    \"\"\"\n\n    tokenizer: PreTrainedTokenizer\n    mlm: bool = True\n    mlm_probability: float = 0.15\n\n    def collate_batch(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:\n        batch = 
self._tensorize_batch(examples)\n        if self.mlm:\n            inputs, labels = self.mask_tokens7(batch)\n            return {\"input_ids\": inputs, \"labels\": labels}\n        else:\n            return {\"input_ids\": batch, \"labels\": batch}\n\n    def _tensorize_batch(self, examples: List[torch.Tensor]) -> torch.Tensor:\n        length_of_first = examples[0].size(0)\n        are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)\n        if are_tensors_same_length:\n            return torch.stack(examples, dim=0)\n        else:\n            if self.tokenizer._pad_token is None:\n                raise ValueError(\n                    \"You are attempting to pad samples but the tokenizer you are using\"\n                    f\" ({self.tokenizer.__class__.__name__}) does not have one.\"\n                )\n            return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)\n\n    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        masked_indices = torch.bernoulli(probability_matrix).bool()\n        labels[~masked_indices] = -100  # We only compute loss on masked tokens\n\n        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])\n        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices\n        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)\n\n        # 10% of the time, we replace masked input tokens with random word\n        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced\n        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)\n        inputs[indices_random] = random_words[indices_random]\n\n        # The rest of the time (10% of the time) we keep the masked input tokens unchanged\n        return inputs, labels\n\n    def mask_tokens2(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            inputs[i][j] = self.tokenizer.mask_token_id\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens3(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.85:\n                                for k in range(j,min(j+5,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.7647:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.5384:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.42857:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens4(self, inputs: torch.Tensor) -> 
Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        inputs = inputs.numpy()\n        ids = [i for i in range(len(inputs))]\n        random.shuffle(ids)\n        inputs = inputs[ids]\n        inputs = torch.from_numpy(inputs)\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        total_token = 0\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n\n        cur_token = 0\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        ngramFlag = True\n        for i in range(len(probability_matrix)):\n            if cur_token > total_token * 0.03:\n                ngramFlag = False\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9 and ngramFlag:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.222 and ngramFlag:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857 and ngramFlag:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                     
   cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens5(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        pvals = [0.4, 0.3, 0.2, 0.1]\n        ngrams = np.arange(1, 5, dtype=np.int64)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            choose = random.randint(0, 1)\n            if choose == 0:\n                startIndex = 0\n                endIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n            elif choose == 1:\n                startIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n                endIndex = np.argwhere(inputs[i] == np.float32(3))[-1][0]\n\n            valid_j = [index for index in range(startIndex, endIndex + 1)]\n\n            for j in range(len(probability_matrix[0])):\n                if cur_token < total_token * 0.15:\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        n = np.random.choice(ngrams, p=pvals)\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if j+k in valid_j:\n                                if random.random() > 0.7:\n                                    if 
random.random() > 0.2:\n                                        if probability_matrix[i][j+k] == np.float32(0.15):\n                                            inputs[i][j+k] = self.tokenizer.mask_token_id\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    elif random.random() > 0.5:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    else:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                else:\n                                    labels[i][j] = np.float32(-100)\n                            else:\n                                labels[i][j] = np.float32(-100)\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens6(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token > total_token*0.15:\n                    break\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9:\n                                for k in range(j, min(j + 4, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.222:\n                                for k in range(j, min(j + 3, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857:\n                                for k in range(j, min(j + 2, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i, j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            cur_token += 1\n\n                    else:\n                        labels[i][j] = np.float32(-100)\n\n\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        
inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens7(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        ngrams = np.arange(1, 3 + 1, dtype=np.int64)\n        pvals = 1. / np.arange(1, 3 + 1)\n        pvals /= pvals.sum(keepdims=True)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token <= total_token * 0.15:\n                    n = np.random.choice(ngrams, p=pvals)\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if random.random() > 0.85:\n                                if random.random() > 0.2:\n                                    if probability_matrix[i][j+k] == np.float32(0.15):\n                                        inputs[i][j+k] = self.tokenizer.mask_token_id\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                elif random.random() > 0.5:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                else:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                            else:\n                                labels[i][j] = np.float32(-100)\n\n                    else:\n                     
   labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/datasets/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import GlueDataset, GlueDataTrainingArguments\nfrom .language_modeling import LineByLineTextDataset, TextDataset\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/datasets/glue.py",
    "content": "import logging\nimport os\nimport time\nfrom dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom ...tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors\nfrom ..processors.utils import InputFeatures\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass GlueDataTrainingArguments:\n    \"\"\"\n    Arguments pertaining to what data we are going to input our model for training and eval.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    task_name: str = field(metadata={\"help\": \"The name of the task to train on: \" + \", \".join(glue_processors.keys())})\n    data_dir: str = field(\n        metadata={\"help\": \"The input data dir. Should contain the .tsv files (or other data files) for the task.\"}\n    )\n    max_seq_length: int = field(\n        default=128,\n        metadata={\n            \"help\": \"The maximum total input sequence length after tokenization. Sequences longer \"\n            \"than this will be truncated, sequences shorter will be padded.\"\n        },\n    )\n    overwrite_cache: bool = field(\n        default=False, metadata={\"help\": \"Overwrite the cached training and evaluation sets\"}\n    )\n\n    def __post_init__(self):\n        self.task_name = self.task_name.lower()\n\n\nclass Split(Enum):\n    train = \"train\"\n    dev = \"dev\"\n    test = \"test\"\n\n\nclass GlueDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    args: GlueDataTrainingArguments\n    output_mode: str\n    features: List[InputFeatures]\n\n    def __init__(\n        self,\n        args: GlueDataTrainingArguments,\n        tokenizer: PreTrainedTokenizer,\n        limit_length: Optional[int] = None,\n        mode: Union[str, Split] = Split.train,\n    ):\n        self.args = args\n        self.processor = glue_processors[args.task_name]()\n        self.output_mode = glue_output_modes[args.task_name]\n        if isinstance(mode, str):\n            try:\n                mode = Split[mode]\n            except KeyError:\n                raise KeyError(\"mode is not a valid split name\")\n        # Load data features from cache or dataset file\n        cached_features_file = os.path.join(\n            args.data_dir,\n            \"cached_{}_{}_{}_{}\".format(\n                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,\n            ),\n        )\n        label_list = self.processor.get_labels()\n        if args.task_name in [\"mnli\", \"mnli-mm\"] and tokenizer.__class__ in (\n            RobertaTokenizer,\n            RobertaTokenizerFast,\n            XLMRobertaTokenizer,\n        ):\n            # HACK(label indices are swapped in RoBERTa pretrained model)\n            label_list[1], label_list[2] = label_list[2], label_list[1]\n        self.label_list = label_list\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with 
FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not args.overwrite_cache:\n                start = time.time()\n                self.features = torch.load(cached_features_file)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n            else:\n                logger.info(f\"Creating features from dataset file at {args.data_dir}\")\n\n                if mode == Split.dev:\n                    examples = self.processor.get_dev_examples(args.data_dir)\n                elif mode == Split.test:\n                    examples = self.processor.get_test_examples(args.data_dir)\n                else:\n                    examples = self.processor.get_train_examples(args.data_dir)\n                if limit_length is not None:\n                    examples = examples[:limit_length]\n                self.features = glue_convert_examples_to_features(\n                    examples,\n                    tokenizer,\n                    max_length=args.max_seq_length,\n                    label_list=label_list,\n                    output_mode=self.output_mode,\n                )\n                start = time.time()\n                torch.save(self.features, cached_features_file)\n                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.features)\n\n    def __getitem__(self, i) -> InputFeatures:\n        return self.features[i]\n\n    def get_labels(self):\n        return self.label_list\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/datasets/language_modeling.py",
    "content": "import logging\nimport os\nimport pickle\nimport time\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(\n        self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,\n    ):\n        assert os.path.isfile(file_path)\n\n        block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)\n\n        directory, filename = os.path.split(file_path)\n        cached_features_file = os.path.join(\n            directory, \"cached_lm_{}_{}_{}\".format(tokenizer.__class__.__name__, str(block_size), filename,),\n        )\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not overwrite_cache:\n                start = time.time()\n                with open(cached_features_file, \"rb\") as handle:\n                    self.examples = pickle.load(handle)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n\n            else:\n                logger.info(f\"Creating features from dataset file at {directory}\")\n\n                self.examples = []\n                with open(file_path, encoding=\"utf-8\") as f:\n                    text = f.read()\n\n                tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))\n\n                for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size\n                    self.examples.append(\n                        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])\n                    )\n                # Note that we are losing the last truncated example here for the sake of simplicity (no padding)\n                # If your dataset is small, first you should loook for a bigger one :-) and second you\n                # can change this behavior by adding (model specific) padding.\n\n                start = time.time()\n                with open(cached_features_file, \"wb\") as handle:\n                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n\n\nclass LineByLineTextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):\n        assert os.path.isfile(file_path)\n        # Here, we do not cache the features, operating under the assumption\n        # that we will soon use fast multithreaded tokenizers from the\n        # `tokenizers` repo everywhere =)\n        logger.info(\"Creating features from dataset file at %s\", file_path)\n\n        with 
open(file_path, encoding=\"utf-8\") as f:\n            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]\n\n        batch_encoding = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)\n        self.examples = batch_encoding[\"input_ids\"]\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/metrics/__init__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\ntry:\n    from scipy.stats import pearsonr, spearmanr\n    from sklearn.metrics import matthews_corrcoef, f1_score\n\n    _has_sklearn = True\nexcept (AttributeError, ImportError):\n    _has_sklearn = False\n\n\ndef is_sklearn_available():\n    return _has_sklearn\n\n\nif _has_sklearn:\n\n    def simple_accuracy(preds, labels):\n        return (preds == labels).mean()\n\n    def acc_and_f1(preds, labels):\n        acc = simple_accuracy(preds, labels)\n        f1 = f1_score(y_true=labels, y_pred=preds)\n        return {\n            \"acc\": acc,\n            \"f1\": f1,\n            \"acc_and_f1\": (acc + f1) / 2,\n        }\n\n    def pearson_and_spearman(preds, labels):\n        pearson_corr = pearsonr(preds, labels)[0]\n        spearman_corr = spearmanr(preds, labels)[0]\n        return {\n            \"pearson\": pearson_corr,\n            \"spearmanr\": spearman_corr,\n            \"corr\": (pearson_corr + spearman_corr) / 2,\n        }\n\n    def glue_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"cola\":\n            return {\"mcc\": matthews_corrcoef(labels, preds)}\n        elif task_name == \"sst-2\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mrpc\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"sts-b\":\n            return pearson_and_spearman(preds, labels)\n        elif task_name == \"qqp\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"mnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mnli-mm\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"qnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"rte\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"wnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"hans\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n\n    def xnli_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"xnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/metrics/squad_metrics.py",
    "content": "\"\"\" Very heavily inspired by the official evaluation script for SQuAD version 2.0 which was\nmodified by XLNet authors to update `find_best_threshold` scripts for SQuAD V2.0\n\nIn addition to basic functionality, we also compute additional statistics and\nplot precision-recall curves if an additional na_prob.json file is provided.\nThis file is expected to map question ID's to the model's predicted probability\nthat a question is unanswerable.\n\"\"\"\n\n\nimport collections\nimport json\nimport logging\nimport math\nimport re\nimport string\n\nfrom transformers.tokenization_bert import BasicTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef normalize_answer(s):\n    \"\"\"Lower text and remove punctuation, articles and extra whitespace.\"\"\"\n\n    def remove_articles(text):\n        regex = re.compile(r\"\\b(a|an|the)\\b\", re.UNICODE)\n        return re.sub(regex, \" \", text)\n\n    def white_space_fix(text):\n        return \" \".join(text.split())\n\n    def remove_punc(text):\n        exclude = set(string.punctuation)\n        return \"\".join(ch for ch in text if ch not in exclude)\n\n    def lower(text):\n        return text.lower()\n\n    return white_space_fix(remove_articles(remove_punc(lower(s))))\n\n\ndef get_tokens(s):\n    if not s:\n        return []\n    return normalize_answer(s).split()\n\n\ndef compute_exact(a_gold, a_pred):\n    return int(normalize_answer(a_gold) == normalize_answer(a_pred))\n\n\ndef compute_f1(a_gold, a_pred):\n    gold_toks = get_tokens(a_gold)\n    pred_toks = get_tokens(a_pred)\n    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)\n    num_same = sum(common.values())\n    if len(gold_toks) == 0 or len(pred_toks) == 0:\n        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise\n        return int(gold_toks == pred_toks)\n    if num_same == 0:\n        return 0\n    precision = 1.0 * num_same / len(pred_toks)\n    recall = 1.0 * num_same / len(gold_toks)\n    f1 = (2 * precision * recall) / (precision + recall)\n    return f1\n\n\ndef get_raw_scores(examples, preds):\n    \"\"\"\n    Computes the exact and f1 scores from the examples and the model predictions\n    \"\"\"\n    exact_scores = {}\n    f1_scores = {}\n\n    for example in examples:\n        qas_id = example.qas_id\n        gold_answers = [answer[\"text\"] for answer in example.answers if normalize_answer(answer[\"text\"])]\n\n        if not gold_answers:\n            # For unanswerable questions, only correct answer is empty string\n            gold_answers = [\"\"]\n\n        if qas_id not in preds:\n            print(\"Missing prediction for %s\" % qas_id)\n            continue\n\n        prediction = preds[qas_id]\n        exact_scores[qas_id] = max(compute_exact(a, prediction) for a in gold_answers)\n        f1_scores[qas_id] = max(compute_f1(a, prediction) for a in gold_answers)\n\n    return exact_scores, f1_scores\n\n\ndef apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):\n    new_scores = {}\n    for qid, s in scores.items():\n        pred_na = na_probs[qid] > na_prob_thresh\n        if pred_na:\n            new_scores[qid] = float(not qid_to_has_ans[qid])\n        else:\n            new_scores[qid] = s\n    return new_scores\n\n\ndef make_eval_dict(exact_scores, f1_scores, qid_list=None):\n    if not qid_list:\n        total = len(exact_scores)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores.values()) / total),\n            
    (\"f1\", 100.0 * sum(f1_scores.values()) / total),\n                (\"total\", total),\n            ]\n        )\n    else:\n        total = len(qid_list)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores[k] for k in qid_list) / total),\n                (\"f1\", 100.0 * sum(f1_scores[k] for k in qid_list) / total),\n                (\"total\", total),\n            ]\n        )\n\n\ndef merge_eval(main_eval, new_eval, prefix):\n    for k in new_eval:\n        main_eval[\"%s_%s\" % (prefix, k)] = new_eval[k]\n\n\ndef find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for i, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n\n    has_ans_score, has_ans_cnt = 0, 0\n    for qid in qid_list:\n        if not qid_to_has_ans[qid]:\n            continue\n        has_ans_cnt += 1\n\n        if qid not in scores:\n            continue\n        has_ans_score += scores[qid]\n\n    return 100.0 * best_score / len(scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt\n\n\ndef find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(preds, f1_raw, na_probs, qid_to_has_ans)\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n    main_eval[\"has_ans_exact\"] = has_ans_exact\n    main_eval[\"has_ans_f1\"] = has_ans_f1\n\n\ndef find_best_thresh(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for _, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n    return 100.0 * best_score / len(scores), best_thresh\n\n\ndef find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)\n\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n\n\ndef squad_evaluate(examples, preds, no_answer_probs=None, no_answer_probability_threshold=1.0):\n    qas_id_to_has_answer = {example.qas_id: 
bool(example.answers) for example in examples}\n    has_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if has_answer]\n    no_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if not has_answer]\n\n    if no_answer_probs is None:\n        no_answer_probs = {k: 0.0 for k in preds}\n\n    exact, f1 = get_raw_scores(examples, preds)\n\n    exact_threshold = apply_no_ans_threshold(\n        exact, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold\n    )\n    f1_threshold = apply_no_ans_threshold(f1, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold)\n\n    evaluation = make_eval_dict(exact_threshold, f1_threshold)\n\n    if has_answer_qids:\n        has_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=has_answer_qids)\n        merge_eval(evaluation, has_ans_eval, \"HasAns\")\n\n    if no_answer_qids:\n        no_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=no_answer_qids)\n        merge_eval(evaluation, no_ans_eval, \"NoAns\")\n\n    if no_answer_probs:\n        find_all_best_thresh(evaluation, preds, exact, f1, no_answer_probs, qas_id_to_has_answer)\n\n    return evaluation\n\n\ndef get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):\n    \"\"\"Project the tokenized prediction back to the original text.\"\"\"\n\n    # When we created the data, we kept track of the alignment between original\n    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So\n    # now `orig_text` contains the span of our original text corresponding to the\n    # span that we predicted.\n    #\n    # However, `orig_text` may contain extra characters that we don't want in\n    # our prediction.\n    #\n    # For example, let's say:\n    #   pred_text = steve smith\n    #   orig_text = Steve Smith's\n    #\n    # We don't want to return `orig_text` because it contains the extra \"'s\".\n    #\n    # We don't want to return `pred_text` because it's already been normalized\n    # (the SQuAD eval script also does punctuation stripping/lower casing but\n    # our tokenizer does additional normalization like stripping accent\n    # characters).\n    #\n    # What we really want to return is \"Steve Smith\".\n    #\n    # Therefore, we have to apply a semi-complicated alignment heuristic between\n    # `pred_text` and `orig_text` to get a character-to-character alignment. This\n    # can fail in certain cases in which case we just return `orig_text`.\n\n    def _strip_spaces(text):\n        ns_chars = []\n        ns_to_s_map = collections.OrderedDict()\n        for (i, c) in enumerate(text):\n            if c == \" \":\n                continue\n            ns_to_s_map[len(ns_chars)] = i\n            ns_chars.append(c)\n        ns_text = \"\".join(ns_chars)\n        return (ns_text, ns_to_s_map)\n\n    # We first tokenize `orig_text`, strip whitespace from the result\n    # and `pred_text`, and check if they are the same length. If they are\n    # NOT the same length, the heuristic has failed. 
If they are the same\n    # length, we assume the characters are one-to-one aligned.\n    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)\n\n    tok_text = \" \".join(tokenizer.tokenize(orig_text))\n\n    start_position = tok_text.find(pred_text)\n    if start_position == -1:\n        if verbose_logging:\n            logger.info(\"Unable to find text: '%s' in '%s'\" % (pred_text, orig_text))\n        return orig_text\n    end_position = start_position + len(pred_text) - 1\n\n    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)\n    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)\n\n    if len(orig_ns_text) != len(tok_ns_text):\n        if verbose_logging:\n            logger.info(\"Length not equal after stripping spaces: '%s' vs '%s'\", orig_ns_text, tok_ns_text)\n        return orig_text\n\n    # We then project the characters in `pred_text` back to `orig_text` using\n    # the character-to-character alignment.\n    tok_s_to_ns_map = {}\n    for (i, tok_index) in tok_ns_to_s_map.items():\n        tok_s_to_ns_map[tok_index] = i\n\n    orig_start_position = None\n    if start_position in tok_s_to_ns_map:\n        ns_start_position = tok_s_to_ns_map[start_position]\n        if ns_start_position in orig_ns_to_s_map:\n            orig_start_position = orig_ns_to_s_map[ns_start_position]\n\n    if orig_start_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map start position\")\n        return orig_text\n\n    orig_end_position = None\n    if end_position in tok_s_to_ns_map:\n        ns_end_position = tok_s_to_ns_map[end_position]\n        if ns_end_position in orig_ns_to_s_map:\n            orig_end_position = orig_ns_to_s_map[ns_end_position]\n\n    if orig_end_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map end position\")\n        return orig_text\n\n    output_text = orig_text[orig_start_position : (orig_end_position + 1)]\n    return output_text\n\n\ndef _get_best_indexes(logits, n_best_size):\n    \"\"\"Get the n-best logits from a list.\"\"\"\n    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)\n\n    best_indexes = []\n    for i in range(len(index_and_score)):\n        if i >= n_best_size:\n            break\n        best_indexes.append(index_and_score[i][0])\n    return best_indexes\n\n\ndef _compute_softmax(scores):\n    \"\"\"Compute softmax probability over raw logits.\"\"\"\n    if not scores:\n        return []\n\n    max_score = None\n    for score in scores:\n        if max_score is None or score > max_score:\n            max_score = score\n\n    exp_scores = []\n    total_sum = 0.0\n    for score in scores:\n        x = math.exp(score - max_score)\n        exp_scores.append(x)\n        total_sum += x\n\n    probs = []\n    for score in exp_scores:\n        probs.append(score / total_sum)\n    return probs\n\n\ndef compute_predictions_logits(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    do_lower_case,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    verbose_logging,\n    version_2_with_negative,\n    null_score_diff_threshold,\n    tokenizer,\n):\n    \"\"\"Write final predictions to the json file and log-odds of null if needed.\"\"\"\n    if output_prediction_file:\n        logger.info(f\"Writing predictions to: {output_prediction_file}\")\n    if output_nbest_file:\n        logger.info(f\"Writing nbest to: {output_nbest_file}\")\n    if 
output_null_log_odds_file and version_2_with_negative:\n        logger.info(f\"Writing null_log_odds to: {output_null_log_odds_file}\")\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_logit\", \"end_logit\"]\n    )\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n        min_null_feature_index = 0  # the paragraph slice with min null score\n        null_start_logit = 0  # the start logit at the slice with min null score\n        null_end_logit = 0  # the end logit at the slice with min null score\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n            start_indexes = _get_best_indexes(result.start_logits, n_best_size)\n            end_indexes = _get_best_indexes(result.end_logits, n_best_size)\n            # if we could have irrelevant answers, get the min score of irrelevant\n            if version_2_with_negative:\n                feature_null_score = result.start_logits[0] + result.end_logits[0]\n                if feature_null_score < score_null:\n                    score_null = feature_null_score\n                    min_null_feature_index = feature_index\n                    null_start_logit = result.start_logits[0]\n                    null_end_logit = result.end_logits[0]\n            for start_index in start_indexes:\n                for end_index in end_indexes:\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= len(feature.tokens):\n                        continue\n                    if end_index >= len(feature.tokens):\n                        continue\n                    if start_index not in feature.token_to_orig_map:\n                        continue\n                    if end_index not in feature.token_to_orig_map:\n                        continue\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_logit=result.start_logits[start_index],\n                            end_logit=result.end_logits[end_index],\n                        )\n                    )\n        if version_2_with_negative:\n            prelim_predictions.append(\n                _PrelimPrediction(\n                    feature_index=min_null_feature_index,\n                    start_index=0,\n                    end_index=0,\n                    start_logit=null_start_logit,\n                    end_logit=null_end_logit,\n                )\n            )\n        prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True)\n\n        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n            \"NbestPrediction\", [\"text\", \"start_logit\", \"end_logit\"]\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n            if pred.start_index > 0:  # this is a non-null prediction\n                tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n                orig_doc_start = feature.token_to_orig_map[pred.start_index]\n                orig_doc_end = feature.token_to_orig_map[pred.end_index]\n                orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n\n                tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n                # tok_text = \" \".join(tok_tokens)\n                #\n                # # De-tokenize WordPieces that have been split off.\n                # tok_text = tok_text.replace(\" ##\", \"\")\n                # tok_text = tok_text.replace(\"##\", \"\")\n\n                # Clean whitespace\n                tok_text = tok_text.strip()\n                tok_text = \" \".join(tok_text.split())\n                orig_text = \" \".join(orig_tokens)\n\n                final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n                if final_text in seen_predictions:\n                    continue\n\n                seen_predictions[final_text] = True\n            else:\n                final_text = \"\"\n                seen_predictions[final_text] = True\n\n            nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit))\n        # if we didn't include the 
empty option in the n-best, include it\n        if version_2_with_negative:\n            if \"\" not in seen_predictions:\n                nbest.append(_NbestPrediction(text=\"\", start_logit=null_start_logit, end_logit=null_end_logit))\n\n            # In very rare edge cases we could only have single null prediction.\n            # So we just create a nonce prediction in this case to avoid failure.\n            if len(nbest) == 1:\n                nbest.insert(0, _NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        # In very rare edge cases we could have no valid predictions. So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        assert len(nbest) >= 1\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_logit + entry.end_logit)\n            if not best_non_null_entry:\n                if entry.text:\n                    best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_logit\"] = entry.start_logit\n            output[\"end_logit\"] = entry.end_logit\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n\n        if not version_2_with_negative:\n            all_predictions[example.qas_id] = nbest_json[0][\"text\"]\n        else:\n            # predict \"\" iff the null score - the score of best non-null > threshold\n            score_diff = score_null - best_non_null_entry.start_logit - (best_non_null_entry.end_logit)\n            scores_diff_json[example.qas_id] = score_diff\n            if score_diff > null_score_diff_threshold:\n                all_predictions[example.qas_id] = \"\"\n            else:\n                all_predictions[example.qas_id] = best_non_null_entry.text\n        all_nbest_json[example.qas_id] = nbest_json\n\n    if output_prediction_file:\n        with open(output_prediction_file, \"w\") as writer:\n            writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    if output_nbest_file:\n        with open(output_nbest_file, \"w\") as writer:\n            writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if output_null_log_odds_file and version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n\n\ndef compute_predictions_log_probs(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    start_n_top,\n    end_n_top,\n    version_2_with_negative,\n    tokenizer,\n    verbose_logging,\n):\n    \"\"\" XLNet write prediction logic (more complex than Bert's).\n        Write final predictions to the json file and log-odds of null if needed.\n\n        Requires utils_squad_evaluate.py\n    \"\"\"\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    _NbestPrediction = 
collections.namedtuple(  # pylint: disable=invalid-name\n        \"NbestPrediction\", [\"text\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    logger.info(\"Writing predictions to: %s\", output_prediction_file)\n    # logger.info(\"Writing nbest to: %s\" % (output_nbest_file))\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n\n            cur_null_score = result.cls_logits\n\n            # if we could have irrelevant answers, get the min score of irrelevant\n            score_null = min(score_null, cur_null_score)\n\n            for i in range(start_n_top):\n                for j in range(end_n_top):\n                    start_log_prob = result.start_logits[i]\n                    start_index = result.start_top_index[i]\n\n                    j_index = i * end_n_top + j\n\n                    end_log_prob = result.end_logits[j_index]\n                    end_index = result.end_top_index[j_index]\n\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= feature.paragraph_len - 1:\n                        continue\n                    if end_index >= feature.paragraph_len - 1:\n                        continue\n\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_log_prob=start_log_prob,\n                            end_log_prob=end_log_prob,\n                        )\n                    )\n\n        prelim_predictions = sorted(\n            prelim_predictions, key=lambda x: (x.start_log_prob + x.end_log_prob), reverse=True\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n\n            # XLNet un-tokenizer\n            # Let's keep it simple for now and see if we need all this later.\n            #\n            # tok_start_to_orig_index = feature.tok_start_to_orig_index\n            # tok_end_to_orig_index = feature.tok_end_to_orig_index\n            # start_orig_pos = tok_start_to_orig_index[pred.start_index]\n            # end_orig_pos = tok_end_to_orig_index[pred.end_index]\n            # paragraph_text = example.paragraph_text\n            # final_text = paragraph_text[start_orig_pos: end_orig_pos + 1].strip()\n\n            # Previously used Bert untokenizer\n            tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n            orig_doc_start = feature.token_to_orig_map[pred.start_index]\n            orig_doc_end = feature.token_to_orig_map[pred.end_index]\n            orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n            tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n            # Clean whitespace\n            tok_text = tok_text.strip()\n            tok_text = \" \".join(tok_text.split())\n            orig_text = \" \".join(orig_tokens)\n\n            if hasattr(tokenizer, \"do_lower_case\"):\n                do_lower_case = tokenizer.do_lower_case\n            else:\n                do_lower_case = tokenizer.do_lowercase_and_remove_accent\n\n            final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n\n            if final_text in seen_predictions:\n                continue\n\n            seen_predictions[final_text] = True\n\n            nbest.append(\n                _NbestPrediction(text=final_text, start_log_prob=pred.start_log_prob, end_log_prob=pred.end_log_prob)\n            )\n\n        # In very rare edge cases we could have no valid predictions. 
So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"\", start_log_prob=-1e6, end_log_prob=-1e6))\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_log_prob + entry.end_log_prob)\n            if not best_non_null_entry:\n                best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_log_prob\"] = entry.start_log_prob\n            output[\"end_log_prob\"] = entry.end_log_prob\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n        assert best_non_null_entry is not None\n\n        score_diff = score_null\n        scores_diff_json[example.qas_id] = score_diff\n        # note(zhiliny): always predict best_non_null_entry\n        # and the evaluation script will search for the best threshold\n        all_predictions[example.qas_id] = best_non_null_entry.text\n\n        all_nbest_json[example.qas_id] = nbest_json\n\n    with open(output_prediction_file, \"w\") as writer:\n        writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    with open(output_nbest_file, \"w\") as writer:\n        writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/processors/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels\nfrom .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features\nfrom .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor\nfrom .xnli import xnli_output_modes, xnli_processors, xnli_tasks_num_labels\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/processors/glue.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" GLUE processors and helpers \"\"\"\n\nimport logging\nimport os\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom .utils import DataProcessor, InputExample, InputFeatures\n\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef glue_convert_examples_to_features(\n    examples: Union[List[InputExample], \"tf.data.Dataset\"],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    \"\"\"\n    Loads a data file into a list of ``InputFeatures``\n\n    Args:\n        examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.\n        tokenizer: Instance of a tokenizer that will tokenize the examples\n        max_length: Maximum example length. Defaults to the tokenizer's max_len\n        task: GLUE task\n        label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n        output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n\n    Returns:\n        If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n        containing the task-specific features. 
If the input is a list of ``InputExamples``, will return\n        a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n    \"\"\"\n    if is_tf_available() and isinstance(examples, tf.data.Dataset):\n        if task is None:\n            raise ValueError(\"When calling glue_convert_examples_to_features from TF, the task parameter is required.\")\n        return _tf_glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n    return _glue_convert_examples_to_features(\n        examples, tokenizer, max_length=max_length, task=task, label_list=label_list, output_mode=output_mode\n    )\n\n\nif is_tf_available():\n\n    def _tf_glue_convert_examples_to_features(\n        examples: tf.data.Dataset, tokenizer: PreTrainedTokenizer, task=str, max_length: Optional[int] = None,\n    ) -> tf.data.Dataset:\n        \"\"\"\n        Returns:\n            A ``tf.data.Dataset`` containing the task-specific features.\n\n        \"\"\"\n        processor = glue_processors[task]()\n        examples = [processor.tfds_map(processor.get_example_from_tensor_dict(example)) for example in examples]\n        features = glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n\n        def gen():\n            for ex in features:\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                    },\n                    ex.label,\n                )\n\n        return tf.data.Dataset.from_generator(\n            gen,\n            ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32, \"token_type_ids\": tf.int32}, tf.int64),\n            (\n                {\n                    \"input_ids\": tf.TensorShape([None]),\n                    \"attention_mask\": tf.TensorShape([None]),\n                    \"token_type_ids\": tf.TensorShape([None]),\n                },\n                tf.TensorShape([]),\n            ),\n        )\n\n\ndef _glue_convert_examples_to_features(\n    examples: List[InputExample],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    if max_length is None:\n        max_length = tokenizer.max_len\n\n    if task is not None:\n        processor = glue_processors[task]()\n        if label_list is None:\n            label_list = processor.get_labels()\n            logger.info(\"Using label list %s for task %s\" % (label_list, task))\n        if output_mode is None:\n            output_mode = glue_output_modes[task]\n            logger.info(\"Using output mode %s for task %s\" % (output_mode, task))\n\n    label_map = {label: i for i, label in enumerate(label_list)}\n\n    def label_from_example(example: InputExample) -> Union[int, float, None]:\n        if example.label is None:\n            return None\n        if output_mode == \"classification\":\n            return label_map[example.label]\n        elif output_mode == \"regression\":\n            return float(example.label)\n        raise KeyError(output_mode)\n\n    labels = [label_from_example(example) for example in examples]\n\n    batch_encoding = tokenizer.batch_encode_plus(\n        [(example.text_a, example.text_b) for example in examples], max_length=max_length, pad_to_max_length=True,\n    )\n\n    features = []\n    for i in range(len(examples)):\n        inputs = {k: 
batch_encoding[k][i] for k in batch_encoding}\n\n        feature = InputFeatures(**inputs, label=labels[i])\n        features.append(feature)\n\n    for i, example in enumerate(examples[:5]):\n        logger.info(\"*** Example ***\")\n        logger.info(\"guid: %s\" % (example.guid))\n        logger.info(\"features: %s\" % features[i])\n\n    return features\n\n\nclass OutputMode(Enum):\n    classification = \"classification\"\n    regression = \"regression\"\n\n\nclass MrpcProcessor(DataProcessor):\n    \"\"\"Processor for the MRPC data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        logger.info(\"LOOKING AT {}\".format(os.path.join(data_dir, \"train.tsv\")))\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[3]\n            text_b = line[4]\n            label = None if set_type == \"test\" else line[0]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliProcessor(DataProcessor):\n    \"\"\"Processor for the MultiNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"premise\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"hypothesis\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_matched.tsv\")), \"dev_matched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_matched.tsv\")), \"test_matched\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n   
     for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[8]\n            text_b = line[9]\n            label = None if set_type.startswith(\"test\") else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliMismatchedProcessor(MnliProcessor):\n    \"\"\"Processor for the MultiNLI Mismatched data set (GLUE version).\"\"\"\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_mismatched.tsv\")), \"dev_mismatched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_mismatched.tsv\")), \"test_mismatched\")\n\n\nclass ColaProcessor(DataProcessor):\n    \"\"\"Processor for the CoLA data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        if test_mode:\n            lines = lines[1:]\n        text_index = 1 if test_mode else 3\n        examples = []\n        for (i, line) in enumerate(lines):\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if test_mode else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass Sst2Processor(DataProcessor):\n    \"\"\"Processor for the SST-2 data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return 
self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        text_index = 1 if set_type == \"test\" else 0\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if set_type == \"test\" else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass StsbProcessor(DataProcessor):\n    \"\"\"Processor for the STS-B data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [None]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[7]\n            text_b = line[8]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QqpProcessor(DataProcessor):\n    \"\"\"Processor for the QQP data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"question2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def 
_create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        q1_index = 1 if test_mode else 3\n        q2_index = 2 if test_mode else 4\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            try:\n                text_a = line[q1_index]\n                text_b = line[q2_index]\n                label = None if test_mode else line[5]\n            except IndexError:\n                continue\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QnliProcessor(DataProcessor):\n    \"\"\"Processor for the QNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", \"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass RteProcessor(DataProcessor):\n    \"\"\"Processor for the RTE data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", 
\"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass WnliProcessor(DataProcessor):\n    \"\"\"Processor for the WNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nglue_tasks_num_labels = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\nglue_processors = {\n    \"cola\": ColaProcessor,\n    \"mnli\": MnliProcessor,\n    \"mnli-mm\": MnliMismatchedProcessor,\n    \"mrpc\": MrpcProcessor,\n    \"sst-2\": Sst2Processor,\n    \"sts-b\": StsbProcessor,\n    \"qqp\": QqpProcessor,\n    \"qnli\": QnliProcessor,\n    \"rte\": RteProcessor,\n    \"wnli\": WnliProcessor,\n}\n\nglue_output_modes = {\n    \"cola\": \"classification\",\n    \"mnli\": \"classification\",\n    \"mnli-mm\": \"classification\",\n    \"mrpc\": \"classification\",\n    \"sst-2\": \"classification\",\n    \"sts-b\": \"regression\",\n    \"qqp\": \"classification\",\n    \"qnli\": \"classification\",\n    \"rte\": \"classification\",\n    \"wnli\": \"classification\",\n}\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/processors/squad.py",
    "content": "import json\nimport logging\nimport os\nfrom functools import partial\nfrom multiprocessing import Pool, cpu_count\n\nimport numpy as np\nfrom tqdm import tqdm\n\nfrom ...file_utils import is_tf_available, is_torch_available\nfrom ...tokenization_bert import whitespace_tokenize\nfrom .utils import DataProcessor\n\n\nif is_torch_available():\n    import torch\n    from torch.utils.data import TensorDataset\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):\n    \"\"\"Returns tokenized answer spans that better match the annotated answer.\"\"\"\n    tok_answer_text = \" \".join(tokenizer.tokenize(orig_answer_text))\n\n    for new_start in range(input_start, input_end + 1):\n        for new_end in range(input_end, new_start - 1, -1):\n            text_span = \" \".join(doc_tokens[new_start : (new_end + 1)])\n            if text_span == tok_answer_text:\n                return (new_start, new_end)\n\n    return (input_start, input_end)\n\n\ndef _check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span.start + doc_span.length - 1\n        if position < doc_span.start:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span.start\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _new_check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    # if len(doc_spans) == 1:\n    # return True\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span[\"start\"] + doc_span[\"length\"] - 1\n        if position < doc_span[\"start\"]:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span[\"start\"]\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span[\"length\"]\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _is_whitespace(c):\n    if c == \" \" or c == \"\\t\" or c == \"\\r\" or c == \"\\n\" or ord(c) == 0x202F:\n        return True\n    return False\n\n\ndef squad_convert_example_to_features(example, max_seq_length, doc_stride, max_query_length, is_training):\n    features = []\n    if is_training and not example.is_impossible:\n        # Get start and end position\n        start_position = example.start_position\n        end_position = example.end_position\n\n        # If the answer cannot be found in the text, then skip this example.\n        actual_text = \" \".join(example.doc_tokens[start_position : (end_position + 1)])\n        cleaned_answer_text = \" \".join(whitespace_tokenize(example.answer_text))\n        if actual_text.find(cleaned_answer_text) == -1:\n            logger.warning(\"Could not find 
answer: '%s' vs. '%s'\", actual_text, cleaned_answer_text)\n            return []\n\n    tok_to_orig_index = []\n    orig_to_tok_index = []\n    all_doc_tokens = []\n    for (i, token) in enumerate(example.doc_tokens):\n        orig_to_tok_index.append(len(all_doc_tokens))\n        sub_tokens = tokenizer.tokenize(token)\n        for sub_token in sub_tokens:\n            tok_to_orig_index.append(i)\n            all_doc_tokens.append(sub_token)\n\n    if is_training and not example.is_impossible:\n        tok_start_position = orig_to_tok_index[example.start_position]\n        if example.end_position < len(example.doc_tokens) - 1:\n            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1\n        else:\n            tok_end_position = len(all_doc_tokens) - 1\n\n        (tok_start_position, tok_end_position) = _improve_answer_span(\n            all_doc_tokens, tok_start_position, tok_end_position, tokenizer, example.answer_text\n        )\n\n    spans = []\n\n    truncated_query = tokenizer.encode(example.question_text, add_special_tokens=False, max_length=max_query_length)\n    sequence_added_tokens = (\n        tokenizer.max_len - tokenizer.max_len_single_sentence + 1\n        if \"roberta\" in str(type(tokenizer)) or \"camembert\" in str(type(tokenizer))\n        else tokenizer.max_len - tokenizer.max_len_single_sentence\n    )\n    sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair\n\n    span_doc_tokens = all_doc_tokens\n    while len(spans) * doc_stride < len(all_doc_tokens):\n\n        encoded_dict = tokenizer.encode_plus(\n            truncated_query if tokenizer.padding_side == \"right\" else span_doc_tokens,\n            span_doc_tokens if tokenizer.padding_side == \"right\" else truncated_query,\n            max_length=max_seq_length,\n            return_overflowing_tokens=True,\n            pad_to_max_length=True,\n            stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,\n            truncation_strategy=\"only_second\" if tokenizer.padding_side == \"right\" else \"only_first\",\n            return_token_type_ids=True,\n        )\n\n        paragraph_len = min(\n            len(all_doc_tokens) - len(spans) * doc_stride,\n            max_seq_length - len(truncated_query) - sequence_pair_added_tokens,\n        )\n\n        if tokenizer.pad_token_id in encoded_dict[\"input_ids\"]:\n            if tokenizer.padding_side == \"right\":\n                non_padded_ids = encoded_dict[\"input_ids\"][: encoded_dict[\"input_ids\"].index(tokenizer.pad_token_id)]\n            else:\n                last_padding_id_position = (\n                    len(encoded_dict[\"input_ids\"]) - 1 - encoded_dict[\"input_ids\"][::-1].index(tokenizer.pad_token_id)\n                )\n                non_padded_ids = encoded_dict[\"input_ids\"][last_padding_id_position + 1 :]\n\n        else:\n            non_padded_ids = encoded_dict[\"input_ids\"]\n\n        tokens = tokenizer.convert_ids_to_tokens(non_padded_ids)\n\n        token_to_orig_map = {}\n        for i in range(paragraph_len):\n            index = len(truncated_query) + sequence_added_tokens + i if tokenizer.padding_side == \"right\" else i\n            token_to_orig_map[index] = tok_to_orig_index[len(spans) * doc_stride + i]\n\n        encoded_dict[\"paragraph_len\"] = paragraph_len\n        encoded_dict[\"tokens\"] = tokens\n        encoded_dict[\"token_to_orig_map\"] = token_to_orig_map\n        
encoded_dict[\"truncated_query_with_special_tokens_length\"] = len(truncated_query) + sequence_added_tokens\n        encoded_dict[\"token_is_max_context\"] = {}\n        encoded_dict[\"start\"] = len(spans) * doc_stride\n        encoded_dict[\"length\"] = paragraph_len\n\n        spans.append(encoded_dict)\n\n        if \"overflowing_tokens\" not in encoded_dict:\n            break\n        span_doc_tokens = encoded_dict[\"overflowing_tokens\"]\n\n    for doc_span_index in range(len(spans)):\n        for j in range(spans[doc_span_index][\"paragraph_len\"]):\n            is_max_context = _new_check_is_max_context(spans, doc_span_index, doc_span_index * doc_stride + j)\n            index = (\n                j\n                if tokenizer.padding_side == \"left\"\n                else spans[doc_span_index][\"truncated_query_with_special_tokens_length\"] + j\n            )\n            spans[doc_span_index][\"token_is_max_context\"][index] = is_max_context\n\n    for span in spans:\n        # Identify the position of the CLS token\n        cls_index = span[\"input_ids\"].index(tokenizer.cls_token_id)\n\n        # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)\n        # Original TF implem also keep the classification token (set to 0)\n        p_mask = np.ones_like(span[\"token_type_ids\"])\n        if tokenizer.padding_side == \"right\":\n            p_mask[len(truncated_query) + sequence_added_tokens :] = 0\n        else:\n            p_mask[-len(span[\"tokens\"]) : -(len(truncated_query) + sequence_added_tokens)] = 0\n\n        pad_token_indices = np.where(span[\"input_ids\"] == tokenizer.pad_token_id)\n        special_token_indices = np.asarray(\n            tokenizer.get_special_tokens_mask(span[\"input_ids\"], already_has_special_tokens=True)\n        ).nonzero()\n\n        p_mask[pad_token_indices] = 1\n        p_mask[special_token_indices] = 1\n\n        # Set the cls index to 0: the CLS index can be used for impossible answers\n        p_mask[cls_index] = 0\n\n        span_is_impossible = example.is_impossible\n        start_position = 0\n        end_position = 0\n        if is_training and not span_is_impossible:\n            # For training, if our document chunk does not contain an annotation\n            # we throw it out, since there is nothing to predict.\n            doc_start = span[\"start\"]\n            doc_end = span[\"start\"] + span[\"length\"] - 1\n            out_of_span = False\n\n            if not (tok_start_position >= doc_start and tok_end_position <= doc_end):\n                out_of_span = True\n\n            if out_of_span:\n                start_position = cls_index\n                end_position = cls_index\n                span_is_impossible = True\n            else:\n                if tokenizer.padding_side == \"left\":\n                    doc_offset = 0\n                else:\n                    doc_offset = len(truncated_query) + sequence_added_tokens\n\n                start_position = tok_start_position - doc_start + doc_offset\n                end_position = tok_end_position - doc_start + doc_offset\n\n        features.append(\n            SquadFeatures(\n                span[\"input_ids\"],\n                span[\"attention_mask\"],\n                span[\"token_type_ids\"],\n                cls_index,\n                p_mask.tolist(),\n                example_index=0,  # Can not set unique_id and example_index here. 
They will be set after multiple processing.\n                unique_id=0,\n                paragraph_len=span[\"paragraph_len\"],\n                token_is_max_context=span[\"token_is_max_context\"],\n                tokens=span[\"tokens\"],\n                token_to_orig_map=span[\"token_to_orig_map\"],\n                start_position=start_position,\n                end_position=end_position,\n                is_impossible=span_is_impossible,\n                qas_id=example.qas_id,\n            )\n        )\n    return features\n\n\ndef squad_convert_example_to_features_init(tokenizer_for_convert):\n    global tokenizer\n    tokenizer = tokenizer_for_convert\n\n\ndef squad_convert_examples_to_features(\n    examples,\n    tokenizer,\n    max_seq_length,\n    doc_stride,\n    max_query_length,\n    is_training,\n    return_dataset=False,\n    threads=1,\n    tqdm_enabled=True,\n):\n    \"\"\"\n    Converts a list of examples into a list of features that can be directly given as input to a model.\n    It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.\n\n    Args:\n        examples: list of :class:`~transformers1.data.processors.squad.SquadExample`\n        tokenizer: an instance of a child of :class:`~transformers1.PreTrainedTokenizer`\n        max_seq_length: The maximum sequence length of the inputs.\n        doc_stride: The stride used when the context is too large and is split across several features.\n        max_query_length: The maximum length of the query.\n        is_training: whether to create features for model evaluation or model training.\n        return_dataset: Default False. Either 'pt' or 'tf'.\n            if 'pt': returns a torch.data.TensorDataset,\n            if 'tf': returns a tf.data.Dataset\n        threads: multiple processing threads.\n\n\n    Returns:\n        list of :class:`~transformers1.data.processors.squad.SquadFeatures`\n\n    Example::\n\n        processor = SquadV2Processor()\n        examples = processor.get_dev_examples(data_dir)\n\n        features = squad_convert_examples_to_features(\n            examples=examples,\n            tokenizer=tokenizer,\n            max_seq_length=args.max_seq_length,\n            doc_stride=args.doc_stride,\n            max_query_length=args.max_query_length,\n            is_training=not evaluate,\n        )\n    \"\"\"\n\n    # Defining helper methods\n    features = []\n    threads = min(threads, cpu_count())\n    with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:\n        annotate_ = partial(\n            squad_convert_example_to_features,\n            max_seq_length=max_seq_length,\n            doc_stride=doc_stride,\n            max_query_length=max_query_length,\n            is_training=is_training,\n        )\n        features = list(\n            tqdm(\n                p.imap(annotate_, examples, chunksize=32),\n                total=len(examples),\n                desc=\"convert squad examples to features\",\n                disable=not tqdm_enabled,\n            )\n        )\n    new_features = []\n    unique_id = 1000000000\n    example_index = 0\n    for example_features in tqdm(\n        features, total=len(features), desc=\"add example index and unique id\", disable=not tqdm_enabled\n    ):\n        if not example_features:\n            continue\n        for example_feature in example_features:\n            example_feature.example_index = example_index\n            example_feature.unique_id = 
unique_id\n            new_features.append(example_feature)\n            unique_id += 1\n        example_index += 1\n    features = new_features\n    del new_features\n    if return_dataset == \"pt\":\n        if not is_torch_available():\n            raise RuntimeError(\"PyTorch must be installed to return a PyTorch dataset.\")\n\n        # Convert to Tensors and build dataset\n        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n        all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n        all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)\n        all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)\n        all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)\n        all_is_impossible = torch.tensor([f.is_impossible for f in features], dtype=torch.float)\n\n        if not is_training:\n            all_feature_index = torch.arange(all_input_ids.size(0), dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids, all_attention_masks, all_token_type_ids, all_feature_index, all_cls_index, all_p_mask\n            )\n        else:\n            all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)\n            all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids,\n                all_attention_masks,\n                all_token_type_ids,\n                all_start_positions,\n                all_end_positions,\n                all_cls_index,\n                all_p_mask,\n                all_is_impossible,\n            )\n\n        return features, dataset\n    elif return_dataset == \"tf\":\n        if not is_tf_available():\n            raise RuntimeError(\"TensorFlow must be installed to return a TensorFlow dataset.\")\n\n        def gen():\n            for i, ex in enumerate(features):\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                        \"feature_index\": i,\n                        \"qas_id\": ex.qas_id,\n                    },\n                    {\n                        \"start_position\": ex.start_position,\n                        \"end_position\": ex.end_position,\n                        \"cls_index\": ex.cls_index,\n                        \"p_mask\": ex.p_mask,\n                        \"is_impossible\": ex.is_impossible,\n                    },\n                )\n\n        # Why have we split the batch into a tuple? 
PyTorch just has a list of tensors.\n        train_types = (\n            {\n                \"input_ids\": tf.int32,\n                \"attention_mask\": tf.int32,\n                \"token_type_ids\": tf.int32,\n                \"feature_index\": tf.int64,\n                \"qas_id\": tf.string,\n            },\n            {\n                \"start_position\": tf.int64,\n                \"end_position\": tf.int64,\n                \"cls_index\": tf.int64,\n                \"p_mask\": tf.int32,\n                \"is_impossible\": tf.int32,\n            },\n        )\n\n        train_shapes = (\n            {\n                \"input_ids\": tf.TensorShape([None]),\n                \"attention_mask\": tf.TensorShape([None]),\n                \"token_type_ids\": tf.TensorShape([None]),\n                \"feature_index\": tf.TensorShape([]),\n                \"qas_id\": tf.TensorShape([]),\n            },\n            {\n                \"start_position\": tf.TensorShape([]),\n                \"end_position\": tf.TensorShape([]),\n                \"cls_index\": tf.TensorShape([]),\n                \"p_mask\": tf.TensorShape([None]),\n                \"is_impossible\": tf.TensorShape([]),\n            },\n        )\n\n        return tf.data.Dataset.from_generator(gen, train_types, train_shapes)\n    else:\n        return features\n\n\nclass SquadProcessor(DataProcessor):\n    \"\"\"\n    Processor for the SQuAD data set.\n    Overriden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.\n    \"\"\"\n\n    train_file = None\n    dev_file = None\n\n    def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False):\n        if not evaluate:\n            answer = tensor_dict[\"answers\"][\"text\"][0].numpy().decode(\"utf-8\")\n            answer_start = tensor_dict[\"answers\"][\"answer_start\"][0].numpy()\n            answers = []\n        else:\n            answers = [\n                {\"answer_start\": start.numpy(), \"text\": text.numpy().decode(\"utf-8\")}\n                for start, text in zip(tensor_dict[\"answers\"][\"answer_start\"], tensor_dict[\"answers\"][\"text\"])\n            ]\n\n            answer = None\n            answer_start = None\n\n        return SquadExample(\n            qas_id=tensor_dict[\"id\"].numpy().decode(\"utf-8\"),\n            question_text=tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            context_text=tensor_dict[\"context\"].numpy().decode(\"utf-8\"),\n            answer_text=answer,\n            start_position_character=answer_start,\n            title=tensor_dict[\"title\"].numpy().decode(\"utf-8\"),\n            answers=answers,\n        )\n\n    def get_examples_from_dataset(self, dataset, evaluate=False):\n        \"\"\"\n        Creates a list of :class:`~transformers1.data.processors.squad.SquadExample` using a TFDS dataset.\n\n        Args:\n            dataset: The tfds dataset loaded from `tensorflow_datasets.load(\"squad\")`\n            evaluate: boolean specifying if in evaluation mode or in training mode\n\n        Returns:\n            List of SquadExample\n\n        Examples::\n\n            import tensorflow_datasets as tfds\n            dataset = tfds.load(\"squad\")\n\n            training_examples = get_examples_from_dataset(dataset, evaluate=False)\n            evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)\n        \"\"\"\n\n        if evaluate:\n            dataset = dataset[\"validation\"]\n        else:\n            dataset = 
dataset[\"train\"]\n\n        examples = []\n        for tensor_dict in tqdm(dataset):\n            examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))\n\n        return examples\n\n    def get_train_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the training examples from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the training file has a different name than the original one\n                which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.train_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.train_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"train\")\n\n    def get_dev_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the evaluation example from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the evaluation file has a different name than the original one\n                which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.dev_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.dev_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"dev\")\n\n    def _create_examples(self, input_data, set_type):\n        is_training = set_type == \"train\"\n        examples = []\n        for entry in tqdm(input_data):\n            title = entry[\"title\"]\n            for paragraph in entry[\"paragraphs\"]:\n                context_text = paragraph[\"context\"]\n                for qa in paragraph[\"qas\"]:\n                    qas_id = qa[\"id\"]\n                    question_text = qa[\"question\"]\n                    start_position_character = None\n                    answer_text = None\n                    answers = []\n\n                    if \"is_impossible\" in qa:\n                        is_impossible = qa[\"is_impossible\"]\n                    else:\n                        is_impossible = False\n\n                    if not is_impossible:\n                        if is_training:\n                            answer = qa[\"answers\"][0]\n                            answer_text = answer[\"text\"]\n                            start_position_character = answer[\"answer_start\"]\n                        else:\n                            answers = qa[\"answers\"]\n\n                    example = SquadExample(\n                        qas_id=qas_id,\n                        question_text=question_text,\n                        context_text=context_text,\n                        answer_text=answer_text,\n   
                     start_position_character=start_position_character,\n                        title=title,\n                        is_impossible=is_impossible,\n                        answers=answers,\n                    )\n\n                    examples.append(example)\n        return examples\n\n\nclass SquadV1Processor(SquadProcessor):\n    train_file = \"train-v1.1.json\"\n    dev_file = \"dev-v1.1.json\"\n\n\nclass SquadV2Processor(SquadProcessor):\n    train_file = \"train-v2.0.json\"\n    dev_file = \"dev-v2.0.json\"\n\n\nclass SquadExample(object):\n    \"\"\"\n    A single training/test example for the Squad dataset, as loaded from disk.\n\n    Args:\n        qas_id: The example's unique identifier\n        question_text: The question string\n        context_text: The context string\n        answer_text: The answer string\n        start_position_character: The character position of the start of the answer\n        title: The title of the example\n        answers: None by default, this is used during evaluation. Holds answers as well as their start positions.\n        is_impossible: False by default, set to True if the example has no possible answer.\n    \"\"\"\n\n    def __init__(\n        self,\n        qas_id,\n        question_text,\n        context_text,\n        answer_text,\n        start_position_character,\n        title,\n        answers=[],\n        is_impossible=False,\n    ):\n        self.qas_id = qas_id\n        self.question_text = question_text\n        self.context_text = context_text\n        self.answer_text = answer_text\n        self.title = title\n        self.is_impossible = is_impossible\n        self.answers = answers\n\n        self.start_position, self.end_position = 0, 0\n\n        doc_tokens = []\n        char_to_word_offset = []\n        prev_is_whitespace = True\n\n        # Split on whitespace so that different tokens may be attributed to their original position.\n        for c in self.context_text:\n            if _is_whitespace(c):\n                prev_is_whitespace = True\n            else:\n                if prev_is_whitespace:\n                    doc_tokens.append(c)\n                else:\n                    doc_tokens[-1] += c\n                prev_is_whitespace = False\n            char_to_word_offset.append(len(doc_tokens) - 1)\n\n        self.doc_tokens = doc_tokens\n        self.char_to_word_offset = char_to_word_offset\n\n        # Start and end positions only has a value during evaluation.\n        if start_position_character is not None and not is_impossible:\n            self.start_position = char_to_word_offset[start_position_character]\n            self.end_position = char_to_word_offset[\n                min(start_position_character + len(answer_text) - 1, len(char_to_word_offset) - 1)\n            ]\n\n\nclass SquadFeatures(object):\n    \"\"\"\n    Single squad example features to be fed to a model.\n    Those features are model-specific and can be crafted from :class:`~transformers1.data.processors.squad.SquadExample`\n    using the :method:`~transformers1.data.processors.squad.squad_convert_examples_to_features` method.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n        token_type_ids: Segment token indices to indicate first and second portions of the inputs.\n        cls_index: the index of the CLS token.\n        p_mask: Mask identifying tokens that can be answers vs. 
tokens that cannot.\n            Mask with 1 for tokens than cannot be in the answer and 0 for token that can be in an answer\n        example_index: the index of the example\n        unique_id: The unique Feature identifier\n        paragraph_len: The length of the context\n        token_is_max_context: List of booleans identifying which tokens have their maximum context in this feature object.\n            If a token does not have their maximum context in this feature object, it means that another feature object\n            has more information related to that token and should be prioritized over this feature for that token.\n        tokens: list of tokens corresponding to the input ids\n        token_to_orig_map: mapping between the tokens and the original text, needed in order to identify the answer.\n        start_position: start of the answer token index\n        end_position: end of the answer token index\n    \"\"\"\n\n    def __init__(\n        self,\n        input_ids,\n        attention_mask,\n        token_type_ids,\n        cls_index,\n        p_mask,\n        example_index,\n        unique_id,\n        paragraph_len,\n        token_is_max_context,\n        tokens,\n        token_to_orig_map,\n        start_position,\n        end_position,\n        is_impossible,\n        qas_id: str = None,\n    ):\n        self.input_ids = input_ids\n        self.attention_mask = attention_mask\n        self.token_type_ids = token_type_ids\n        self.cls_index = cls_index\n        self.p_mask = p_mask\n\n        self.example_index = example_index\n        self.unique_id = unique_id\n        self.paragraph_len = paragraph_len\n        self.token_is_max_context = token_is_max_context\n        self.tokens = tokens\n        self.token_to_orig_map = token_to_orig_map\n\n        self.start_position = start_position\n        self.end_position = end_position\n        self.is_impossible = is_impossible\n        self.qas_id = qas_id\n\n\nclass SquadResult(object):\n    \"\"\"\n    Constructs a SquadResult which can be used to evaluate a model's output on the SQuAD dataset.\n\n    Args:\n        unique_id: The unique identifier corresponding to that example.\n        start_logits: The logits corresponding to the start of the answer\n        end_logits: The logits corresponding to the end of the answer\n    \"\"\"\n\n    def __init__(self, unique_id, start_logits, end_logits, start_top_index=None, end_top_index=None, cls_logits=None):\n        self.start_logits = start_logits\n        self.end_logits = end_logits\n        self.unique_id = unique_id\n\n        if start_top_index:\n            self.start_top_index = start_top_index\n            self.end_top_index = end_top_index\n            self.cls_logits = cls_logits\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/processors/utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport csv\nimport dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available, is_torch_available\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass InputExample:\n    \"\"\"\n    A single training/test example for simple sequence classification.\n\n    Args:\n        guid: Unique id for the example.\n        text_a: string. The untokenized text of the first sequence. For single\n            sequence tasks, only this sequence must be specified.\n        text_b: (Optional) string. The untokenized text of the second sequence.\n            Only must be specified for sequence pair tasks.\n        label: (Optional) string. The label of the example. This should be\n            specified for train and dev examples, but not for test examples.\n    \"\"\"\n\n    guid: str\n    text_a: str\n    text_b: Optional[str] = None\n    label: Optional[str] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2) + \"\\n\"\n\n\n@dataclass(frozen=True)\nclass InputFeatures:\n    \"\"\"\n    A single set of features of data.\n    Property names are the same names as the corresponding inputs to a model.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            Usually  ``1`` for tokens that are NOT MASKED, ``0`` for MASKED (padded) tokens.\n        token_type_ids: (Optional) Segment token indices to indicate first and second\n            portions of the inputs. Only some models use them.\n        label: (Optional) Label corresponding to the input. 
Int for classification problems,\n            float for regression problems.\n    \"\"\"\n\n    input_ids: List[int]\n    attention_mask: Optional[List[int]] = None\n    token_type_ids: Optional[List[int]] = None\n    label: Optional[Union[int, float]] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self)) + \"\\n\"\n\n\nclass DataProcessor:\n    \"\"\"Base class for data converters for sequence classification data sets.\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"Gets an example from a dict with tensorflow tensors\n        Args:\n            tensor_dict: Keys and values should match the corresponding Glue\n                tensorflow_dataset examples.\n        \"\"\"\n        raise NotImplementedError()\n\n    def get_train_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the train set.\"\"\"\n        raise NotImplementedError()\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the dev set.\"\"\"\n        raise NotImplementedError()\n\n    def get_test_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the test set.\"\"\"\n        raise NotImplementedError()\n\n    def get_labels(self):\n        \"\"\"Gets the list of labels for this data set.\"\"\"\n        raise NotImplementedError()\n\n    def tfds_map(self, example):\n        \"\"\"Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are.\n        This method converts examples to the correct format.\"\"\"\n        if len(self.get_labels()) > 1:\n            example.label = self.get_labels()[int(example.label)]\n        return example\n\n    @classmethod\n    def _read_tsv(cls, input_file, quotechar=None):\n        \"\"\"Reads a tab separated value file.\"\"\"\n        with open(input_file, \"r\", encoding=\"utf-8-sig\") as f:\n            return list(csv.reader(f, delimiter=\"\\t\", quotechar=quotechar))\n\n\nclass SingleSentenceClassificationProcessor(DataProcessor):\n    \"\"\" Generic processor for a single sentence classification data set.\"\"\"\n\n    def __init__(self, labels=None, examples=None, mode=\"classification\", verbose=False):\n        self.labels = [] if labels is None else labels\n        self.examples = [] if examples is None else examples\n        self.mode = mode\n        self.verbose = verbose\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, idx):\n        if isinstance(idx, slice):\n            return SingleSentenceClassificationProcessor(labels=self.labels, examples=self.examples[idx])\n        return self.examples[idx]\n\n    @classmethod\n    def create_from_csv(\n        cls, file_name, split_name=\"\", column_label=0, column_text=1, column_id=None, skip_first_row=False, **kwargs\n    ):\n        processor = cls(**kwargs)\n        processor.add_examples_from_csv(\n            file_name,\n            split_name=split_name,\n            column_label=column_label,\n            column_text=column_text,\n            column_id=column_id,\n            skip_first_row=skip_first_row,\n            overwrite_labels=True,\n            overwrite_examples=True,\n        )\n        return processor\n\n    @classmethod\n    def create_from_examples(cls, texts_or_text_and_labels, labels=None, **kwargs):\n        processor = cls(**kwargs)\n        processor.add_examples(texts_or_text_and_labels, 
labels=labels)\n        return processor\n\n    def add_examples_from_csv(\n        self,\n        file_name,\n        split_name=\"\",\n        column_label=0,\n        column_text=1,\n        column_id=None,\n        skip_first_row=False,\n        overwrite_labels=False,\n        overwrite_examples=False,\n    ):\n        lines = self._read_tsv(file_name)\n        if skip_first_row:\n            lines = lines[1:]\n        texts = []\n        labels = []\n        ids = []\n        for (i, line) in enumerate(lines):\n            texts.append(line[column_text])\n            labels.append(line[column_label])\n            if column_id is not None:\n                ids.append(line[column_id])\n            else:\n                guid = \"%s-%s\" % (split_name, i) if split_name else \"%s\" % i\n                ids.append(guid)\n\n        return self.add_examples(\n            texts, labels, ids, overwrite_labels=overwrite_labels, overwrite_examples=overwrite_examples\n        )\n\n    def add_examples(\n        self, texts_or_text_and_labels, labels=None, ids=None, overwrite_labels=False, overwrite_examples=False\n    ):\n        assert labels is None or len(texts_or_text_and_labels) == len(labels)\n        assert ids is None or len(texts_or_text_and_labels) == len(ids)\n        if ids is None:\n            ids = [None] * len(texts_or_text_and_labels)\n        if labels is None:\n            labels = [None] * len(texts_or_text_and_labels)\n        examples = []\n        added_labels = set()\n        for (text_or_text_and_label, label, guid) in zip(texts_or_text_and_labels, labels, ids):\n            if isinstance(text_or_text_and_label, (tuple, list)) and label is None:\n                text, label = text_or_text_and_label\n            else:\n                text = text_or_text_and_label\n            added_labels.add(label)\n            examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))\n\n        # Update examples\n        if overwrite_examples:\n            self.examples = examples\n        else:\n            self.examples.extend(examples)\n\n        # Update labels\n        if overwrite_labels:\n            self.labels = list(added_labels)\n        else:\n            self.labels = list(set(self.labels).union(added_labels))\n\n        return self.examples\n\n    def get_features(\n        self,\n        tokenizer,\n        max_length=None,\n        pad_on_left=False,\n        pad_token=0,\n        mask_padding_with_zero=True,\n        return_tensors=None,\n    ):\n        \"\"\"\n        Convert examples in a list of ``InputFeatures``\n\n        Args:\n            tokenizer: Instance of a tokenizer that will tokenize the examples\n            max_length: Maximum example length\n            task: GLUE task\n            label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n            output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n            pad_on_left: If set to ``True``, the examples will be padded on the left rather than on the right (default)\n            pad_token: Padding token\n            mask_padding_with_zero: If set to ``True``, the attention mask will be filled by ``1`` for actual values\n                and by ``0`` for padded values. 
If set to ``False``, inverts it (``1`` for padded values, ``0`` for\n                actual values)\n\n        Returns:\n            If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n            containing the task-specific features. If the input is a list of ``InputExamples``, will return\n            a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n        \"\"\"\n        if max_length is None:\n            max_length = tokenizer.max_len\n\n        label_map = {label: i for i, label in enumerate(self.labels)}\n\n        all_input_ids = []\n        for (ex_index, example) in enumerate(self.examples):\n            if ex_index % 10000 == 0:\n                logger.info(\"Tokenizing example %d\", ex_index)\n\n            input_ids = tokenizer.encode(\n                example.text_a, add_special_tokens=True, max_length=min(max_length, tokenizer.max_len),\n            )\n            all_input_ids.append(input_ids)\n\n        batch_length = max(len(input_ids) for input_ids in all_input_ids)\n\n        features = []\n        for (ex_index, (input_ids, example)) in enumerate(zip(all_input_ids, self.examples)):\n            if ex_index % 10000 == 0:\n                logger.info(\"Writing example %d/%d\" % (ex_index, len(self.examples)))\n            # The mask has 1 for real tokens and 0 for padding tokens. Only real\n            # tokens are attended to.\n            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)\n\n            # Zero-pad up to the sequence length.\n            padding_length = batch_length - len(input_ids)\n            if pad_on_left:\n                input_ids = ([pad_token] * padding_length) + input_ids\n                attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask\n            else:\n                input_ids = input_ids + ([pad_token] * padding_length)\n                attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)\n\n            assert len(input_ids) == batch_length, \"Error with input length {} vs {}\".format(\n                len(input_ids), batch_length\n            )\n            assert len(attention_mask) == batch_length, \"Error with input length {} vs {}\".format(\n                len(attention_mask), batch_length\n            )\n\n            if self.mode == \"classification\":\n                label = label_map[example.label]\n            elif self.mode == \"regression\":\n                label = float(example.label)\n            else:\n                raise ValueError(self.mode)\n\n            if ex_index < 5 and self.verbose:\n                logger.info(\"*** Example ***\")\n                logger.info(\"guid: %s\" % (example.guid))\n                logger.info(\"input_ids: %s\" % \" \".join([str(x) for x in input_ids]))\n                logger.info(\"attention_mask: %s\" % \" \".join([str(x) for x in attention_mask]))\n                logger.info(\"label: %s (id = %d)\" % (example.label, label))\n\n            features.append(InputFeatures(input_ids=input_ids, attention_mask=attention_mask, label=label))\n\n        if return_tensors is None:\n            return features\n        elif return_tensors == \"tf\":\n            if not is_tf_available():\n                raise RuntimeError(\"return_tensors set to 'tf' but TensorFlow 2.0 can't be imported\")\n            import tensorflow as tf\n\n            def gen():\n                for ex in features:\n                    yield 
({\"input_ids\": ex.input_ids, \"attention_mask\": ex.attention_mask}, ex.label)\n\n            dataset = tf.data.Dataset.from_generator(\n                gen,\n                ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32}, tf.int64),\n                ({\"input_ids\": tf.TensorShape([None]), \"attention_mask\": tf.TensorShape([None])}, tf.TensorShape([])),\n            )\n            return dataset\n        elif return_tensors == \"pt\":\n            if not is_torch_available():\n                raise RuntimeError(\"return_tensors set to 'pt' but PyTorch can't be imported\")\n            import torch\n            from torch.utils.data import TensorDataset\n\n            all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n            all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n            if self.mode == \"classification\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            elif self.mode == \"regression\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.float)\n\n            dataset = TensorDataset(all_input_ids, all_attention_mask, all_labels)\n            return dataset\n        else:\n            raise ValueError(\"return_tensors should be one of 'tf' or 'pt'\")\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/data/processors/xnli.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XNLI utils (dataset loading and evaluation) \"\"\"\n\n\nimport logging\nimport os\n\nfrom .utils import DataProcessor, InputExample\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass XnliProcessor(DataProcessor):\n    \"\"\"Processor for the XNLI dataset.\n    Adapted from https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/run_classifier.py#L207\"\"\"\n\n    def __init__(self, language, train_language=None):\n        self.language = language\n        self.train_language = train_language\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lg = self.language if self.train_language is None else self.train_language\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-MT-1.0/multinli/multinli.train.{}.tsv\".format(lg)))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (\"train\", i)\n            text_a = line[0]\n            text_b = line[1]\n            label = \"contradiction\" if line[2] == \"contradictory\" else line[2]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-1.0/xnli.test.tsv\"))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            language = line[0]\n            if language != self.language:\n                continue\n            guid = \"%s-%s\" % (\"test\", i)\n            text_a = line[6]\n            text_b = line[7]\n            label = line[1]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n\nxnli_processors = {\n    \"xnli\": XnliProcessor,\n}\n\nxnli_output_modes = {\n    \"xnli\": \"classification\",\n}\n\nxnli_tasks_num_labels = {\n    \"xnli\": 3,\n}\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/file.py",
    "content": ""
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/file_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport fnmatch\nimport json\nimport logging\nimport os\nimport shutil\nimport sys\nimport tarfile\nimport tempfile\nfrom contextlib import contextmanager\nfrom functools import partial, wraps\nfrom hashlib import sha256\nfrom pathlib import Path\nfrom typing import Optional\nfrom urllib.parse import urlparse\nfrom zipfile import ZipFile, is_zipfile\n\nimport requests\nfrom filelock import FileLock\nfrom tqdm.auto import tqdm\n\nfrom . import __version__\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n    if USE_TORCH in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TF not in (\"1\", \"ON\", \"YES\"):\n        import torch\n\n        _torch_available = True  # pylint: disable=invalid-name\n        logger.info(\"PyTorch version {} available.\".format(torch.__version__))\n    else:\n        logger.info(\"Disabling PyTorch because USE_TF is set\")\n        _torch_available = False\nexcept ImportError:\n    _torch_available = False  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n\n    if USE_TF in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TORCH not in (\"1\", \"ON\", \"YES\"):\n        import tensorflow as tf\n\n        assert hasattr(tf, \"__version__\") and int(tf.__version__[0]) >= 2\n        _tf_available = True  # pylint: disable=invalid-name\n        logger.info(\"TensorFlow version {} available.\".format(tf.__version__))\n    else:\n        logger.info(\"Disabling Tensorflow because USE_TORCH is set\")\n        _tf_available = False\nexcept (ImportError, AssertionError):\n    _tf_available = False  # pylint: disable=invalid-name\n\n\ntry:\n    from torch.hub import _get_torch_home\n\n    torch_cache_home = _get_torch_home()\nexcept ImportError:\n    torch_cache_home = os.path.expanduser(\n        os.getenv(\"TORCH_HOME\", os.path.join(os.getenv(\"XDG_CACHE_HOME\", \"~/.cache\"), \"torch\"))\n    )\ndefault_cache_path = os.path.join(torch_cache_home, \"transformers1\")\n\n\nPYTORCH_PRETRAINED_BERT_CACHE = os.getenv(\"PYTORCH_PRETRAINED_BERT_CACHE\", default_cache_path)\nPYTORCH_TRANSFORMERS_CACHE = os.getenv(\"PYTORCH_TRANSFORMERS_CACHE\", PYTORCH_PRETRAINED_BERT_CACHE)\nTRANSFORMERS_CACHE = os.getenv(\"TRANSFORMERS_CACHE\", PYTORCH_TRANSFORMERS_CACHE)\n\nWEIGHTS_NAME = \"pytorch_model.bin\"\nTF2_WEIGHTS_NAME = \"tf_model.h5\"\nTF_WEIGHTS_NAME = \"model.ckpt\"\nCONFIG_NAME = \"config.json\"\nMODEL_CARD_NAME = \"modelcard.json\"\n\n\nMULTIPLE_CHOICE_DUMMY_INPUTS = [[[0], [1]], [[0], [1]]]\nDUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]\nDUMMY_MASK = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]\n\nS3_BUCKET_PREFIX = \"https://s3.amazonaws.com/models.huggingface.co/bert\"\nCLOUDFRONT_DISTRIB_PREFIX = \"https://cdn.huggingface.co\"\n\n\ndef is_torch_available():\n    return _torch_available\n\n\ndef is_tf_available():\n    return _tf_available\n\n\ndef add_start_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef 
add_start_docstrings_to_callable(*docstr):\n    def docstring_decorator(fn):\n        class_name = \":class:`~transformers1.{}`\".format(fn.__qualname__.split(\".\")[0])\n        intro = \"   The {} forward method, overrides the :func:`__call__` special method.\".format(class_name)\n        note = r\"\"\"\n\n    .. note::\n        Although the recipe for forward pass needs to be defined within\n        this function, one should call the :class:`Module` instance afterwards\n        instead of this since the former takes care of running the\n        pre and post processing steps while the latter silently ignores them.\n        \"\"\"\n        fn.__doc__ = intro + note + \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef add_end_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = fn.__doc__ + \"\".join(docstr)\n        return fn\n\n    return docstring_decorator\n\n\ndef is_remote_url(url_or_filename):\n    parsed = urlparse(url_or_filename)\n    return parsed.scheme in (\"http\", \"https\")\n\n\ndef hf_bucket_url(model_id: str, filename: str, use_cdn=True) -> str:\n    \"\"\"\n    Resolve a model identifier, and a file name, to a HF-hosted url\n    on either S3 or Cloudfront (a Content Delivery Network, or CDN).\n\n    Cloudfront is replicated over the globe so downloads are way faster\n    for the end user (and it also lowers our bandwidth costs). However, it\n    is more aggressively cached by default, so may not always reflect the\n    latest changes to the underlying file (default TTL is 24 hours).\n\n    In terms of client-side caching from this library, even though\n    Cloudfront relays the ETags from S3, using one or the other\n    (or switching from one to the other) will affect caching: cached files\n    are not shared between the two because the cached file's name contains\n    a hash of the url.\n    \"\"\"\n    endpoint = CLOUDFRONT_DISTRIB_PREFIX if use_cdn else S3_BUCKET_PREFIX\n    legacy_format = \"/\" not in model_id\n    if legacy_format:\n        return f\"{endpoint}/{model_id}-{filename}\"\n    else:\n        return f\"{endpoint}/{model_id}/{filename}\"\n\n\ndef url_to_filename(url, etag=None):\n    \"\"\"\n    Convert `url` into a hashed filename in a repeatable way.\n    If `etag` is specified, append its hash to the url's, delimited\n    by a period.\n    If the url ends with .h5 (Keras HDF5 weights) adds '.h5' to the name\n    so that TF 2.0 can identify it as a HDF5 file\n    (see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1380)\n    \"\"\"\n    url_bytes = url.encode(\"utf-8\")\n    url_hash = sha256(url_bytes)\n    filename = url_hash.hexdigest()\n\n    if etag:\n        etag_bytes = etag.encode(\"utf-8\")\n        etag_hash = sha256(etag_bytes)\n        filename += \".\" + etag_hash.hexdigest()\n\n    if url.endswith(\".h5\"):\n        filename += \".h5\"\n\n    return filename\n\n\ndef filename_to_url(filename, cache_dir=None):\n    \"\"\"\n    Return the url and etag (which may be ``None``) stored for `filename`.\n    Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    cache_path = os.path.join(cache_dir, filename)\n    if not os.path.exists(cache_path):\n        raise 
EnvironmentError(\"file {} not found\".format(cache_path))\n\n    meta_path = cache_path + \".json\"\n    if not os.path.exists(meta_path):\n        raise EnvironmentError(\"file {} not found\".format(meta_path))\n\n    with open(meta_path, encoding=\"utf-8\") as meta_file:\n        metadata = json.load(meta_file)\n    url = metadata[\"url\"]\n    etag = metadata[\"etag\"]\n\n    return url, etag\n\n\ndef cached_path(\n    url_or_filename,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    resume_download=False,\n    user_agent=None,\n    extract_compressed_file=False,\n    force_extract=False,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given something that might be a URL (or might be a local path),\n    determine which. If it's a URL, download the file and cache it, and\n    return the path to the cached file. If it's already a local path,\n    make sure the file exists and then return the path.\n    Args:\n        cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).\n        force_download: if True, re-dowload the file even if it's already cached in the cache dir.\n        resume_download: if True, resume the download if incompletly recieved file is found.\n        user_agent: Optional string or dict that will be appended to the user-agent on remote requests.\n        extract_compressed_file: if True and the path point to a zip or tar file, extract the compressed\n            file in a folder along the archive.\n        force_extract: if True when extract_compressed_file is True and the archive was already extracted,\n            re-extract the archive and overide the folder where it was extracted.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(url_or_filename, Path):\n        url_or_filename = str(url_or_filename)\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    if is_remote_url(url_or_filename):\n        # URL, so get it from the cache (downloading if necessary)\n        output_path = get_from_cache(\n            url_or_filename,\n            cache_dir=cache_dir,\n            force_download=force_download,\n            proxies=proxies,\n            resume_download=resume_download,\n            user_agent=user_agent,\n            local_files_only=local_files_only,\n        )\n    elif os.path.exists(url_or_filename):\n        # File, and it exists.\n        output_path = url_or_filename\n    elif urlparse(url_or_filename).scheme == \"\":\n        # File, but it doesn't exist.\n        raise EnvironmentError(\"file {} not found\".format(url_or_filename))\n    else:\n        # Something unknown\n        raise ValueError(\"unable to parse {} as a URL or as a local path\".format(url_or_filename))\n\n    if extract_compressed_file:\n        if not is_zipfile(output_path) and not tarfile.is_tarfile(output_path):\n            return output_path\n\n        # Path where we extract compressed archives\n        # We avoid '.' 
in dir name and add \"-extracted\" at the end: \"./model.zip\" => \"./model-zip-extracted/\"\n        output_dir, output_file = os.path.split(output_path)\n        output_extract_dir_name = output_file.replace(\".\", \"-\") + \"-extracted\"\n        output_path_extracted = os.path.join(output_dir, output_extract_dir_name)\n\n        if os.path.isdir(output_path_extracted) and os.listdir(output_path_extracted) and not force_extract:\n            return output_path_extracted\n\n        # Prevent parallel extractions\n        lock_path = output_path + \".lock\"\n        with FileLock(lock_path):\n            shutil.rmtree(output_path_extracted, ignore_errors=True)\n            os.makedirs(output_path_extracted)\n            if is_zipfile(output_path):\n                with ZipFile(output_path, \"r\") as zip_file:\n                    zip_file.extractall(output_path_extracted)\n                    zip_file.close()\n            elif tarfile.is_tarfile(output_path):\n                tar_file = tarfile.open(output_path)\n                tar_file.extractall(output_path_extracted)\n                tar_file.close()\n            else:\n                raise EnvironmentError(\"Archive format of {} could not be identified\".format(output_path))\n\n        return output_path_extracted\n\n    return output_path\n\n\ndef http_get(url, temp_file, proxies=None, resume_size=0, user_agent=None):\n    ua = \"transformers1/{}; python/{}\".format(__version__, sys.version.split()[0])\n    if is_torch_available():\n        ua += \"; torch/{}\".format(torch.__version__)\n    if is_tf_available():\n        ua += \"; tensorflow/{}\".format(tf.__version__)\n    if isinstance(user_agent, dict):\n        ua += \"; \" + \"; \".join(\"{}/{}\".format(k, v) for k, v in user_agent.items())\n    elif isinstance(user_agent, str):\n        ua += \"; \" + user_agent\n    headers = {\"user-agent\": ua}\n    if resume_size > 0:\n        headers[\"Range\"] = \"bytes=%d-\" % (resume_size,)\n    response = requests.get(url, stream=True, proxies=proxies, headers=headers)\n    if response.status_code == 416:  # Range not satisfiable\n        return\n    content_length = response.headers.get(\"Content-Length\")\n    total = resume_size + int(content_length) if content_length is not None else None\n    progress = tqdm(\n        unit=\"B\",\n        unit_scale=True,\n        total=total,\n        initial=resume_size,\n        desc=\"Downloading\",\n        disable=bool(logger.getEffectiveLevel() == logging.NOTSET),\n    )\n    for chunk in response.iter_content(chunk_size=1024):\n        if chunk:  # filter out keep-alive new chunks\n            progress.update(len(chunk))\n            temp_file.write(chunk)\n    progress.close()\n\n\ndef get_from_cache(\n    url,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    etag_timeout=10,\n    resume_download=False,\n    user_agent=None,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given a URL, look for the corresponding file in the local cache.\n    If it's not there, download it. 
Then return the path to the cached file.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    os.makedirs(cache_dir, exist_ok=True)\n\n    etag = None\n    if not local_files_only:\n        try:\n            response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)\n            if response.status_code == 200:\n                etag = response.headers.get(\"ETag\")\n        except (EnvironmentError, requests.exceptions.Timeout):\n            # etag is already None\n            pass\n\n    filename = url_to_filename(url, etag)\n\n    # get cache path to put the file\n    cache_path = os.path.join(cache_dir, filename)\n\n    # etag is None = we don't have a connection, or url doesn't exist, or is otherwise inaccessible.\n    # try to get the last downloaded one\n    if etag is None:\n        if os.path.exists(cache_path):\n            return cache_path\n        else:\n            matching_files = [\n                file\n                for file in fnmatch.filter(os.listdir(cache_dir), filename + \".*\")\n                if not file.endswith(\".json\") and not file.endswith(\".lock\")\n            ]\n            if len(matching_files) > 0:\n                return os.path.join(cache_dir, matching_files[-1])\n            else:\n                # If files cannot be found and local_files_only=True,\n                # the models might've been found if local_files_only=False\n                # Notify the user about that\n                if local_files_only:\n                    raise ValueError(\n                        \"Cannot find the requested files in the cached path and outgoing traffic has been\"\n                        \" disabled. 
To enable model look-ups and downloads online, set 'local_files_only'\"\n                        \" to False.\"\n                    )\n                return None\n\n    # From now on, etag is not None.\n    if os.path.exists(cache_path) and not force_download:\n        return cache_path\n\n    # Prevent parallel downloads of the same file with a lock.\n    lock_path = cache_path + \".lock\"\n    with FileLock(lock_path):\n\n        # If the download just completed while the lock was activated.\n        if os.path.exists(cache_path) and not force_download:\n            # Even if returning early like here, the lock will be released.\n            return cache_path\n\n        if resume_download:\n            incomplete_path = cache_path + \".incomplete\"\n\n            @contextmanager\n            def _resumable_file_manager():\n                with open(incomplete_path, \"a+b\") as f:\n                    yield f\n\n            temp_file_manager = _resumable_file_manager\n            if os.path.exists(incomplete_path):\n                resume_size = os.stat(incomplete_path).st_size\n            else:\n                resume_size = 0\n        else:\n            temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False)\n            resume_size = 0\n\n        # Download to temporary file, then copy to cache dir once finished.\n        # Otherwise you get corrupt cache entries if the download gets interrupted.\n        with temp_file_manager() as temp_file:\n            logger.info(\"%s not found in cache or force_download set to True, downloading to %s\", url, temp_file.name)\n\n            http_get(url, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)\n\n        logger.info(\"storing %s in cache at %s\", url, cache_path)\n        os.replace(temp_file.name, cache_path)\n\n        logger.info(\"creating metadata file for %s\", cache_path)\n        meta = {\"url\": url, \"etag\": etag}\n        meta_path = cache_path + \".json\"\n        with open(meta_path, \"w\") as meta_file:\n            json.dump(meta, meta_file)\n\n    return cache_path\n\n\nclass cached_property(property):\n    \"\"\"\n    Descriptor that mimics @property but caches output in member variable.\n\n    From tensorflow_datasets\n\n    Built-in in functools from Python 3.8.\n    \"\"\"\n\n    def __get__(self, obj, objtype=None):\n        # See docs.python.org/3/howto/descriptor.html#properties\n        if obj is None:\n            return self\n        if self.fget is None:\n            raise AttributeError(\"unreadable attribute\")\n        attr = \"__cached_\" + self.fget.__name__\n        cached = getattr(obj, attr, None)\n        if cached is None:\n            cached = self.fget(obj)\n            setattr(obj, attr, cached)\n        return cached\n\n\ndef torch_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_torch_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires PyTorch.\")\n\n    return wrapper\n\n\ndef tf_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_tf_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires TF.\")\n\n    return wrapper\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/filep.py",
    "content": "from transformers import GPT2LMHeadModel, GPT2Tokenizer\nimport torch\n\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\nmodel = GPT2LMHeadModel.from_pretrained('gpt2')\n\ngenerated = tokenizer.encode(\"The Manhattan bridge\")\ncontext = torch.tensor([generated])\npast = None\n\nfor i in range(15):\n    output, past = model(context, past=past)\n\n    distribution = output[0, :]\n\n    # Get the top 10 values' indices and cast them to a list\n    top_values = distribution[-1].topk(10).indices.tolist()\n\n    # Decode those into words\n    top_words = [tokenizer.decode([x]) for x in top_values.indices.tolist()]\n\n    # select words (only arbitrarily select the first three)\n    words = words[0:3]\n\n    # Cast them back to tokens which can be used as an added token\n    selected_tokens = [tokenizer.encode(word) for word in words]\n\n    generated += [argmax_token.tolist()]\n    context = argmax_token.unsqueeze(0)\n\n    print(tokenizer.decode([argmax_token.tolist()]))\n\nsequence = tokenizer.decode(generated)\n\nprint(sequence)"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/hf_api.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport io\nimport os\nfrom os.path import expanduser\nfrom typing import Dict, List, Optional, Tuple\n\nimport requests\nfrom tqdm import tqdm\n\n\nENDPOINT = \"https://huggingface.co\"\n\n\nclass S3Obj:\n    \"\"\"\n    Data structure that represents a file belonging to the current user.\n    \"\"\"\n\n    def __init__(self, filename: str, LastModified: str, ETag: str, Size: int, **kwargs):\n        self.filename = filename\n        self.LastModified = LastModified\n        self.ETag = ETag\n        self.Size = Size\n\n\nclass PresignedUrl:\n    def __init__(self, write: str, access: str, type: str, **kwargs):\n        self.write = write\n        self.access = access\n        self.type = type  # mime-type to send to S3.\n\n\nclass S3Object:\n    \"\"\"\n    Data structure that represents a public file accessible on our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        key: str,  # S3 object key\n        etag: str,\n        lastModified: str,\n        size: int,\n        rfilename: str,  # filename relative to config.json\n        **kwargs\n    ):\n        self.key = key\n        self.etag = etag\n        self.lastModified = lastModified\n        self.size = size\n        self.rfilename = rfilename\n\n\nclass ModelInfo:\n    \"\"\"\n    Info about a public model accessible from our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        modelId: str,  # id of model\n        key: str,  # S3 object key of config.json\n        author: Optional[str] = None,\n        downloads: Optional[int] = None,\n        tags: List[str] = [],\n        siblings: List[Dict] = [],  # list of files that constitute the model\n        **kwargs\n    ):\n        self.modelId = modelId\n        self.key = key\n        self.author = author\n        self.downloads = downloads\n        self.tags = tags\n        self.siblings = [S3Object(**x) for x in siblings]\n\n\nclass HfApi:\n    def __init__(self, endpoint=None):\n        self.endpoint = endpoint if endpoint is not None else ENDPOINT\n\n    def login(self, username: str, password: str) -> str:\n        \"\"\"\n        Call HF API to sign in a user and get a token if credentials are valid.\n\n        Outputs:\n            token if credentials are valid\n\n        Throws:\n            requests.exceptions.HTTPError if credentials are invalid\n        \"\"\"\n        path = \"{}/api/login\".format(self.endpoint)\n        r = requests.post(path, json={\"username\": username, \"password\": password})\n        r.raise_for_status()\n        d = r.json()\n        return d[\"token\"]\n\n    def whoami(self, token: str) -> Tuple[str, List[str]]:\n        \"\"\"\n        Call HF API to know \"whoami\"\n        \"\"\"\n        path = \"{}/api/whoami\".format(self.endpoint)\n        r = requests.get(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        
return d[\"user\"], d[\"orgs\"]\n\n    def logout(self, token: str) -> None:\n        \"\"\"\n        Call HF API to log out.\n        \"\"\"\n        path = \"{}/api/logout\".format(self.endpoint)\n        r = requests.post(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n\n    def presign(self, token: str, filename: str, organization: Optional[str] = None) -> PresignedUrl:\n        \"\"\"\n        Call HF API to get a presigned url to upload `filename` to S3.\n        \"\"\"\n        path = \"{}/api/presign\".format(self.endpoint)\n        r = requests.post(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n        d = r.json()\n        return PresignedUrl(**d)\n\n    def presign_and_upload(self, token: str, filename: str, filepath: str, organization: Optional[str] = None) -> str:\n        \"\"\"\n        Get a presigned url, then upload file to S3.\n\n        Outputs:\n            url: Read-only url for the stored file on S3.\n        \"\"\"\n        urls = self.presign(token, filename=filename, organization=organization)\n        # streaming upload:\n        # https://2.python-requests.org/en/master/user/advanced/#streaming-uploads\n        #\n        # Even though we presign with the correct content-type,\n        # the client still has to specify it when uploading the file.\n        with open(filepath, \"rb\") as f:\n            pf = TqdmProgressFileReader(f)\n            data = f if pf.total_size > 0 else \"\"\n\n            r = requests.put(urls.write, data=data, headers={\"content-type\": urls.type})\n            r.raise_for_status()\n            pf.close()\n        return urls.access\n\n    def list_objs(self, token: str, organization: Optional[str] = None) -> List[S3Obj]:\n        \"\"\"\n        Call HF API to list all stored files for user (or one of their organizations).\n        \"\"\"\n        path = \"{}/api/listObjs\".format(self.endpoint)\n        params = {\"organization\": organization} if organization is not None else None\n        r = requests.get(path, params=params, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        return [S3Obj(**x) for x in d]\n\n    def delete_obj(self, token: str, filename: str, organization: Optional[str] = None):\n        \"\"\"\n        Call HF API to delete a file stored by user\n        \"\"\"\n        path = \"{}/api/deleteObj\".format(self.endpoint)\n        r = requests.delete(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n\n    def model_list(self) -> List[ModelInfo]:\n        \"\"\"\n        Get the public list of all the models on huggingface, including the community models\n        \"\"\"\n        path = \"{}/api/models\".format(self.endpoint)\n        r = requests.get(path)\n        r.raise_for_status()\n        d = r.json()\n        return [ModelInfo(**x) for x in d]\n\n\nclass TqdmProgressFileReader:\n    \"\"\"\n    Wrap an io.BufferedReader `f` (such as the output of `open(…, \"rb\")`)\n    and override `f.read()` so as to display a tqdm progress bar.\n\n    see github.com/huggingface/transformers1/pull/2078#discussion_r354739608\n    for implementation details.\n    \"\"\"\n\n    def __init__(self, 
f: io.BufferedReader):\n        self.f = f\n        self.total_size = os.fstat(f.fileno()).st_size\n        self.pbar = tqdm(total=self.total_size, leave=False)\n        self.read = f.read\n        f.read = self._read\n\n    def _read(self, n=-1):\n        self.pbar.update(n)\n        return self.read(n)\n\n    def close(self):\n        self.pbar.close()\n\n\nclass HfFolder:\n    path_token = expanduser(\"~/.huggingface/token\")\n\n    @classmethod\n    def save_token(cls, token):\n        \"\"\"\n        Save token, creating folder as needed.\n        \"\"\"\n        os.makedirs(os.path.dirname(cls.path_token), exist_ok=True)\n        with open(cls.path_token, \"w+\") as f:\n            f.write(token)\n\n    @classmethod\n    def get_token(cls):\n        \"\"\"\n        Get token or None if not existent.\n        \"\"\"\n        try:\n            with open(cls.path_token, \"r\") as f:\n                return f.read()\n        except FileNotFoundError:\n            pass\n\n    @classmethod\n    def delete_token(cls):\n        \"\"\"\n        Delete token.\n        Do not fail if token does not exist.\n        \"\"\"\n        try:\n            os.remove(cls.path_token)\n        except FileNotFoundError:\n            pass\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/hf_argparser.py",
    "content": "import dataclasses\nimport json\nimport sys\nfrom argparse import ArgumentParser\nfrom enum import Enum\nfrom pathlib import Path\nfrom typing import Any, Iterable, List, NewType, Tuple, Union\n\n\nDataClass = NewType(\"DataClass\", Any)\nDataClassType = NewType(\"DataClassType\", Any)\n\n\nclass HfArgumentParser(ArgumentParser):\n    \"\"\"\n    This subclass of `argparse.ArgumentParser` uses type hints on dataclasses\n    to generate arguments.\n\n    The class is designed to play well with the native argparse. In particular,\n    you can add more (non-dataclass backed) arguments to the parser after initialization\n    and you'll get the output back after parsing as an additional namespace.\n    \"\"\"\n\n    dataclass_types: Iterable[DataClassType]\n\n    def __init__(self, dataclass_types: Union[DataClassType, Iterable[DataClassType]], **kwargs):\n        \"\"\"\n        Args:\n            dataclass_types:\n                Dataclass type, or list of dataclass types for which we will \"fill\" instances\n                with the parsed args.\n            kwargs:\n                (Optional) Passed to `argparse.ArgumentParser()` in the regular way.\n        \"\"\"\n        super().__init__(**kwargs)\n        if dataclasses.is_dataclass(dataclass_types):\n            dataclass_types = [dataclass_types]\n        self.dataclass_types = dataclass_types\n        for dtype in self.dataclass_types:\n            self._add_dataclass_arguments(dtype)\n\n    def _add_dataclass_arguments(self, dtype: DataClassType):\n        for field in dataclasses.fields(dtype):\n            field_name = f\"--{field.name}\"\n            kwargs = field.metadata.copy()\n            # field.metadata is not used at all by Data Classes,\n            # it is provided as a third-party extension mechanism.\n            if isinstance(field.type, str):\n                raise ImportError(\n                    \"This implementation is not compatible with Postponed Evaluation of Annotations (PEP 563),\"\n                    \"which can be opted in from Python 3.7 with `from __future__ import annotations`.\"\n                    \"We will add compatibility when Python 3.9 is released.\"\n                )\n            typestring = str(field.type)\n            for prim_type in (int, float, str):\n                for collection in (List,):\n                    if typestring == f\"typing.Union[{collection[prim_type]}, NoneType]\":\n                        field.type = collection[prim_type]\n                if typestring == f\"typing.Union[{prim_type.__name__}, NoneType]\":\n                    field.type = prim_type\n\n            if isinstance(field.type, type) and issubclass(field.type, Enum):\n                kwargs[\"choices\"] = list(field.type)\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n            elif field.type is bool:\n                kwargs[\"action\"] = \"store_false\" if field.default is True else \"store_true\"\n                if field.default is True:\n                    field_name = f\"--no-{field.name}\"\n                    kwargs[\"dest\"] = field.name\n            elif hasattr(field.type, \"__origin__\") and issubclass(field.type.__origin__, List):\n                kwargs[\"nargs\"] = \"+\"\n                kwargs[\"type\"] = field.type.__args__[0]\n                assert all(\n                    x == kwargs[\"type\"] for x in field.type.__args__\n                ), \"{} 
cannot be a List of mixed types\".format(field.name)\n                if field.default_factory is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default_factory()\n            else:\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n                else:\n                    kwargs[\"required\"] = True\n            self.add_argument(field_name, **kwargs)\n\n    def parse_args_into_dataclasses(\n        self, args=None, return_remaining_strings=False, look_for_args_file=True\n    ) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Parse command-line args into instances of the specified dataclass types.\n\n        This relies on argparse's `ArgumentParser.parse_known_args`.\n        See the doc at:\n        docs.python.org/3.7/library/argparse.html#argparse.ArgumentParser.parse_args\n\n        Args:\n            args:\n                List of strings to parse. The default is taken from sys.argv.\n                (same as argparse.ArgumentParser)\n            return_remaining_strings:\n                If true, also return a list of remaining argument strings.\n            look_for_args_file:\n                If true, will look for a \".args\" file with the same base name\n                as the entry point script for this process, and will append its\n                potential content to the command line args.\n\n        Returns:\n            Tuple consisting of:\n                - the dataclass instances in the same order as they\n                  were passed to the initializer.abspath\n                - if applicable, an additional namespace for more\n                  (non-dataclass backed) arguments added to the parser\n                  after initialization.\n                - The potential list of remaining argument strings.\n                  (same as argparse.ArgumentParser.parse_known_args)\n        \"\"\"\n        if look_for_args_file and len(sys.argv):\n            args_file = Path(sys.argv[0]).with_suffix(\".args\")\n            if args_file.exists():\n                fargs = args_file.read_text().split()\n                args = fargs + args if args is not None else fargs + sys.argv[1:]\n                # in case of duplicate arguments the first one has precedence\n                # so we append rather than prepend.\n        namespace, remaining_args = self.parse_known_args(args=args)\n        outputs = []\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in vars(namespace).items() if k in keys}\n            for k in keys:\n                delattr(namespace, k)\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        if len(namespace.__dict__) > 0:\n            # additional namespace.\n            outputs.append(namespace)\n        if return_remaining_strings:\n            return (*outputs, remaining_args)\n        else:\n            if remaining_args:\n                raise ValueError(f\"Some specified arguments are not used by the HfArgumentParser: {remaining_args}\")\n\n            return (*outputs,)\n\n    def parse_json_file(self, json_file: str) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Alternative helper method that does not use `argparse` at all,\n        instead loading a json file and populating the dataclass types.\n        \"\"\"\n        data = json.loads(Path(json_file).read_text())\n        outputs = 
[]\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in data.items() if k in keys}\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        return (*outputs,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modelcard.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    TF2_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModelCard:\n    r\"\"\" Structured Model Card class.\n        Store model card as well as methods for loading/downloading/saving model cards.\n\n        Please read the following paper for details and explanation on the sections:\n            \"Model Cards for Model Reporting\"\n                by Margaret Mitchell, Simone Wu,\n                Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer,\n                Inioluwa Deborah Raji and Timnit Gebru for the proposal behind model cards.\n            Link: https://arxiv.org/abs/1810.03993\n\n        Note:\n            A model card can be loaded and saved to disk.\n\n        Parameters:\n    \"\"\"\n\n    def __init__(self, **kwargs):\n        # Recomended attributes from https://arxiv.org/abs/1810.03993 (see papers)\n        self.model_details = kwargs.pop(\"model_details\", {})\n        self.intended_use = kwargs.pop(\"intended_use\", {})\n        self.factors = kwargs.pop(\"factors\", {})\n        self.metrics = kwargs.pop(\"metrics\", {})\n        self.evaluation_data = kwargs.pop(\"evaluation_data\", {})\n        self.training_data = kwargs.pop(\"training_data\", {})\n        self.quantitative_analyses = kwargs.pop(\"quantitative_analyses\", {})\n        self.ethical_considerations = kwargs.pop(\"ethical_considerations\", {})\n        self.caveats_and_recommendations = kwargs.pop(\"caveats_and_recommendations\", {})\n\n        # Open additional attributes\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    def save_pretrained(self, save_directory_or_file):\n        \"\"\" Save a model card object to the directory or file `save_directory_or_file`.\n        \"\"\"\n        if os.path.isdir(save_directory_or_file):\n            # If we save using the predefined names, we can load using `from_pretrained`\n            output_model_card_file = os.path.join(save_directory_or_file, MODEL_CARD_NAME)\n        else:\n            output_model_card_file = save_directory_or_file\n\n        self.to_json_file(output_model_card_file)\n        logger.info(\"Model card saved in {}\".format(output_model_card_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiate a :class:`~transformers1.ModelCard` from a pre-trained 
model model card.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model card to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model card that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing a model card file saved using the :func:`~transformers1.ModelCard.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - a path or url to a saved model card JSON `file`, e.g.: ``./my_model_directory/modelcard.json``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                card should be cached if the standard cache should not be used.\n\n            kwargs: (`optional`) dict: key/value pairs with which to update the ModelCard object after loading.\n\n                - The values in kwargs of any keys which are model card attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* model card attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            find_from_standard_name: (`optional`) boolean, default True:\n                If the pretrained_model_name_or_path ends with our standard model or config filenames, replace them with our standard modelcard filename.\n                Can be used to directly feed a model/config url and access the colocated modelcard.\n\n            return_unused_kwargs: (`optional`) bool:\n\n                - If False, then this function returns just the final model card object.\n                - If True, then this functions returns a tuple `(model card, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not model card attributes: ie the part of kwargs which has not been used to update `ModelCard` and is otherwise ignored.\n\n        Examples::\n\n            modelcard = ModelCard.from_pretrained('bert-base-uncased')    # Download model card from S3 and cache.\n            modelcard = ModelCard.from_pretrained('./test/saved_model/')  # E.g. model card was saved using `save_pretrained('./test/saved_model/')`\n            modelcard = ModelCard.from_pretrained('./test/saved_model/modelcard.json')\n            modelcard = ModelCard.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        proxies = kwargs.pop(\"proxies\", None)\n        find_from_standard_name = kwargs.pop(\"find_from_standard_name\", True)\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        if pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            # For simplicity we use the same pretrained url than the configuration files\n            # but with a different suffix (modelcard.json). 
This suffix is replaced below.\n            model_card_file = ALL_PRETRAINED_CONFIG_ARCHIVE_MAP[pretrained_model_name_or_path]\n        elif os.path.isdir(pretrained_model_name_or_path):\n            model_card_file = os.path.join(pretrained_model_name_or_path, MODEL_CARD_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            model_card_file = pretrained_model_name_or_path\n        else:\n            model_card_file = hf_bucket_url(pretrained_model_name_or_path, filename=MODEL_CARD_NAME, use_cdn=False)\n\n        if find_from_standard_name or pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            model_card_file = model_card_file.replace(CONFIG_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(WEIGHTS_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(TF2_WEIGHTS_NAME, MODEL_CARD_NAME)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_model_card_file = cached_path(\n                model_card_file, cache_dir=cache_dir, force_download=True, proxies=proxies, resume_download=False\n            )\n            if resolved_model_card_file is None:\n                raise EnvironmentError\n            if resolved_model_card_file == model_card_file:\n                logger.info(\"loading model card file {}\".format(model_card_file))\n            else:\n                logger.info(\n                    \"loading model card file {} from cache at {}\".format(model_card_file, resolved_model_card_file)\n                )\n            # Load model card\n            modelcard = cls.from_json_file(resolved_model_card_file)\n\n        except (EnvironmentError, json.JSONDecodeError):\n            # We fall back on creating an empty model card\n            modelcard = cls()\n\n        # Update model card with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(modelcard, key):\n                setattr(modelcard, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model card: %s\", str(modelcard))\n        if return_unused_kwargs:\n            return modelcard, kwargs\n        else:\n            return modelcard\n\n    @classmethod\n    def from_dict(cls, json_object):\n        \"\"\"Constructs a `ModelCard` from a Python dictionary of parameters.\"\"\"\n        return cls(**json_object)\n\n    @classmethod\n    def from_json_file(cls, json_file):\n        \"\"\"Constructs a `ModelCard` from a json file of parameters.\"\"\"\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        dict_obj = json.loads(text)\n        return cls(**dict_obj)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return str(self.to_json_string())\n\n    def to_dict(self):\n        \"\"\"Serializes this instance to a Python dictionary.\"\"\"\n        output = copy.deepcopy(self.__dict__)\n        return output\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path):\n        \"\"\" Save this instance to a json file.\"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            
writer.write(self.to_json_string())\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch ALBERT model. \"\"\"\n\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import ACT2FN, BertEmbeddings, BertSelfAttention, prune_linear_layer\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\ndef load_tf_weights_in_albert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        print(name)\n\n    for name, array in zip(names, arrays):\n        original_name = name\n\n        # If saved from the TF HUB module\n        name = name.replace(\"module/\", \"\")\n\n        # Renaming and simplifying\n        name = name.replace(\"ffn_1\", \"ffn\")\n        name = name.replace(\"bert/\", \"albert/\")\n        name = name.replace(\"attention_1\", \"attention\")\n        name = name.replace(\"transform/\", \"\")\n        name = name.replace(\"LayerNorm_1\", \"full_layer_layer_norm\")\n        name = name.replace(\"LayerNorm\", \"attention/LayerNorm\")\n        name = name.replace(\"transformer/\", \"\")\n\n        # The feed forward layer had an 'intermediate' step which has been abstracted away\n        name = name.replace(\"intermediate/dense/\", \"\")\n        name = name.replace(\"ffn/intermediate/output/dense/\", \"ffn_output/\")\n\n        # ALBERT attention was split between self and output which have been abstracted away\n        name = name.replace(\"/output/\", \"/\")\n        name = name.replace(\"/self/\", \"/\")\n\n        # The pooler is a linear layer\n        name = name.replace(\"pooler/dense\", \"pooler\")\n\n        # The classifier was simplified to predictions from cls/predictions\n        name = name.replace(\"cls/predictions\", \"predictions\")\n        name = name.replace(\"predictions/attention\", \"predictions\")\n\n        # Naming was changed to be more explicit\n        name = name.replace(\"embeddings/attention\", \"embeddings\")\n        name = name.replace(\"inner_group_\", \"albert_layers/\")\n        name = name.replace(\"group_\", \"albert_layer_groups/\")\n\n        # Classifier\n        if len(name.split(\"/\")) == 1 and (\"output_bias\" in name or \"output_weights\" in name):\n            name = \"classifier/\" + name\n\n        # No ALBERT model currently handles the next sentence prediction task\n        if \"seq_relationship\" in name:\n            name = name.replace(\"seq_relationship/output_\", \"sop_classifier/classifier/\")\n            name = name.replace(\"weights\", \"weight\")\n\n        name = name.split(\"/\")\n\n        # Ignore the gradients applied by the LAMB/ADAM optimizers.\n        if (\n            \"adam_m\" in name\n            or \"adam_v\" in name\n            or \"AdamWeightDecayOptimizer\" in name\n            or \"AdamWeightDecayOptimizer_1\" in name\n            or \"global_step\" in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            
elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        print(\"Initialize PyTorch weight {} from {}\".format(name, original_name))\n        pointer.data = torch.from_numpy(array)\n\n    return model\n\n\nclass AlbertEmbeddings(BertEmbeddings):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n        self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass AlbertAttention(BertSelfAttention):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.output_attentions = config.output_attentions\n        self.num_attention_heads = config.num_attention_heads\n        self.hidden_size = config.hidden_size\n        self.attention_head_size = config.hidden_size // config.num_attention_heads\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.num_attention_heads, self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.query = prune_linear_layer(self.query, index)\n        self.key = prune_linear_layer(self.key, index)\n        self.value = prune_linear_layer(self.value, index)\n        self.dense = prune_linear_layer(self.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.num_attention_heads = self.num_attention_heads - len(heads)\n        self.all_head_size = self.attention_head_size * 
self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input_ids, attention_mask=None, head_mask=None):\n        mixed_query_layer = self.query(input_ids)\n        mixed_key_layer = self.key(input_ids)\n        mixed_value_layer = self.value(input_ids)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n\n        # Should find a better way to do this\n        w = (\n            self.dense.weight.t()\n            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)\n            .to(context_layer.dtype)\n        )\n        b = self.dense.bias.to(context_layer.dtype)\n\n        projected_context_layer = torch.einsum(\"bfnd,ndh->bfh\", context_layer, w) + b\n        projected_context_layer_dropout = self.dropout(projected_context_layer)\n        layernormed_context_layer = self.LayerNorm(input_ids + projected_context_layer_dropout)\n        return (layernormed_context_layer, attention_probs) if self.output_attentions else (layernormed_context_layer,)\n\n\nclass AlbertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.full_layer_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.attention = AlbertAttention(config)\n        self.ffn = nn.Linear(config.hidden_size, config.intermediate_size)\n        self.ffn_output = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        attention_output = self.attention(hidden_states, attention_mask, head_mask)\n        ffn_output = self.ffn(attention_output[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_output[0])\n\n        return (hidden_states,) + attention_output[1:]  # add attentions if we output them\n\n\nclass AlbertLayerGroup(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = 
nn.ModuleList([AlbertLayer(config) for _ in range(config.inner_group_num)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer(hidden_states, attention_mask, head_mask[layer_index])\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        return outputs  # last-layer hidden state, (layer hidden states), (layer attentions)\n\n\nclass AlbertTransformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = nn.Linear(config.embedding_size, config.hidden_size)\n        self.albert_layer_groups = nn.ModuleList([AlbertLayerGroup(config) for _ in range(config.num_hidden_groups)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                hidden_states,\n                attention_mask,\n                head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass AlbertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare ALBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertModel(AlbertPreTrainedModel):\n\n    config_class = AlbertConfig\n    load_tf_weights = load_tf_weights_in_albert\n    base_model_prefix = \"albert\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.config = config\n        self.embeddings = AlbertEmbeddings(config)\n        self.encoder = AlbertTransformer(config)\n        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)\n        self.pooler_activation = nn.Tanh()\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.embeddings.word_embeddings\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.embeddings.word_embeddings = new_embeddings\n        return self.embeddings.word_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            ALBERT has a different architecture in that its layers are shared across groups, which then has inner groups.\n            If an ALBERT model has 12 hidden layers and 2 hidden groups, with two inner groups, there\n            is a total of 4 different layers.\n\n            These layers are flattened: the indices [0,1] correspond to the two inner groups of the first hidden layer,\n            while [2,3] correspond to the two inner groups of the second hidden layer.\n\n            Any layer with in index other than [0,1,2,3] will result in an error.\n            See base class PreTrainedModel for more information about head pruning\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            group_idx = int(layer / self.config.inner_group_num)\n            inner_group_idx = int(layer - group_idx * self.config.inner_group_num)\n            self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` 
comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertModel, AlbertTokenizer\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertModel.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids, position_ids=position_ids, 
token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(embedding_output, extended_attention_mask, head_mask=head_mask)\n\n        sequence_output = encoder_outputs[0]\n\n        pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))\n\n        outputs = (sequence_output, pooled_output) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `sentence order prediction (classification)` head. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForPreTraining(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n        self.sop_classifier = AlbertSOPHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        sentence_order_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates original order (sequence A, then sequence B),\n            ``1`` indicates switched order (sequence B, then sequence A).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForPreTraining\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForPreTraining.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, sop_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output)\n\n        outputs = (prediction_scores, sop_scores,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and sentence_order_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            sentence_order_loss = loss_fct(sop_scores.view(-1, 2), sentence_order_label.view(-1))\n            total_loss = masked_lm_loss + sentence_order_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, 
sop_scores, (hidden_states), (attentions)\n\n\nclass AlbertMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = nn.LayerNorm(config.embedding_size)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n        self.decoder = nn.Linear(config.embedding_size, config.vocab_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n\n        prediction_scores = hidden_states\n\n        return prediction_scores\n\n\nclass AlbertSOPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, pooled_output):\n        dropout_pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\n    \"Albert Model with a `language modeling` head on top.\", ALBERT_START_DOCSTRING,\n)\nclass AlbertForMaskedLM(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with\n            labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the 
embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertTokenizer, AlbertForMaskedLM\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_outputs = outputs[0]\n\n        prediction_scores = self.predictions(sequence_outputs)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForSequenceClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification (or regression if config.num_labels==1) loss.\n        logits ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import AlbertTokenizer, AlbertForSequenceClassification\n            import torch\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n            labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, labels=labels)\n            loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = 
self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForTokenClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForTokenClassification\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForTokenClassification.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForQuestionAnswering(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        
start_scores ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-start scores (before SoftMax).\n        end_scores: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import AlbertTokenizer, AlbertForQuestionAnswering\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_dict = tokenizer.encode_plus(question, text, return_tensors='pt')\n        start_scores, end_scores = model(**input_dict)\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    EncoderDecoderConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_albert import (\n    AlbertForMaskedLM,\n    AlbertForPreTraining,\n    AlbertForQuestionAnswering,\n    AlbertForSequenceClassification,\n    AlbertForTokenClassification,\n    AlbertModel,\n)\nfrom .modeling_bart import BartForConditionalGeneration, BartForSequenceClassification, BartModel\nfrom .modeling_bert import (\n    BertForMaskedLM,\n    BertForMultipleChoice,\n    BertForPreTraining,\n    BertForQuestionAnswering,\n    BertForSequenceClassification,\n    BertForTokenClassification,\n    BertModel,\n)\nfrom .modeling_camembert import (\n    CamembertForMaskedLM,\n    CamembertForMultipleChoice,\n    CamembertForSequenceClassification,\n    CamembertForTokenClassification,\n    CamembertModel,\n)\nfrom .modeling_ctrl import CTRLLMHeadModel, CTRLModel\nfrom .modeling_distilbert import (\n    DistilBertForMaskedLM,\n    DistilBertForQuestionAnswering,\n    DistilBertForSequenceClassification,\n    DistilBertForTokenClassification,\n    DistilBertModel,\n)\nfrom .modeling_electra import (\n    ElectraForMaskedLM,\n    ElectraForPreTraining,\n    ElectraForSequenceClassification,\n    ElectraForTokenClassification,\n    ElectraModel,\n)\nfrom .modeling_encoder_decoder import EncoderDecoderModel\nfrom .modeling_flaubert import (\n    FlaubertForQuestionAnsweringSimple,\n    FlaubertForSequenceClassification,\n    FlaubertModel,\n    FlaubertWithLMHeadModel,\n)\nfrom .modeling_gpt2 import GPT2LMHeadModel, GPT2Model\nfrom .modeling_longformer import (\n    LongformerForMaskedLM,\n    LongformerForMultipleChoice,\n    LongformerForQuestionAnswering,\n    LongformerForSequenceClassification,\n    LongformerForTokenClassification,\n    LongformerModel,\n)\nfrom .modeling_marian import MarianMTModel\nfrom .modeling_openai import OpenAIGPTLMHeadModel, OpenAIGPTModel\nfrom .modeling_reformer import ReformerModel, ReformerModelWithLMHead\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\nfrom .modeling_t5 import T5ForConditionalGeneration, T5Model\nfrom .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel\nfrom .modeling_xlm import (\n    
XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMForTokenClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n)\nfrom .modeling_xlm_roberta import (\n    XLMRobertaForMaskedLM,\n    XLMRobertaForMultipleChoice,\n    XLMRobertaForSequenceClassification,\n    XLMRobertaForTokenClassification,\n    XLMRobertaModel,\n)\nfrom .modeling_xlnet import (\n    XLNetForMultipleChoice,\n    XLNetForQuestionAnsweringSimple,\n    XLNetForSequenceClassification,\n    XLNetForTokenClassification,\n    XLNetLMHeadModel,\n    XLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nMODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, T5Model),\n        (DistilBertConfig, DistilBertModel),\n        (AlbertConfig, AlbertModel),\n        (CamembertConfig, CamembertModel),\n        (XLMRobertaConfig, XLMRobertaModel),\n        (BartConfig, BartModel),\n        (LongformerConfig, LongformerModel),\n        (RobertaConfig, RobertaModel),\n        (BertConfig, BertModel),\n        (OpenAIGPTConfig, OpenAIGPTModel),\n        (GPT2Config, GPT2Model),\n        (TransfoXLConfig, TransfoXLModel),\n        (XLNetConfig, XLNetModel),\n        (FlaubertConfig, FlaubertModel),\n        (XLMConfig, XLMModel),\n        (CTRLConfig, CTRLModel),\n        (ElectraConfig, ElectraModel),\n        (ReformerConfig, ReformerModel),\n    ]\n)\n\nMODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForPreTraining),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForPreTraining),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForPreTraining),\n    ]\n)\n\nMODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForMaskedLM),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (MarianConfig, MarianMTModel),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForMaskedLM),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForMaskedLM),\n        (EncoderDecoderConfig, EncoderDecoderModel),\n        (ReformerConfig, ReformerModelWithLMHead),\n    ]\n)\n\nMODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForSequenceClassification),\n        (AlbertConfig, AlbertForSequenceClassification),\n        (CamembertConfig, CamembertForSequenceClassification),\n        (XLMRobertaConfig, 
XLMRobertaForSequenceClassification),\n        (BartConfig, BartForSequenceClassification),\n        (LongformerConfig, LongformerForSequenceClassification),\n        (RobertaConfig, RobertaForSequenceClassification),\n        (BertConfig, BertForSequenceClassification),\n        (XLNetConfig, XLNetForSequenceClassification),\n        (FlaubertConfig, FlaubertForSequenceClassification),\n        (XLMConfig, XLMForSequenceClassification),\n        (ElectraConfig, ElectraForSequenceClassification),\n    ]\n)\n\nMODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForQuestionAnswering),\n        (AlbertConfig, AlbertForQuestionAnswering),\n        (LongformerConfig, LongformerForQuestionAnswering),\n        (RobertaConfig, RobertaForQuestionAnswering),\n        (BertConfig, BertForQuestionAnswering),\n        (XLNetConfig, XLNetForQuestionAnsweringSimple),\n        (FlaubertConfig, FlaubertForQuestionAnsweringSimple),\n        (XLMConfig, XLMForQuestionAnsweringSimple),\n    ]\n)\n\nMODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForTokenClassification),\n        (CamembertConfig, CamembertForTokenClassification),\n        (XLMConfig, XLMForTokenClassification),\n        (XLMRobertaConfig, XLMRobertaForTokenClassification),\n        (LongformerConfig, LongformerForTokenClassification),\n        (RobertaConfig, RobertaForTokenClassification),\n        (BertConfig, BertForTokenClassification),\n        (XLNetConfig, XLNetForTokenClassification),\n        (AlbertConfig, AlbertForTokenClassification),\n        (ElectraConfig, ElectraForTokenClassification),\n    ]\n)\n\n\nMODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [\n        (CamembertConfig, CamembertForMultipleChoice),\n        (XLMRobertaConfig, XLMRobertaForMultipleChoice),\n        (LongformerConfig, LongformerForMultipleChoice),\n        (RobertaConfig, RobertaForMultipleChoice),\n        (BertConfig, BertForMultipleChoice),\n        (XLNetConfig, XLNetForMultipleChoice),\n    ]\n)\n\n\nclass AutoModel:\n    r\"\"\"\n        :class:`~transformers1.AutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        or the `AutoModel.from_config(config)` class methods.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModel is designed to be instantiated \"\n            \"using the `AutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerModel` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaModel` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModel` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraModel` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Model` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertModel` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertModel` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaModel` (XLM-RoBERTa model)\n            - `longformer` :class:`~transformers1.LongformerModel` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaModel` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertModel` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n            - `flaubert`: :class:`~transformers1.FlaubertModel` (Flaubert  model)\n            - `electra`: :class:`~transformers1.ElectraModel` (Electra  model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForPreTraining:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForPreTraining is designed to be instantiated \"\n            \"using the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelWithLMHead:\n    r\"\"\"\n        :class:`~transformers1.AutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelWithLMHead is designed to be instantiated \"\n            \"using the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForMaskedLM` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForMaskedLM` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelWithLMHead.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForSequenceClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForSequenceClassification` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n            - `roberta`: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n            - `xlnet`: 
:class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n            - `flaubert`: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForSequenceClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForQuestionAnswering:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModelForQuestionAnswering` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForQuestionAnswering` (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n            - `bert`: :class:`~transformers1.BertForQuestionAnswering` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n            - `flaubert`: :class:`~transformers1.FlaubertForQuestionAnswering` (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a 
`directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForQuestionAnswering.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForTokenClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForTokenClassification` is a generic model class\n        that will be instantiated as one of the token classification model classes of the library\n        when created with the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertModelForTokenClassification` (DistilBERT model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaModelForTokenClassification` (XLMRoberta model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModelForTokenClassification` (Bert model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForTokenClassification` (AlBert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetModelForTokenClassification` (XLNet model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertModelForTokenClassification` (Camembert model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaModelForTokenClassification` (Roberta model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForTokenClassification` (DistilBERT model)\n            - `xlm`: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForTokenClassification` (XLM-RoBERTa?Para model)\n            - `camembert`: :class:`~transformers1.CamembertForTokenClassification` (Camembert model)\n            - `bert`: :class:`~transformers1.BertForTokenClassification` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForTokenClassification` (XLNet model)\n            - `roberta`: :class:`~transformers1.RobertaForTokenClassification` (Roberta model)\n      
      - `electra`: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n      
      model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: 
{}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BART model, ported from the fairseq repo.\"\"\"\nimport logging\nimport math\nimport random\nfrom typing import Dict, List, Optional, Tuple\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor, nn\n\nfrom .activations import ACT2FN\nfrom .configuration_bart import BartConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\n\nBART_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n    \"facebook/mbart-large-en-ro\",\n    # See all BART models at https://huggingface.co/models?filter=bart\n]\n\n\nBART_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matters related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BartConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\n\"\"\"\nBART_GENERATION_EXAMPLE = r\"\"\"\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForConditionalGeneration, BartConfig\n        # see ``examples/summarization/bart/evaluate_cnn.py`` for a longer example\n        model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')\n        tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')\n        ARTICLE_TO_SUMMARIZE = \"My friends are cool but they eat too many carbs.\"\n        inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')\n        # Generate Summary\n        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)\n        print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])\n\n\"\"\"\n\nBART_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n               Indices of input sequence tokens in the vocabulary. 
Use BartTokenizer.encode to produce them.\n            Padding will be ignored by default should you provide it.\n            Indices can be obtained using :class:`transformers1.BartTokenizer.encode(text)`.\n        attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices in input_ids.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            If you want to change padding behavior, you should read :func:`~transformers1.modeling_bart._prepare_decoder_inputs` and modify.\n            See diagram 1 in the paper for more info on the default strategy\n\"\"\"\n\n\ndef invert_mask(attention_mask):\n    assert attention_mask.dim() == 2\n    return attention_mask.eq(0)\n\n\ndef _prepare_bart_decoder_inputs(\n    config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32\n):\n    \"\"\"Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if\n    none are provided. This mimics the default behavior in fairseq. 
To override it pass in masks.\n    Note: this is not called during generation\n    \"\"\"\n    pad_token_id = config.pad_token_id\n    if decoder_input_ids is None:\n        decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)\n    bsz, tgt_len = decoder_input_ids.size()\n    if decoder_padding_mask is None:\n        decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)\n    else:\n        decoder_padding_mask = invert_mask(decoder_padding_mask)\n    causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(\n        dtype=causal_mask_dtype, device=decoder_input_ids.device\n    )\n    return decoder_input_ids, decoder_padding_mask, causal_mask\n\n\nclass PretrainedBartModel(PreTrainedModel):\n    config_class = BartConfig\n    base_model_prefix = \"model\"\n\n    def _init_weights(self, module):\n        std = self.config.init_std\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, SinusoidalPositionalEmbedding):\n            pass\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n\n    @property\n    def dummy_inputs(self):\n        pad_token = self.config.pad_token_id\n        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)\n        dummy_inputs = {\n            \"attention_mask\": input_ids.ne(pad_token),\n            \"input_ids\": input_ids,\n        }\n        return dummy_inputs\n\n\ndef _make_linear_from_emb(emb):\n    vocab_size, emb_size = emb.weight.shape\n    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)\n    lin_layer.weight.data = emb.weight.data\n    return lin_layer\n\n\n# Helper Functions, mostly for making masks\ndef _check_shapes(shape_1, shape2):\n    if shape_1 != shape2:\n        raise AssertionError(\"shape mismatch: {} != {}\".format(shape_1, shape2))\n\n\ndef shift_tokens_right(input_ids, pad_token_id):\n    \"\"\"Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).\"\"\"\n    prev_output_tokens = input_ids.clone()\n    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)\n    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()\n    prev_output_tokens[:, 1:] = input_ids[:, :-1]\n    return prev_output_tokens\n\n\ndef make_padding_mask(input_ids, padding_idx=1):\n    \"\"\"True for pad tokens\"\"\"\n    padding_mask = input_ids.eq(padding_idx)\n    if not padding_mask.any():\n        padding_mask = None\n    return padding_mask\n\n\n# Helper Modules\n\n\nclass EncoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.normalize_before = config.normalize_before\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)\n        
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(self, x, encoder_padding_mask):\n        \"\"\"\n        Args:\n            x (Tensor): input to the layer of shape `(seq_len, batch, embed_dim)`\n            encoder_padding_mask (ByteTensor): binary ByteTensor of shape\n                `(batch, src_len)` where padding elements are indicated by ``1``.\n            for t_tgt, t_src is excluded (or masked out), =0 means it is\n            included in attention\n\n        Returns:\n            encoded output of shape `(seq_len, batch, embed_dim)`\n        \"\"\"\n        residual = x\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        x, attn_weights = self.self_attn(\n            query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return x, attn_weights\n\n\nclass BartEncoder(nn.Module):\n    \"\"\"\n    Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer\n    is a :class:`EncoderLayer`.\n\n    Args:\n        config: BartConfig\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens):\n        super().__init__()\n\n        self.dropout = config.dropout\n        self.layerdrop = config.encoder_layerdrop\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        embed_dim = embed_tokens.embedding_dim\n        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_source_positions = config.max_position_embeddings\n\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx,\n            )\n        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.encoder_layers)])\n        self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity()\n        # mbart has one extra layer_norm\n        self.layer_norm = LayerNorm(config.d_model) if config.normalize_before else None\n\n    def forward(\n        self, input_ids, attention_mask=None,\n    ):\n        \"\"\"\n        Args:\n            input_ids (LongTensor): tokens in the source language of shape\n                `(batch, src_len)`\n            attention_mask (torch.LongTensor): indicating which indices are padding tokens.\n        Returns:\n            Tuple comprised of:\n                - **x** (Tensor): the last encoder layer's output of\n                  shape 
`(src_len, batch, embed_dim)`\n                - **encoder_states** (List[Tensor]): all intermediate\n                  hidden states of shape `(src_len, batch, embed_dim)`.\n                  Only populated if *self.output_hidden_states:* is True.\n                - **all_attentions** (List[Tensor]): Attention weights for each layer.\n                During training might not be of length n_layers because of layer dropout.\n        \"\"\"\n        # check attention mask and invert\n        if attention_mask is not None:\n            attention_mask = invert_mask(attention_mask)\n\n        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale\n        embed_pos = self.embed_positions(input_ids)\n        x = inputs_embeds + embed_pos\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # B x T x C -> T x B x C\n        x = x.transpose(0, 1)\n\n        encoder_states, all_attentions = [], []\n        for encoder_layer in self.layers:\n            if self.output_hidden_states:\n                encoder_states.append(x)\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):  # skip the layer\n                attn = None\n            else:\n                x, attn = encoder_layer(x, attention_mask)\n\n            if self.output_attentions:\n                all_attentions.append(attn)\n\n        if self.layer_norm:\n            x = self.layer_norm(x)\n        if self.output_hidden_states:\n            encoder_states.append(x)\n\n        # T x B x C -> B x T x C\n        encoder_states = [hidden_state.transpose(0, 1) for hidden_state in encoder_states]\n        x = x.transpose(0, 1)\n\n        return x, encoder_states, all_attentions\n\n\nclass DecoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.normalize_before = config.normalize_before\n\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.encoder_attn = SelfAttention(\n            self.embed_dim,\n            config.decoder_attention_heads,\n            dropout=config.attention_dropout,\n            encoder_decoder_attention=True,\n        )\n        self.encoder_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)\n        self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(\n        self,\n        x,\n        encoder_hidden_states,\n        encoder_attn_mask=None,\n        layer_state=None,\n        causal_mask=None,\n        decoder_padding_mask=None,\n    ):\n        residual = x\n\n        if layer_state is None:\n            layer_state = {}\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        # Self Attention\n\n        x, self_attn_weights = self.self_attn(\n            query=x,\n            key=x,\n            
layer_state=layer_state,  # adds keys to layer state\n            key_padding_mask=decoder_padding_mask,\n            attn_mask=causal_mask,\n            need_weights=self.output_attentions,\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        # Cross attention\n        residual = x\n        assert self.encoder_attn.cache_key != self.self_attn.cache_key\n        if self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n        x, _ = self.encoder_attn(\n            query=x,\n            key=encoder_hidden_states,\n            key_padding_mask=encoder_attn_mask,\n            layer_state=layer_state,  # mutates layer state\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n\n        # Fully Connected\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return (\n            x,\n            self_attn_weights,\n            layer_state,\n        )  # just self_attn weights for now, following t5, layer_state = cache for decoding\n\n\nclass BartDecoder(nn.Module):\n    \"\"\"\n    Transformer decoder consisting of *config.decoder_layers* layers. Each layer\n    is a :class:`DecoderLayer`.\n    Args:\n        config: BartConfig\n        embed_tokens (torch.nn.Embedding): output embedding\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens: nn.Embedding):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.dropout = config.dropout\n        self.layerdrop = config.decoder_layerdrop\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_target_positions = config.max_position_embeddings\n        self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, config.pad_token_id\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, self.padding_idx,\n            )\n        self.layers = nn.ModuleList(\n            [DecoderLayer(config) for _ in range(config.decoder_layers)]\n        )  # type: List[DecoderLayer]\n        self.layernorm_embedding = LayerNorm(config.d_model) if config.normalize_embedding else nn.Identity()\n        self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else None\n\n    def forward(\n        self,\n        input_ids,\n        encoder_hidden_states,\n        encoder_padding_mask,\n        decoder_padding_mask,\n        decoder_causal_mask,\n        decoder_cached_states=None,\n        use_cache=False,\n        **unused\n    ):\n        \"\"\"\n        Includes several features from \"Jointly Learning to 
Align and\n        Translate with Transformer Models\" (Garg et al., EMNLP 2019).\n\n        Args:\n            input_ids (LongTensor): previous decoder outputs of shape\n                `(batch, tgt_len)`, for teacher forcing\n            encoder_hidden_states: output from the encoder, used for\n                encoder-side attention\n            encoder_padding_mask: for ignoring pad tokens\n            decoder_cached_states (dict or None): dictionary used for storing state during generation\n\n        Returns:\n            tuple:\n                - the decoder's features of shape `(batch, tgt_len, embed_dim)`\n                - hidden states\n                - attentions\n        \"\"\"\n        # check attention mask and invert\n        if encoder_padding_mask is not None:\n            encoder_padding_mask = invert_mask(encoder_padding_mask)\n\n        # embed positions\n        positions = self.embed_positions(input_ids, use_cache=use_cache)\n\n        if use_cache:\n            input_ids = input_ids[:, -1:]\n            positions = positions[:, -1:]  # happens after we embed them\n            # assert input_ids.ne(self.padding_idx).any()\n\n        x = self.embed_tokens(input_ids) * self.embed_scale\n        x += positions\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # Convert to Bart output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        # decoder layers\n        all_hidden_states = ()\n        all_self_attns = ()\n        next_decoder_cache = []\n        for idx, decoder_layer in enumerate(self.layers):\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            if self.output_hidden_states:\n                all_hidden_states += (x,)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            layer_state = decoder_cached_states[idx] if decoder_cached_states is not None else None\n\n            x, layer_self_attn, layer_past = decoder_layer(\n                x,\n                encoder_hidden_states,\n                encoder_attn_mask=encoder_padding_mask,\n                decoder_padding_mask=decoder_padding_mask,\n                layer_state=layer_state,\n                causal_mask=decoder_causal_mask,\n            )\n\n            if use_cache:\n                next_decoder_cache.append(layer_past.copy())\n\n            if self.layer_norm and (idx == len(self.layers) - 1):  # last layer of mbart\n                x = self.layer_norm(x)\n            if self.output_attentions:\n                all_self_attns += (layer_self_attn,)\n\n        # Convert to standard output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        all_hidden_states = [hidden_state.transpose(0, 1) for hidden_state in all_hidden_states]\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        if use_cache:\n            next_cache = ((encoder_hidden_states, encoder_padding_mask), next_decoder_cache)\n        else:\n            next_cache = None\n        return x, next_cache, all_hidden_states, list(all_self_attns)\n\n\ndef _reorder_buffer(attn_cache, new_order):\n    for k, input_buffer_k in attn_cache.items():\n        if input_buffer_k is not None:\n            attn_cache[k] = 
input_buffer_k.index_select(0, new_order)\n    return attn_cache\n\n\nclass SelfAttention(nn.Module):\n    \"\"\"Multi-headed attention from 'Attention Is All You Need' paper\"\"\"\n\n    def __init__(\n        self,\n        embed_dim,\n        num_heads,\n        dropout=0.0,\n        bias=True,\n        encoder_decoder_attention=False,  # otherwise self_attention\n    ):\n        super().__init__()\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.dropout = dropout\n        self.head_dim = embed_dim // num_heads\n        assert self.head_dim * num_heads == self.embed_dim, \"embed_dim must be divisible by num_heads\"\n        self.scaling = self.head_dim ** -0.5\n\n        self.encoder_decoder_attention = encoder_decoder_attention\n        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.cache_key = \"encoder_decoder\" if self.encoder_decoder_attention else \"self\"\n\n    def _shape(self, tensor, dim_0, bsz):\n        return tensor.contiguous().view(dim_0, bsz * self.num_heads, self.head_dim).transpose(0, 1)\n\n    def forward(\n        self,\n        query,\n        key: Optional[Tensor],\n        key_padding_mask: Optional[Tensor] = None,\n        layer_state: Optional[Dict[str, Optional[Tensor]]] = None,\n        attn_mask: Optional[Tensor] = None,\n        need_weights=False,\n    ) -> Tuple[Tensor, Optional[Tensor]]:\n        \"\"\"Input shape: Time(SeqLen) x Batch x Channel\"\"\"\n        static_kv: bool = self.encoder_decoder_attention\n        tgt_len, bsz, embed_dim = query.size()\n        assert embed_dim == self.embed_dim\n        assert list(query.size()) == [tgt_len, bsz, embed_dim]\n        # get here for encoder decoder cause of static_kv\n        if layer_state is not None:  # reuse k,v and encoder_padding_mask\n            saved_state = layer_state.get(self.cache_key, {})\n            if \"prev_key\" in saved_state:\n                # previous time steps are cached - no need to recompute key and value if they are static\n                if static_kv:\n                    key = None\n        else:\n            saved_state = None\n            layer_state = {}\n\n        q = self.q_proj(query) * self.scaling\n        if static_kv:\n            if key is None:\n                k = v = None\n            else:\n                k = self.k_proj(key)\n                v = self.v_proj(key)\n        else:\n            k = self.k_proj(query)\n            v = self.v_proj(query)\n\n        q = self._shape(q, tgt_len, bsz)\n        if k is not None:\n            k = self._shape(k, -1, bsz)\n        if v is not None:\n            v = self._shape(v, -1, bsz)\n\n        if saved_state is not None:\n            k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)\n\n        # Update cache\n        layer_state[self.cache_key] = {\n            \"prev_key\": k.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_value\": v.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_key_padding_mask\": key_padding_mask if not static_kv else None,\n        }\n\n        assert k is not None\n        src_len = k.size(1)\n        attn_weights = torch.bmm(q, k.transpose(1, 2))\n        assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)\n\n        if attn_mask 
is not None:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_mask\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n\n        # This is part of a workaround to get around fork/join parallelism not supporting Optional types.\n        if key_padding_mask is not None and key_padding_mask.dim() == 0:\n            key_padding_mask = None\n        assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)\n\n        if key_padding_mask is not None:  # don't attend to padding symbols\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n            reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)\n            attn_weights = attn_weights.masked_fill(reshaped, float(\"-inf\"))\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n        attn_weights = F.softmax(attn_weights, dim=-1)\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)\n\n        assert v is not None\n        attn_output = torch.bmm(attn_probs, v)\n        assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)\n        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)\n        attn_output = self.out_proj(attn_output)\n        if need_weights:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n        else:\n            attn_weights = None\n        return attn_output, attn_weights\n\n    def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):\n        # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)\n        if \"prev_key\" in saved_state:\n            _prev_key = saved_state[\"prev_key\"]\n            assert _prev_key is not None\n            prev_key = _prev_key.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                k = prev_key\n            else:\n                assert k is not None\n                k = torch.cat([prev_key, k], dim=1)\n        if \"prev_value\" in saved_state:\n            _prev_value = saved_state[\"prev_value\"]\n            assert _prev_value is not None\n            prev_value = _prev_value.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                v = prev_value\n            else:\n                assert v is not None\n                v = torch.cat([prev_value, v], dim=1)\n        assert k is not None and v is not None\n        prev_key_padding_mask: Optional[Tensor] = saved_state.get(\"prev_key_padding_mask\", None)\n        key_padding_mask = self._cat_prev_key_padding_mask(\n            key_padding_mask, prev_key_padding_mask, bsz, k.size(1), static_kv\n        )\n        return k, v, key_padding_mask\n\n    @staticmethod\n    def _cat_prev_key_padding_mask(\n        key_padding_mask: Optional[Tensor],\n        prev_key_padding_mask: Optional[Tensor],\n        batch_size: int,\n        src_len: int,\n        static_kv: bool,\n    ) -> Optional[Tensor]:\n        # saved key padding masks have shape (bsz, seq_len)\n        if prev_key_padding_mask is not None:\n            if static_kv:\n                new_key_padding_mask = prev_key_padding_mask\n            else:\n                new_key_padding_mask = torch.cat([prev_key_padding_mask, key_padding_mask], dim=1)\n\n        elif key_padding_mask is not None:\n            filler = torch.zeros(\n                batch_size,\n                src_len - 
key_padding_mask.size(1),\n                dtype=key_padding_mask.dtype,\n                device=key_padding_mask.device,\n            )\n            new_key_padding_mask = torch.cat([filler, key_padding_mask], dim=1)\n        else:\n            new_key_padding_mask = prev_key_padding_mask\n        return new_key_padding_mask\n\n\nclass BartClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    # This can trivially be shared with RobertaClassificationHead\n\n    def __init__(\n        self, input_dim, inner_dim, num_classes, pooler_dropout,\n    ):\n        super().__init__()\n        self.dense = nn.Linear(input_dim, inner_dim)\n        self.dropout = nn.Dropout(p=pooler_dropout)\n        self.out_proj = nn.Linear(inner_dim, num_classes)\n\n    def forward(self, x):\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\nclass LearnedPositionalEmbedding(nn.Embedding):\n    \"\"\"\n    This module learns positional embeddings up to a fixed maximum size.\n    Padding ids are ignored by either offsetting based on padding_idx\n    or by setting padding_idx to None and ensuring that the appropriate\n    position ids are passed to the forward function.\n    \"\"\"\n\n    def __init__(\n        self, num_embeddings: int, embedding_dim: int, padding_idx: int,\n    ):\n        # if padding_idx is specified then offset the embedding ids by\n        # this index and adjust num_embeddings appropriately\n        assert padding_idx is not None\n        num_embeddings += padding_idx + 1  # WHY?\n        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)\n\n    def forward(self, input, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        if use_cache:  # the position is our current step in the decoded sequence\n            pos = int(self.padding_idx + input.size(1))\n            positions = input.data.new(1, 1).fill_(pos)\n        else:\n            positions = create_position_ids_from_input_ids(input, self.padding_idx)\n        return super().forward(positions)\n\n\ndef LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True):\n    if torch.cuda.is_available():\n        try:\n            from apex.normalization import FusedLayerNorm\n\n            return FusedLayerNorm(normalized_shape, eps, elementwise_affine)\n        except ImportError:\n            pass\n    return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)\n\n\ndef fill_with_neg_inf(t):\n    \"\"\"FP16-compatible function that fills a input_ids with -inf.\"\"\"\n    return t.float().fill_(float(\"-inf\")).type_as(t)\n\n\ndef _filter_out_falsey_values(tup) -> Tuple:\n    \"\"\"Remove entries that are None or [] from an iterable.\"\"\"\n    return tuple(x for x in tup if isinstance(x, torch.Tensor) or x)\n\n\n# Public API\ndef _get_shape(t):\n    return getattr(t, \"shape\", None)\n\n\n@add_start_docstrings(\n    \"The bare BART Model outputting raw hidden-states without any specific head on top.\", BART_START_DOCSTRING,\n)\nclass BartModel(PretrainedBartModel):\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        padding_idx, vocab_size = config.pad_token_id, config.vocab_size\n        self.shared = nn.Embedding(vocab_size, config.d_model, 
padding_idx)\n\n        self.encoder = BartEncoder(config, self.shared)\n        self.decoder = BartDecoder(config, self.shared)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        decoder_input_ids=None,\n        encoder_outputs: Optional[Tuple] = None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        use_cache=False,\n    ):\n\n        # make masks if user doesn't supply\n        if not use_cache:\n            decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(\n                self.config,\n                input_ids,\n                decoder_input_ids=decoder_input_ids,\n                decoder_padding_mask=decoder_attention_mask,\n                causal_mask_dtype=self.shared.weight.dtype,\n            )\n        else:\n            decoder_padding_mask, causal_mask = None, None\n\n        assert decoder_input_ids is not None\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)\n        assert isinstance(encoder_outputs, tuple)\n        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            encoder_outputs[0],\n            attention_mask,\n            decoder_padding_mask,\n            decoder_causal_mask=causal_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        # Attention and hidden_states will be [] or None if they aren't needed\n        decoder_outputs: Tuple = _filter_out_falsey_values(decoder_outputs)\n        assert isinstance(decoder_outputs[0], torch.Tensor)\n        encoder_outputs: Tuple = _filter_out_falsey_values(encoder_outputs)\n        return decoder_outputs + encoder_outputs\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, value):\n        self.shared = value\n        self.encoder.embed_tokens = self.shared\n        self.decoder.embed_tokens = self.shared\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"The BART Model with a language modeling head. 
Can be used for summarization.\",\n    BART_START_DOCSTRING + BART_GENERATION_EXAMPLE,\n)\nclass BartForConditionalGeneration(PretrainedBartModel):\n    base_model_prefix = \"model\"\n\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        base_model = BartModel(config)\n        self.model = base_model\n        self.register_buffer(\"final_logits_bias\", torch.zeros((1, self.model.shared.num_embeddings)))\n\n    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:\n        old_num_tokens = self.model.shared.num_embeddings\n        new_embeddings = super().resize_token_embeddings(new_num_tokens)\n        self.model.shared = new_embeddings\n        self._resize_final_logits_bias(new_num_tokens, old_num_tokens)\n        return new_embeddings\n\n    def _resize_final_logits_bias(self, new_num_tokens: int, old_num_tokens: int) -> None:\n        if new_num_tokens <= old_num_tokens:\n            new_bias = self.final_logits_bias[:, :new_num_tokens]\n        else:\n            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)\n            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)\n        self.register_buffer(\"final_logits_bias\", new_bias)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        lm_labels=None,\n        use_cache=False,\n        **unused\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring).\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens\n            with labels\n            in ``[0, ..., config.vocab_size]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            
heads.\n\n    Examples::\n\n            # Mask filling only works for bart-large\n            from transformers1 import BartTokenizer, BartForConditionalGeneration\n            tokenizer = BartTokenizer.from_pretrained('bart-large')\n            TXT = \"My friends are <mask> but they eat too many carbs.\"\n            model = BartForConditionalGeneration.from_pretrained('bart-large')\n            input_ids = tokenizer.batch_encode_plus([TXT], return_tensors='pt')['input_ids']\n            logits = model(input_ids)[0]\n            masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()\n            probs = logits[0, masked_index].softmax(dim=0)\n            values, predictions = probs.topk(5)\n            tokenizer.decode(predictions).split()\n            # ['good', 'great', 'all', 'really', 'very']\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            encoder_outputs=encoder_outputs,\n            decoder_attention_mask=decoder_attention_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        lm_logits = F.linear(outputs[0], self.model.shared.weight, bias=self.final_logits_bias)\n        outputs = (lm_logits,) + outputs[1:]  # Add cache, hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # TODO(SS): do we need to ignore pad tokens in lm_labels?\n            masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n    def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step, decoder_cached_states are empty\n        if not past[1]:\n            encoder_outputs, decoder_cached_states = past, None\n        else:\n            encoder_outputs, decoder_cached_states = past\n        return {\n            \"input_ids\": None,  # encoder_outputs is defined. 
input_ids not needed\n            \"encoder_outputs\": encoder_outputs,\n            \"decoder_cached_states\": decoder_cached_states,\n            \"decoder_input_ids\": decoder_input_ids,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,  # change this to avoid caching (presumably for debugging)\n        }\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        if cur_len == 1:\n            self._force_token_ids_generation(logits, self.config.bos_token_id)\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n\n    def _force_token_ids_generation(self, scores, token_ids) -> None:\n        \"\"\"force one of token_ids to be generated by setting prob of all other tokens to 0\"\"\"\n        if isinstance(token_ids, int):\n            token_ids = [token_ids]\n        all_but_token_ids_mask = torch.tensor(\n            [x for x in range(self.config.vocab_size) if x not in token_ids],\n            dtype=torch.long,\n            device=next(self.parameters()).device,\n        )\n        assert len(scores.shape) == 2, \"scores should be of rank 2 with shape: [batch_size, vocab_size]\"\n        scores[:, all_but_token_ids_mask] = -float(\"inf\")\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        ((enc_out, enc_mask), decoder_cached_states) = past\n        reordered_past = []\n        for layer_past in decoder_cached_states:\n            # get the correct batch idx from decoder layer's batch dim for cross and self-attn\n            layer_past_new = {\n                attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items()\n            }\n            reordered_past.append(layer_past_new)\n\n        new_enc_out = enc_out if enc_out is None else enc_out.index_select(0, beam_idx)\n        new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select(0, beam_idx)\n\n        past = ((new_enc_out, new_enc_mask), reordered_past)\n        return past\n\n    def get_encoder(self):\n        return self.model.encoder\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.model.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"\"\"Bart model with a sequence classification/head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BART_START_DOCSTRING,\n)\nclass BartForSequenceClassification(PretrainedBartModel):\n    def __init__(self, config: BartConfig, **kwargs):\n        super().__init__(config, **kwargs)\n        self.model = BartModel(config)\n        self.classification_head = BartClassificationHead(\n            config.d_model, config.d_model, config.num_labels, config.classif_dropout,\n        )\n        self.model._init_weights(self.classification_head.dense)\n        self.model._init_weights(self.classification_head.out_proj)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BartConfig`) and inputs:\n            loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n                Classification loss (cross entropy)\n            logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n                Attentions weights after the attention softmax, used to compute the weighted average in the\n                self-attention\n                heads.\n\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForSequenceClassification\n        import torch\n\n        tokenizer = BartTokenizer.from_pretrained('bart-large')\n        model = BartForSequenceClassification.from_pretrained('bart-large')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\",\n        add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            decoder_attention_mask=decoder_attention_mask,\n            encoder_outputs=encoder_outputs,\n        )\n        x = outputs[0]  # last hidden state\n        eos_mask = input_ids.eq(self.config.eos_token_id)\n        if 
len(torch.unique(eos_mask.sum(1))) > 1:\n            raise ValueError(\"All examples must have the same number of <eos> tokens.\")\n        sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]\n        logits = self.classification_head(sentence_representation)\n        # Prepend logits\n        outputs = (logits,) + outputs[1:]  # Add hidden states and attention if they are here\n        if labels is not None:  # prepend loss to output,\n            loss = F.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\nclass SinusoidalPositionalEmbedding(nn.Embedding):\n    \"\"\"This module produces sinusoidal positional embeddings of any length.\"\"\"\n\n    def __init__(self, num_positions, embedding_dim, padding_idx=None):\n        super().__init__(num_positions, embedding_dim)\n        if embedding_dim % 2 != 0:\n            raise NotImplementedError(f\"odd embedding_dim {embedding_dim} not supported\")\n        self.weight = self._init_weight(self.weight)\n\n    @staticmethod\n    def _init_weight(out: nn.Parameter):\n        \"\"\"Identical to the XLM create_sinusoidal_embeddings except features are not interleaved.\n            The cos features are in the 2nd half of the vector. [dim // 2:]\n        \"\"\"\n        n_pos, dim = out.shape\n        position_enc = np.array(\n            [[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)]\n        )\n        out[:, 0 : dim // 2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # This line breaks for odd n_pos\n        out[:, dim // 2 :] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n        out.detach_()\n        out.requires_grad = False\n        return out\n\n    @torch.no_grad()\n    def forward(self, input_ids, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        bsz, seq_len = input_ids.shape[:2]\n        if use_cache:\n            positions = input_ids.data.new(1, 1).fill_(seq_len - 1)  # called before slicing\n        else:\n            # starts at 0, ends at 1-seq_len\n            positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)\n        return super().forward(positions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_beam_search.py",
    "content": "# coding=utf-8\n# Copyright (c) 2019 Yang Liu\n\n# Permission is hereby granted, free of charge, to any person obtaining a copy\n# of this software and associated documentation files (the \"Software\"), to deal\n# in the Software without restriction, including without limitation the rights\n# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n# copies of the Software, and to permit persons to whom the Software is\n# furnished to do so, subject to the following conditions:\n\n# The above copyright notice and this permission notice shall be included in all\n# copies or substantial portions of the Software.\n\n# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n# SOFTWARE.\n\"\"\"\nA general wrapper around models with LM heads to generate sequences\nusing beam search.\n\"\"\"\nimport torch\nfrom torch import nn\n\n\nclass TransformerBeamSearch(nn.Module):\n    def __init__(\n        self,\n        model,\n        tokenizer,\n        batch_size,\n        beam_size,\n        min_length,\n        max_length,\n        alpha=0,\n        block_repeating_trigram=True,\n    ):\n        \"\"\"\n        Attributes:\n            mask_word_id: token id that corresponds to the mask\n        \"\"\"\n        super(TransformerBeamSearch, self).__init__()\n        self.model = model\n        self.tokenizer = tokenizer\n\n        self.start_token_id = tokenizer.start_token_id\n        self.end_token_id = tokenizer.end_token_id\n        self.pad_token_id = tokenizer.pad_token_id\n\n        self.batch_size = batch_size\n        self.beam_size = beam_size\n        self.min_length = min_length\n        self.max_length = max_length\n\n        self.block_repeating_trigram = block_repeating_trigram\n        self.apply_length_penalty = False if alpha == 0 else True\n        self.alpha = alpha\n\n        # State of the beam\n        self.hypotheses = [[] for _ in range(batch_size)]\n        self.batch_offset = torch.arange(batch_size, dtype=torch.long)\n        self.beam_offset = torch.arange(\n            0, batch_size * self.beam_size, step=self.beam_size, dtype=torch.long\n        )\n        self.growing_beam = torch.full(\n            (batch_size * self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        self.topk_log_probabilities = torch.tensor(\n            [0.0] + [float(\"-inf\")] * (self.beam_size - 1), dtype=torch.float\n        ).repeat(batch_size)\n        self.results = {\n            \"predictions\": [[] for _ in range(batch_size)],\n            \"scores\": [[] for _ in range(batch_size)],\n        }\n        self._step = 0\n        self.is_done = False\n\n    def step(self, log_probabilities):\n        \"\"\" Grows the beam by one step. 
\"\"\"\n        self._step += 1\n\n        # The batch size changes as some beams finish so we define _B\n        vocab_size = log_probabilities.size(-1)\n        _B = log_probabilities.size(0) // self.beam_size\n\n        # Multiply each beam probability with the probability of the\n        # next token (conditioned on the words in the beam).\n        log_probabilities += self.topk_log_probabilities.view(-1, 1)\n\n        self.enforce_min_length(log_probabilities)\n        if self.block_repeating_trigram:\n            self.remove_repeating_trigrams(log_probabilities, _B)\n\n        # Find the `beam_size` (previous_beam + token) combinations with\n        # the highest score\n        topk_log_probabilities, topk_ids = log_probabilities.view(\n            _B, self.beam_size * vocab_size\n        ).topk(self.beam_size, dim=1)\n        # Keep the accumulated scores of the surviving beams for the next step.\n        self.topk_log_probabilities = topk_log_probabilities\n\n        # Apply the length penalty. The +1 accounts for the [EOS] token\n        # that will be added if the beam ends.\n        topk_scores = topk_log_probabilities / self.length_penalty()\n\n        # Retrieve the corresponding respective beam and token id\n        # topk_token_ids[i] will be added to topk_beam_ids[i]\n        topk_beam_ids = topk_ids // vocab_size\n        topk_token_ids = topk_ids.fmod(vocab_size)\n\n        # Retrieve the row index of the surviving beams in the original\n        # view of the log_probabilities tensor\n        surviving_beams_rows = (topk_beam_ids + self.beam_offset[:_B].view(-1, 1)).view(\n            -1\n        )\n\n        # Append the last predictions\n        self.growing_beam = torch.cat(\n            [\n                self.growing_beam.index_select(0, surviving_beams_rows),\n                topk_token_ids.view(-1, 1),\n            ],\n            1,\n        )\n\n        # Check if any of the beam searches has ended during this\n        # growth step. 
Also if top beam (most probable) has ended\n        # for one element of the batch.\n        is_finished = topk_token_ids.eq(self.end_token_id)\n        self.enforce_max_length(is_finished)\n        is_top_beam_finished = is_finished[:, 0].eq(1)\n\n        # Save the finished searches\n        if is_finished.any():\n            predictions = self.growing_beam.view(\n                -1, self.beam_size, self.growing_beam.size(1)\n            )\n            for i in range(is_finished.size(0)):\n                if is_top_beam_finished[i]:\n                    is_finished[i].fill_(1)\n                finished_hyp = is_finished[i].nonzero().view(-1)\n\n                # Store finished hypotheses for this batch.\n                b = self.batch_offset[i]\n                for j in finished_hyp:\n                    self.hypotheses[b].append((topk_scores[i, j], predictions[i, j, :]))\n\n                # If the batch reached the end, save the best hypotheses\n                # in terms of length-penalized score.\n                if is_top_beam_finished[i]:\n                    best_hyp = sorted(\n                        self.hypotheses[b], key=lambda x: x[0], reverse=True\n                    )\n                    best_score, best_prediction = best_hyp[0]\n                    self.results[\"scores\"][b].append(best_score)\n                    self.results[\"predictions\"][b].append(best_prediction)\n\n            non_finished = is_top_beam_finished.eq(0).nonzero().view(-1)\n            if len(non_finished) == 0:\n                self.is_done = True\n\n            # Remove finished batches for the next step.\n            self.topk_log_probabilities = topk_log_probabilities.index_select(\n                0, non_finished\n            )\n            self.batch_offset = self.batch_offset.index_select(0, non_finished)\n            self.growing_beam = predictions.index_select(0, non_finished).view(\n                -1, self.growing_beam.size(-1)\n            )\n\n            surviving_beams_rows = (\n                surviving_beams_rows.view(_B, self.beam_size)\n                .index_select(0, non_finished)\n                .view(-1)\n            )\n\n        return surviving_beams_rows\n\n    def forward(self, encoder_input_ids, **kwargs):\n        # keyword arguments come in 3 flavors: encoder-specific (prefixed by\n        # `encoder_`), decoder-specific (prefixed by `decoder_`) and those\n        # that apply to the model as whole.\n        # We let the specific kwargs override the common ones in case of conflict.\n        kwargs_encoder = {\n            argument[len(\"encoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"encoder_\")\n        }\n        kwargs_decoder = {\n            argument[len(\"decoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"decoder_\")\n        }\n        kwargs_common = {\n            argument: value\n            for argument, value in kwargs.items()\n            if not (argument.startswith(\"encoder_\") or argument.startswith(\"decoder_\"))\n        }\n        kwargs_decoder = dict(kwargs_common, **kwargs_decoder)\n        kwargs_encoder = dict(kwargs_common, **kwargs_encoder)\n\n        # forward pass on the encoder\n        encoder_outputs = self.model.encoder(encoder_input_ids, **kwargs_encoder)\n        kwargs_decoder[\"encoder_hidden_states\"] = tile(\n            encoder_outputs, self.beam_size, dim=0\n        )\n\n        # grow the beam by generating sequences in an autoregressive way\n        self.growing_beam = torch.full(\n            (self.batch_size * 
self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        for step in range(self.max_length):\n            decoder_input = self.growing_beam[:, -1]\n            outputs = self.model.decoder(decoder_input, **kwargs_decoder)\n            log_probabilities = torch.nn.functional.log_softmax(outputs[1], dim=-1)\n            surviving_beams_rows = self.step(log_probabilities)\n            if self.is_done:\n                break\n\n            kwargs_decoder[\"encoder_hidden_states\"] = kwargs_decoder[\n                \"encoder_hidden_states\"\n            ].index_select(0, surviving_beams_rows)\n\n        return self.results\n\n    def remove_repeating_trigrams(self, log_probabilities, _B):\n        # Penalize beams whose newest trigram already appeared earlier in the beam.\n        if self._step + 1 > 3:\n            for i in range(_B * self.beam_size):\n                tokens = [t for t in self.growing_beam[i]]\n                trigrams = [(tokens[j - 1], tokens[j], tokens[j + 1]) for j in range(1, len(tokens) - 1)]\n                last_trigram = tuple(trigrams[-1])\n                if last_trigram in trigrams[:-1]:\n                    log_probabilities[i] = -1e20\n\n    def enforce_min_length(self, log_probabilities):\n        # Block the end token until the minimum length has been generated.\n        if self._step < self.min_length:\n            log_probabilities[:, self.end_token_id] = -1e20\n\n    def enforce_max_length(self, is_finished):\n        # Mark every beam as finished once the maximum length is reached.\n        if self._step + 1 == self.max_length:\n            is_finished.fill_(1)\n\n    def length_penalty(self):\n        return ((5.0 + (self._step + 1)) / 6.0) ** self.alpha\n\n\ndef tile(x, count, dim=0):\n    \"\"\"\n    Tiles `x` along dimension `dim` `count` times.\n\n    Example:\n        >> ex = torch.tensor([[1,2],[3,4]])\n        >> tile(ex, 2, 0)\n        torch.Tensor([[1,2],[1,2],[3,4],[3,4]])\n    \"\"\"\n    perm = list(range(len(x.size())))\n    if dim != 0:\n        perm[0], perm[dim] = perm[dim], perm[0]\n        x = x.permute(perm).contiguous()\n    out_size = list(x.size())\n    out_size[0] *= count\n    batch = x.size(0)\n    x = (\n        x.view(batch, -1)\n        .transpose(0, 1)\n        .repeat(count, 1)\n        .transpose(0, 1)\n        .contiguous()\n        .view(*out_size)\n    )\n    if dim != 0:\n        x = x.permute(perm).contiguous()\n    return x\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BERT model. \"\"\"\n\n\nimport logging\nimport math\nimport os\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import gelu, gelu_new, swish\nfrom .configuration_bert import BertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"bert-base-german-dbmdz-cased\",\n    \"bert-base-german-dbmdz-uncased\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\"gelu\": gelu, \"relu\": torch.nn.functional.relu, \"swish\": swish, \"gelu_new\": gelu_new, \"mish\": mish}\n\n\nBertLayerNorm = torch.nn.LayerNorm\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any 
TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n\n        seq_length = input_shape[1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = 
self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        
head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass BertEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        
all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        
seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, BertLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass BertModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = BertEncoder(config)\n        self.pooler = BertPooler(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n 
       # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # 
(loss), prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass BertForMaskedLM(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n     
       from transformers1 import BertTokenizer, BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        if lm_labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            lm_labels = lm_labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            ltr_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (ltr_lm_loss,) + outputs\n\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    
\"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\", BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')\n\n        loss, logits = model(**encoding, next_sentence_label=torch.LongTensor([1]))\n        assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            
head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_score = self.cls(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import 
torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        labels = torch.tensor(0) # choice0 is correct (according to Wikipedia ;))\n\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='pt', pad_to_max_length=True)\n        outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1\n\n        # the linear classifier still needs to be trained\n        loss, logits = outputs[:2]\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # 
Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2019 Inria, Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch CamemBERT model. \"\"\"\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"camembert-base\",\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the\n            configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD\n    (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits` \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForQuestionAnswering(RobertaForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size, dtype):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(\n        torch.arange(position, dtype=dtype).unsqueeze(1),\n        torch.arange(d_model_size, dtype=dtype).unsqueeze(0),\n        d_model_size,\n    )\n\n    sines = torch.sin(angle_rads[:, 0::2])\n    cosines = torch.cos(angle_rads[:, 1::2])\n\n    pos_encoding = torch.cat([sines, cosines], dim=-1)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = torch.matmul(q, k.permute(0, 1, 3, 2))\n\n    dk = k.shape[-1]\n    scaled_attention_logits = matmul_qk / np.sqrt(dk)\n\n    if mask is not None:\n        nd, ns = scaled_attention_logits.size(-2), scaled_attention_logits.size(-1)\n        scaled_attention_logits += mask[ns - nd : ns, :ns] * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = torch.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass MultiHeadAttention(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, output_attentions=False):\n        super().__init__()\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wk = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wv = torch.nn.Linear(d_model_size, d_model_size)\n\n        self.dense = torch.nn.Linear(d_model_size, d_model_size)\n\n    def split_into_heads(self, x, batch_size):\n        x = x.reshape(batch_size, -1, self.num_heads, self.depth)\n        return x.permute([0, 2, 1, 3])\n\n    def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        batch_size = q.shape[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0], layer_past[1]\n            k = torch.cat((past_key, k), dim=-2)\n            v = torch.cat((past_value, v), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((k, v))\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = output[0].permute([0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff):\n    return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff), torch.nn.ReLU(), torch.nn.Linear(dff, d_model_size))\n\n\nclass EncoderLayer(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):\n        super().__init__()\n\n        self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff)\n\n        self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n        self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n\n        self.dropout1 = torch.nn.Dropout(rate)\n        self.dropout2 = torch.nn.Dropout(rate)\n\n    def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            normed,\n            normed,\n            normed,\n            mask,\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\nclass CTRLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            
module.weight.data.fill_(1.0)\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should be passed as input_ids.\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The input_ids which have their past given to this model should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)\n\n        self.w = nn.Embedding(config.vocab_size, config.n_embd)\n\n        self.dropout = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList(\n            [\n                EncoderLayer(config.n_embd, config.n_head, config.dff, config.resid_pdrop, config.output_attentions)\n                for _ in range(config.n_layer)\n            ]\n        )\n        self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def set_input_embeddings(self, new_embeddings):\n        self.w = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to 
speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import CTRLTokenizer, CTRLModel\n        import torch\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLModel.from_pretrained('ctrl')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these 
entirely.\n            attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n            token_type_embeds = self.w(token_type_ids)\n            token_type_embeds *= np.sqrt(self.d_model_size)\n        else:\n            token_type_embeds = 0\n        position_ids = position_ids.view(-1, input_shape[-1])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids)\n        # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded\n        seq_len = input_shape[-1]\n        mask = torch.triu(torch.ones(seq_len + past_length, seq_len + past_length), 1).to(inputs_embeds.device)\n\n        inputs_embeds *= np.sqrt(self.d_model_size)\n\n        pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states)\n\n        output_shape = input_shape + (inputs_embeds.size(-1),)\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n            outputs = h(\n                hidden_states,\n                mask,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = hidden_states.view(*output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLLMHeadModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = CTRLModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import CTRLTokenizer, CTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch DistilBERT model\n    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)\n    and in part from HuggingFace PyTorch version of Google AI Bert model (https://github.com/google-research/bert)\n\"\"\"\n\n\nimport copy\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n\nDISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-german-cased\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\nclass Embeddings(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)\n        if config.sinusoidal_pos_embds:\n            create_sinusoidal_embeddings(\n                n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight\n            )\n\n        self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(self, input_ids):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: torch.tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: torch.tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        seq_length = input_ids.size(1)\n        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)  # (max_seq_length)\n        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (bs, max_seq_length)\n\n        word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, 
dim)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings)  # (bs, max_seq_length, dim)\n        return embeddings\n\n\nclass MultiHeadSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = nn.Dropout(p=config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, query, key, value, mask, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        query: torch.tensor(bs, seq_length, dim)\n        key: torch.tensor(bs, seq_length, dim)\n        value: torch.tensor(bs, seq_length, dim)\n        mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: torch.tensor(bs, seq_length, dim)\n            Contextualized layer. 
Optional: only if `output_attentions=True`\n        \"\"\"\n        bs, q_length, dim = query.size()\n        k_length = key.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshp = (bs, 1, 1, k_length)\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)\n        mask = (mask == 0).view(mask_reshp).expand_as(scores)  # (bs, n_heads, q_length, k_length)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, q_length, k_length)\n\n        weights = nn.Softmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)\n        weights = self.dropout(weights)  # (bs, n_heads, q_length, k_length)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, weights)\n        else:\n            return (context,)\n\n\nclass FFN(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)\n        self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = gelu if config.activation == \"gelu\" else nn.ReLU()\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x)\n        return x\n\n\nclass TransformerBlock(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = MultiHeadSelfAttention(config)\n        self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n        self.ffn = FFN(config)\n        self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n        attn_mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: torch.tensor(bs, seq_length, dim)\n            The output of the transformer block 
contextualization.\n        \"\"\"\n        # Self-Attention\n        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask, head_mask=head_mask)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass Transformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        layer = TransformerBlock(config)\n        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: torch.tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: torch.tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module(x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i])\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass DistilBertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained 
models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    load_tf_weights = None\n    base_model_prefix = \"distilbert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, nn.Embedding):\n            if module.weight.requires_grad:\n                module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.DistilBertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputting raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertModel(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = Embeddings(config)  # Embeddings\n        self.transformer = Transformer(config)  # Encoder\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings.word_embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.transformer.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertModel\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertModel.from_pretrained('distilbert-base-cased')\n\n        
input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)  # (bs, seq_length)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)\n        hidden_state = tfmr_output[0]\n        output = (hidden_state,) + tfmr_output[1:]\n\n        return output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForMaskedLM(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.distilbert = DistilBertModel(config)\n        self.vocab_transform = nn.Linear(config.dim, config.dim)\n        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)\n\n        self.init_weights()\n\n        self.mlm_loss_fct = nn.CrossEntropyLoss()\n\n    def get_output_embeddings(self):\n        return self.vocab_projector\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, masked_lm_labels=None):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForMaskedLM\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        dlbrt_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = dlbrt_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = gelu(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)\n\n        outputs = (prediction_logits,) + dlbrt_output[1:]\n        if masked_lm_labels is not None:\n            mlm_loss = self.mlm_loss_fct(\n                prediction_logits.view(-1, prediction_logits.size(-1)), masked_lm_labels.view(-1)\n            )\n            outputs = (mlm_loss,) + outputs\n\n        return outputs  # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForSequenceClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.pre_classifier = nn.Linear(config.dim, config.dim)\n        self.classifier = nn.Linear(config.dim, config.num_labels)\n        self.dropout = nn.Dropout(config.seq_classif_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForSequenceClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = 
nn.ReLU()(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output)  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        if labels is not None:\n            if self.num_labels == 1:\n                loss_fct = nn.MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = nn.CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForQuestionAnswering(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.distilbert = DistilBertModel(config)\n        self.qa_outputs = nn.Linear(config.dim, config.num_labels)\n        assert config.num_labels == 2\n        self.dropout = nn.Dropout(config.qa_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForQuestionAnswering\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss, start_scores, end_scores = outputs[:3]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n\n        hidden_states = self.dropout(hidden_states)  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)  # (bs, max_query_len)\n        end_logits = end_logits.squeeze(-1)  # (bs, max_query_len)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForTokenClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForTokenClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.distilbert(\n            input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = 
torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_electra.py",
    "content": "import logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import get_activation\nfrom .configuration_electra import ElectraConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertEncoder, BertLayerNorm, BertPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\ndef load_tf_weights_in_electra(model, config, tf_checkpoint_path, discriminator_or_generator=\"discriminator\"):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n    for name, array in zip(names, arrays):\n        original_name: str = name\n\n        try:\n            if isinstance(model, ElectraForMaskedLM):\n                name = name.replace(\"electra/embeddings/\", \"generator/embeddings/\")\n\n            if discriminator_or_generator == \"generator\":\n                name = name.replace(\"electra/\", \"discriminator/\")\n                name = name.replace(\"generator/\", \"electra/\")\n\n            name = name.replace(\"dense_1\", \"dense_prediction\")\n            name = name.replace(\"generator_predictions/output_bias\", \"generator_lm_head/bias\")\n\n            name = name.split(\"/\")\n            # print(original_name, name)\n            # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n            # which are not required for using pretrained model\n            if any(n in [\"global_step\", \"temperature\"] for n in name):\n                logger.info(\"Skipping {}\".format(original_name))\n                continue\n            pointer = model\n            for m_name in name:\n                if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                    scope_names = re.split(r\"_(\\d+)\", m_name)\n                else:\n                    scope_names = [m_name]\n                if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                    pointer = getattr(pointer, \"weight\")\n                elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                    pointer = getattr(pointer, \"bias\")\n                elif scope_names[0] == \"output_weights\":\n                    pointer = getattr(pointer, \"weight\")\n                elif 
scope_names[0] == \"squad\":\n                    pointer = getattr(pointer, \"classifier\")\n                else:\n                    pointer = getattr(pointer, scope_names[0])\n                if len(scope_names) >= 2:\n                    num = int(scope_names[1])\n                    pointer = pointer[num]\n            if m_name.endswith(\"_embeddings\"):\n                pointer = getattr(pointer, \"weight\")\n            elif m_name == \"kernel\":\n                array = np.transpose(array)\n            try:\n                assert pointer.shape == array.shape, original_name\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            print(\"Initialize PyTorch weight {}\".format(name), original_name)\n            pointer.data = torch.from_numpy(array)\n        except AttributeError as e:\n            print(\"Skipping {}\".format(original_name), name, e)\n            continue\n    return model\n\n\nclass ElectraEmbeddings(BertEmbeddings):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass ElectraDiscriminatorPredictions(nn.Module):\n    \"\"\"Prediction module for the discriminator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dense_prediction = nn.Linear(config.hidden_size, 1)\n        self.config = config\n\n    def forward(self, discriminator_hidden_states, attention_mask):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = get_activation(self.config.hidden_act)(hidden_states)\n        logits = self.dense_prediction(hidden_states).squeeze()\n\n        return logits\n\n\nclass ElectraGeneratorPredictions(nn.Module):\n    \"\"\"Prediction module for the generator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = BertLayerNorm(config.embedding_size)\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n\n    def forward(self, generator_hidden_states):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = get_activation(\"gelu\")(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass ElectraPreTrainedModel(BertPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ElectraConfig\n    load_tf_weights = load_tf_weights_in_electra\n    base_model_prefix = \"electra\"\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular 
PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. 
Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraModel(ElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.embeddings = ElectraEmbeddings(config)\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = nn.Linear(config.embedding_size, config.hidden_size)\n\n        self.encoder = BertEncoder(config)\n        self.config = config\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import ElectraModel, ElectraTokenizer\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraModel.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        hidden_states = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states)\n\n        hidden_states = self.encoder(hidden_states, attention_mask=extended_attention_mask, head_mask=head_mask)\n\n        return hidden_states\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"ELECTRA Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForSequenceClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.electra = ElectraModel(config)\n        self.classifier = ElectraClassificationHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n\n        sequence_output = discriminator_hidden_states[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + discriminator_hidden_states[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n         
       #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a binary classification head on top as used during pre-training for identifying generated\n    tokens.\n\n    It is recommended to load the discriminator checkpoint into that model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForPreTraining(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.discriminator_predictions = ElectraDiscriminatorPredictions(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates the token is an original token,\n            ``1`` indicates the token was replaced.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss of the ELECTRA objective.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`)\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForPreTraining\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = 
outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.BCEWithLogitsLoss()\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1, discriminator_sequence_output.shape[1]) == 1\n                active_logits = logits.view(-1, discriminator_sequence_output.shape[1])[active_loss]\n                active_labels = labels[active_loss]\n                loss = loss_fct(active_logits, active_labels.float())\n            else:\n                loss = loss_fct(logits.view(-1, discriminator_sequence_output.shape[1]), labels.float())\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a language modeling head on top.\n\n    Even though both the discriminator and generator may be loaded into this model, the generator is\n    the only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForMaskedLM(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.generator_predictions = ElectraGeneratorPredictions(config)\n\n        self.generator_lm_head = nn.Linear(config.embedding_size, config.vocab_size)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape 
:obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import ElectraTokenizer, ElectraForMaskedLM\n            import torch\n\n            tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n            model = ElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        generator_sequence_output = generator_hidden_states[0]\n\n        prediction_scores = self.generator_predictions(generator_sequence_output)\n        prediction_scores = self.generator_lm_head(prediction_scores)\n\n        output = (prediction_scores,)\n\n        # Masked language modeling softmax layer\n        if masked_lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()  # -100 index = padding token\n            loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            output = (loss,) + output\n\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a token classification head on top.\n\n    Both the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForTokenClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape 
:obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForTokenClassification\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.config.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\nimport logging\nfrom typing import Optional\n\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderModel(PreTrainedModel):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoder` is a generic model class that will be\n        instantiated as a transformer architecture with one of the base model\n        classes of the library as encoder and another one as\n        decoder when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method for the encoder and `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` class method for the decoder.\n    \"\"\"\n    config_class = EncoderDecoderConfig\n    base_model_prefix = \"encoder_decoder\"\n\n    def __init__(\n        self,\n        config: Optional[PretrainedConfig] = None,\n        encoder: Optional[PreTrainedModel] = None,\n        decoder: Optional[PreTrainedModel] = None,\n    ):\n        assert config is not None or (\n            encoder is not None and decoder is not None\n        ), \"Either a configuration or an Encoder and a decoder has to be provided\"\n        if config is None:\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config)\n        else:\n            assert isinstance(config, self.config_class), \"config: {} has to be of type {}\".format(\n                config, self.config_class\n            )\n        # initialize with config\n        super().__init__(config)\n\n        if encoder is None:\n            from transformers import AutoModel\n\n            encoder = AutoModel.from_config(config.encoder)\n\n        if decoder is None:\n            from transformers import AutoModelWithLMHead\n\n            decoder = AutoModelWithLMHead.from_config(config.decoder)\n\n        self.encoder = encoder\n        self.decoder = decoder\n        assert (\n            self.encoder.get_output_embeddings() is None\n        ), \"The encoder {} should not have a LM Head. 
Please use a model without LM Head\"\n\n    def tie_weights(self):\n        # for now no weights tying in encoder-decoder\n        pass\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def get_input_embeddings(self):\n        return self.encoder.get_input_embeddings()\n\n    def get_output_embeddings(self):\n        return self.decoder.get_output_embeddings()\n\n    @classmethod\n    def from_encoder_decoder_pretrained(\n        cls,\n        encoder_pretrained_model_name_or_path: str = None,\n        decoder_pretrained_model_name_or_path: str = None,\n        *model_args,\n        **kwargs\n    ) -> PreTrainedModel:\n        r\"\"\" Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints.\n\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated).\n        To train the model, you need to first set it back in training mode with `model.train()`.\n\n        Params:\n            encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the encoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/encoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the decoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/decoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments.\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n        Examples::\n\n            from transformers1 import EncoderDecoder\n\n            model = EncoderDecoder.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n        \"\"\"\n\n        kwargs_encoder = {\n            argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")\n        }\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        # Load and initialize the encoder and decoder\n        # The distinction between encoder and decoder at the model level is made\n        # by the value of the flag `is_decoder` that we need to set correctly.\n        encoder = kwargs_encoder.pop(\"model\", None)\n        if encoder is None:\n            assert (\n                encoder_pretrained_model_name_or_path is not None\n            ), \"If `model` is not defined as an argument, a `encoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModel\n\n            encoder = AutoModel.from_pretrained(encoder_pretrained_model_name_or_path, *model_args, **kwargs_encoder)\n        encoder.config.is_decoder = False\n\n        decoder = kwargs_decoder.pop(\"model\", None)\n        if decoder is None:\n            assert (\n                decoder_pretrained_model_name_or_path is not None\n            ), \"If `decoder_model` is not defined as an argument, a `decoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModelWithLMHead\n\n            if \"config\" not in kwargs_decoder:\n                from transformers import AutoConfig\n\n                decoder_config = AutoConfig.from_pretrained(decoder_pretrained_model_name_or_path)\n                if decoder_config.is_decoder is False:\n                    logger.info(\n                        f\"Initializing {decoder_pretrained_model_name_or_path} as a decoder model. Cross attention layers are added to {decoder_pretrained_model_name_or_path} and randomly initialized if {decoder_pretrained_model_name_or_path}'s architecture allows for cross attention layers.\"\n                    )\n                    decoder_config.is_decoder = True\n\n                kwargs_decoder[\"config\"] = decoder_config\n\n            if kwargs_decoder[\"config\"].is_decoder is False:\n                logger.warning(\n                    f\"Decoder model {decoder_pretrained_model_name_or_path} is not initialized as a decoder. 
In order to initialize {decoder_pretrained_model_name_or_path} as a decoder, make sure that the attribute `is_decoder` of `decoder_config` passed to `.from_encoder_decoder_pretrained(...)` is set to `True` or do not pass a `decoder_config` to `.from_encoder_decoder_pretrained(...)`\"\n                )\n\n            decoder = AutoModelWithLMHead.from_pretrained(decoder_pretrained_model_name_or_path, **kwargs_decoder)\n\n        return cls(encoder=encoder, decoder=decoder)\n\n    def forward(\n        self,\n        input_ids=None,\n        inputs_embeds=None,\n        attention_mask=None,\n        head_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_head_mask=None,\n        decoder_inputs_embeds=None,\n        masked_lm_labels=None,\n        lm_labels=None,\n        **kwargs,\n    ):\n\n        \"\"\"\n        Args:\n            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n                Indices of input sequence tokens in the vocabulary for the encoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Mask to avoid performing attention on padding token indices for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n                Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n                `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n                Used in the cross-attention of the decoder.\n            decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n                Provide for sequence to sequence training to the decoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for 
details.\n            decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n                Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            decoder_head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the decoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the masked language modeling loss for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the left-to-right language modeling loss (next word prediction) for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            kwargs: (`optional`) Remaining dictionary of keyword arguments. 
Keyword arguments come in two flavors:\n                - Without a prefix which will be input as `**encoder_kwargs` for the encoder forward function.\n                - With a `decoder_` prefix which will be input as `**decoder_kwargs` for the decoder forward function.\n\n        Examples::\n\n            from transformers1 import EncoderDecoderModel, BertTokenizer\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n\n            # forward\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n\n            # training\n            loss, outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)[:2]\n\n            # generation\n            generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)\n\n        \"\"\"\n\n        kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith(\"decoder_\")}\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids,\n                attention_mask=attention_mask,\n                inputs_embeds=inputs_embeds,\n                head_mask=head_mask,\n                **kwargs_encoder,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            inputs_embeds=decoder_inputs_embeds,\n            attention_mask=decoder_attention_mask,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=decoder_head_mask,\n            lm_labels=lm_labels,\n            masked_lm_labels=masked_lm_labels,\n            **kwargs_decoder,\n        )\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if type(past) is tuple:\n            encoder_outputs = past\n        else:\n            encoder_outputs = (past,)\n\n        decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids)\n\n        return {\n            \"attention_mask\": attention_mask,\n            \"decoder_attention_mask\": decoder_inputs[\"attention_mask\"],\n            \"decoder_input_ids\": decoder_inputs[\"input_ids\"],\n            \"encoder_outputs\": encoder_outputs,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # as a default encoder-decoder models do not re-order the past.\n        # TODO(PVP): might have to be updated, e.g. if GPT2 is to be used as a decoder\n        return past\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Flaubert model, based on XLM. \"\"\"\n\n\nimport logging\nimport random\n\nimport torch\nfrom torch.nn import functional as F\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_xlm import (\n    XLMForQuestionAnswering,\n    XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n    get_masks,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"flaubert/flaubert_small_cased\",\n    \"flaubert/flaubert_base_uncased\",\n    \"flaubert/flaubert_base_cased\",\n    \"flaubert/flaubert_large_cased\",\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertModel(XLMModel):\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    @add_start_docstrings_to_callable(FLAUBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            
of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import FlaubertTokenizer, FlaubertModel\n        import torch\n\n        tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')\n        model = FlaubertModel.from_pretrained('flaubert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Le chat mange une pomme.\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # removed: src_enc=None, src_len=None\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + 
self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.config.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](tensor_normalized, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertWithLMHeadModel(XLMWithLMHeadModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMWithLMHeadModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForSequenceClassification(XLMForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnsweringSimple`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnswering(XLMForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT-2 model.\"\"\"\n\n\nimport logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import ACT2FN\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import re\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(gpt2_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array.squeeze())\n\n    for name, array in zip(names, arrays):\n        name = name[6:]  # skip \"model/\"\n        name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"w\" or scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"wpe\" or scope_names[0] == \"wte\":\n                pointer = getattr(pointer, scope_names[0])\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        
pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\n            \"bias\", torch.tril(torch.ones((n_ctx, n_ctx), dtype=torch.uint8)).view(1, 1, n_ctx, n_ctx)\n        )\n        self.register_buffer(\"masked_bias\", torch.tensor(-1e4))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / (float(v.size(-1)) ** 0.5)\n        nd, ns = w.size(-2), w.size(-1)\n        mask = self.bias[:, :, ns - nd : ns, :ns]\n        w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)  # (batch, head, head_features, seq_length)\n        else:\n            return x.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)\n\n    def forward(self, x, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0].transpose(-2, -1), layer_past[1]  # transpose back cf below\n            key = torch.cat((past_key, key), dim=-1)\n            value = torch.cat((past_value, value), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((key.transpose(-2, -1), value))  # transpose to have same shapes for stacking\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT2FN[config.activation_function]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n\n    def forward(self, x, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        output_attn = self.attn(\n            self.ln_1(x),\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n\n        x = x + a\n        m = self.mlp(self.ln_2(x))\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\nclass GPT2PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    load_tf_weights = load_tf_weights_in_gpt2\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nGPT2_START_DOCSTRING = 
r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The `input_ids` which have their past given to this model should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`, defaults to :obj:`None`):\n            `input_ids_length` = `sequence_length if `past` is None else 1\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2Model(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.wte = nn.Embedding(config.vocab_size, config.n_embd)\n        self.wpe = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n        self.ln_f = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def set_input_embeddings(self, new_embeddings):\n        self.wte = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            If `past` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when 
``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import GPT2Tokenizer, GPT2Model\n        import torch\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2Model.from_pretrained('gpt2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n        if position_ids is not None:\n            position_ids = position_ids.view(-1, input_shape[-1])\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the 
raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids)\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_embeds = self.wte(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n            outputs = block(\n                hidden_states,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = hidden_states.view(*output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (presents), (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2LMHeadModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2DoubleHeadsModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        config.num_labels = 1\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. 
you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = 
[tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2\n        mc_token_ids = torch.tensor([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch Longformer model. \"\"\"\n\nimport logging\nimport math\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .configuration_longformer import LongformerConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertPreTrainedModel\nfrom .modeling_roberta import RobertaLMHead, RobertaModel\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n    # See all Longformer models at https://huggingface.co/models?filter=longformer\n]\n\n\ndef _get_question_end_index(input_ids, sep_token_id):\n    \"\"\"\n        Computes the index of the first occurance of `sep_token_id`.\n    \"\"\"\n\n    sep_token_indices = (input_ids == sep_token_id).nonzero()\n    batch_size = input_ids.shape[0]\n\n    assert sep_token_indices.shape[1] == 2, \"`input_ids` should have two dimensions\"\n    assert (\n        sep_token_indices.shape[0] == 3 * batch_size\n    ), f\"There should be exactly three separator tokens: {sep_token_id} in every sample for questions answering. 
You might also consider to set `global_attention_mask` manually in the forward function to avoid this error.\"\n\n    return sep_token_indices.view(batch_size, 3, 2)[:, 0, 1]\n\n\ndef _compute_global_attention_mask(input_ids, sep_token_id, before_sep_token=True):\n    \"\"\"\n        Computes global attention mask by putting attention on all tokens\n        before `sep_token_id` if `before_sep_token is True` else after\n        `sep_token_id`.\n    \"\"\"\n\n    question_end_index = _get_question_end_index(input_ids, sep_token_id)\n    question_end_index = question_end_index.unsqueeze(dim=1)  # size: batch_size x 1\n    # bool attention mask with True in locations of global attention\n    attention_mask = torch.arange(input_ids.shape[1], device=input_ids.device)\n    if before_sep_token is True:\n        attention_mask = (attention_mask.expand_as(input_ids) < question_end_index).to(torch.uint8)\n    else:\n        # last token is separation token and should not be counted and in the middle are two separation tokens\n        attention_mask = (attention_mask.expand_as(input_ids) > (question_end_index + 1)).to(torch.uint8) * (\n            attention_mask.expand_as(input_ids) < input_ids.shape[-1]\n        ).to(torch.uint8)\n\n    return attention_mask\n\n\nclass LongformerSelfAttention(nn.Module):\n    def __init__(self, config, layer_id):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n        self.num_heads = config.num_attention_heads\n        self.head_dim = int(config.hidden_size / config.num_attention_heads)\n        self.embed_dim = config.hidden_size\n\n        self.query = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value = nn.Linear(config.hidden_size, self.embed_dim)\n\n        # separate projection layers for tokens with global attention\n        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)\n\n        self.dropout = config.attention_probs_dropout_prob\n\n        self.layer_id = layer_id\n        attention_window = config.attention_window[self.layer_id]\n        assert (\n            attention_window % 2 == 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}\"\n        assert (\n            attention_window > 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be positive. 
Given {attention_window}\"\n\n        self.one_sided_attention_window_size = attention_window // 2\n\n    @staticmethod\n    def _skew(x, direction):\n        \"\"\"Convert diagonals into columns (or columns into diagonals depending on `direction`\"\"\"\n        x_padded = F.pad(x, direction)  # padding value is not important because it will be overwritten\n        x_padded = x_padded.view(*x_padded.size()[:-2], x_padded.size(-1), x_padded.size(-2))\n        return x_padded\n\n    @staticmethod\n    def _skew2(x):\n        \"\"\"shift every row 1 step to right converting columns into diagonals\"\"\"\n        # X = B x C x M x L\n        B, C, M, L = x.size()\n        x = F.pad(x, (0, M + 1))  # B x C x M x (L+M+1). Padding value is not important because it'll be overwritten\n        x = x.view(B, C, -1)  # B x C x ML+MM+M\n        x = x[:, :, :-M]  # B x C x ML+MM\n        x = x.view(B, C, M, M + L)  # B x C, M x L+M\n        x = x[:, :, :, :-1]\n        return x\n\n    @staticmethod\n    def _chunk(x, w):\n        \"\"\"convert into overlapping chunkings. Chunk size = 2w, overlap size = w\"\"\"\n\n        # non-overlapping chunks of size = 2w\n        x = x.view(x.size(0), x.size(1) // (w * 2), w * 2, x.size(2))\n\n        # use `as_strided` to make the chunks overlap with an overlap size = w\n        chunk_size = list(x.size())\n        chunk_size[1] = chunk_size[1] * 2 - 1\n\n        chunk_stride = list(x.stride())\n        chunk_stride[1] = chunk_stride[1] // 2\n        return x.as_strided(size=chunk_size, stride=chunk_stride)\n\n    def _mask_invalid_locations(self, input_tensor, w) -> torch.Tensor:\n        affected_seqlen = w\n        beginning_mask_2d = input_tensor.new_ones(w, w + 1).tril().flip(dims=[0])\n        beginning_mask = beginning_mask_2d[None, :, None, :]\n        ending_mask = beginning_mask.flip(dims=(1, 3))\n        seqlen = input_tensor.size(1)\n        beginning_input = input_tensor[:, :affected_seqlen, :, : w + 1]\n        beginning_mask = beginning_mask[:, :seqlen].expand(beginning_input.size())\n        beginning_input.masked_fill_(beginning_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n        ending_input = input_tensor[:, -affected_seqlen:, :, -(w + 1) :]\n        ending_mask = ending_mask[:, -seqlen:].expand(ending_input.size())\n        ending_input.masked_fill_(ending_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n\n    def _sliding_chunks_matmul_qk(self, q: torch.Tensor, k: torch.Tensor, w: int):\n        \"\"\"Matrix multiplicatio of query x key tensors using with a sliding window attention pattern.\n        This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)\n        with an overlap of size w\"\"\"\n        batch_size, seqlen, num_heads, head_dim = q.size()\n        assert seqlen % (w * 2) == 0, f\"Sequence length should be multiple of {w * 2}. 
Given {seqlen}\"\n        assert q.size() == k.size()\n\n        chunks_count = seqlen // w - 1\n\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2\n        q = q.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n        k = k.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        chunk_q = self._chunk(q, w)\n        chunk_k = self._chunk(k, w)\n\n        # matrix multipication\n        # bcxd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcyd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcxy: batch_size * num_heads x chunks x 2w x 2w\n        chunk_attn = torch.einsum(\"bcxd,bcyd->bcxy\", (chunk_q, chunk_k))  # multiply\n\n        # convert diagonals into columns\n        diagonal_chunk_attn = self._skew(chunk_attn, direction=(0, 0, 0, 1))\n\n        # allocate space for the overall attention matrix where the chunks are compined. The last dimension\n        # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to\n        # w previous words). The following column is attention score from each word to itself, then\n        # followed by w columns for the upper triangle.\n\n        diagonal_attn = diagonal_chunk_attn.new_empty((batch_size * num_heads, chunks_count + 1, w, w * 2 + 1))\n\n        # copy parts from diagonal_chunk_attn into the compined matrix of attentions\n        # - copying the main diagonal and the upper triangle\n        diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, : w + 1]\n        diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, : w + 1]\n        # - copying the lower triangle\n        diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, -(w + 1) : -1, w + 1 :]\n        diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, : w - 1, 1 - w :]\n\n        # separate batch_size and num_heads dimensions again\n        diagonal_attn = diagonal_attn.view(batch_size, num_heads, seqlen, 2 * w + 1).transpose(2, 1)\n\n        self._mask_invalid_locations(diagonal_attn, w)\n        return diagonal_attn\n\n    def _sliding_chunks_matmul_pv(self, prob: torch.Tensor, v: torch.Tensor, w: int):\n        \"\"\"Same as _sliding_chunks_matmul_qk but for prob and value tensors. 
It is expecting the same output\n        format from _sliding_chunks_matmul_qk\"\"\"\n        batch_size, seqlen, num_heads, head_dim = v.size()\n        assert seqlen % (w * 2) == 0\n        assert prob.size()[:3] == v.size()[:3]\n        assert prob.size(3) == 2 * w + 1\n        chunks_count = seqlen // w - 1\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size 2w\n        chunk_prob = prob.transpose(1, 2).reshape(batch_size * num_heads, seqlen // w, w, 2 * w + 1)\n\n        # group batch_size and num_heads dimensions into one\n        v = v.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        # pad seqlen with w at the beginning of the sequence and another w at the end\n        padded_v = F.pad(v, (0, 0, w, w), value=-1)\n\n        # chunk padded_v into chunks of size 3w and an overlap of size w\n        chunk_v_size = (batch_size * num_heads, chunks_count + 1, 3 * w, head_dim)\n        chunk_v_stride = padded_v.stride()\n        chunk_v_stride = chunk_v_stride[0], w * chunk_v_stride[1], chunk_v_stride[1], chunk_v_stride[2]\n        chunk_v = padded_v.as_strided(size=chunk_v_size, stride=chunk_v_stride)\n\n        skewed_prob = self._skew2(chunk_prob)\n\n        context = torch.einsum(\"bcwd,bcdh->bcwh\", (skewed_prob, chunk_v))\n        return context.view(batch_size, num_heads, seqlen, head_dim).transpose(1, 2)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        \"\"\"\n        LongformerSelfAttention expects `len(hidden_states)` to be multiple of `attention_window`.\n        Padding to `attention_window` happens in LongformerModel.forward to avoid redoing the padding on each layer.\n\n        The `attention_mask` is changed in `BertModel.forward` from 0, 1, 2 to\n            -ve: no attention\n              0: local attention\n            +ve: global attention\n\n        `encoder_hidden_states` and `encoder_attention_mask` are not supported and should be None\n        \"\"\"\n        # TODO: add support for `encoder_hidden_states` and `encoder_attention_mask`\n        assert encoder_hidden_states is None, \"`encoder_hidden_states` is not supported and should be None\"\n        assert encoder_attention_mask is None, \"`encoder_attention_mask` is not supported and shiould be None\"\n\n        if attention_mask is not None:\n            attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)\n            key_padding_mask = attention_mask < 0\n            extra_attention_mask = attention_mask > 0\n            remove_from_windowed_attention_mask = attention_mask != 0\n\n            num_extra_indices_per_batch = extra_attention_mask.long().sum(dim=1)\n            max_num_extra_indices_per_batch = num_extra_indices_per_batch.max()\n            if max_num_extra_indices_per_batch <= 0:\n                extra_attention_mask = None\n            else:\n                # To support the case of variable number of global attention in the rows of a batch,\n                # we use the following three selection masks to select global attention embeddings\n                # in a 3d tensor and pad it to `max_num_extra_indices_per_batch`\n                # 1) selecting embeddings that correspond to global attention\n                extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)\n                zero_to_max_range = torch.arange(\n                   
 0, max_num_extra_indices_per_batch, device=num_extra_indices_per_batch.device\n                )\n                # mask indicating which values are actually going to be padding\n                selection_padding_mask = zero_to_max_range < num_extra_indices_per_batch.unsqueeze(dim=-1)\n                # 2) location of the non-padding values in the selected global attention\n                selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)\n                # 3) location of the padding values in the selected global attention\n                selection_padding_mask_zeros = (selection_padding_mask == 0).nonzero(as_tuple=True)\n        else:\n            remove_from_windowed_attention_mask = None\n            extra_attention_mask = None\n            key_padding_mask = None\n\n        hidden_states = hidden_states.transpose(0, 1)\n        seqlen, batch_size, embed_dim = hidden_states.size()\n        assert embed_dim == self.embed_dim\n        q = self.query(hidden_states)\n        k = self.key(hidden_states)\n        v = self.value(hidden_states)\n        q /= math.sqrt(self.head_dim)\n\n        q = q.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        k = k.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        # attn_weights = (batch_size, seqlen, num_heads, window*2+1)\n        attn_weights = self._sliding_chunks_matmul_qk(q, k, self.one_sided_attention_window_size)\n        self._mask_invalid_locations(attn_weights, self.one_sided_attention_window_size)\n        if remove_from_windowed_attention_mask is not None:\n            # This implementation is fast and takes very little memory because num_heads x hidden_size = 1\n            # from (batch_size x seqlen) to (batch_size x seqlen x num_heads x hidden_size)\n            remove_from_windowed_attention_mask = remove_from_windowed_attention_mask.unsqueeze(dim=-1).unsqueeze(\n                dim=-1\n            )\n            # cast to fp32/fp16 then replace 1's with -inf\n            float_mask = remove_from_windowed_attention_mask.type_as(q).masked_fill(\n                remove_from_windowed_attention_mask, -10000.0\n            )\n            ones = float_mask.new_ones(size=float_mask.size())  # tensor of ones\n            # diagonal mask with zeros everywhere and -inf inplace of padding\n            d_mask = self._sliding_chunks_matmul_qk(ones, float_mask, self.one_sided_attention_window_size)\n            attn_weights += d_mask\n        assert list(attn_weights.size()) == [\n            batch_size,\n            seqlen,\n            self.num_heads,\n            self.one_sided_attention_window_size * 2 + 1,\n        ]\n\n        # the extra attention\n        if extra_attention_mask is not None:\n            selected_k = k.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_k[selection_padding_mask_nonzeros] = k[extra_attention_mask_nonzeros]\n            # (batch_size, seqlen, num_heads, max_num_extra_indices_per_batch)\n            selected_attn_weights = torch.einsum(\"blhd,bshd->blhs\", (q, selected_k))\n            selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000\n            # concat to attn_weights\n            # (batch_size, seqlen, num_heads, extra attention count + 2*window+1)\n            attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)\n\n        attn_weights_fp32 = F.softmax(attn_weights, dim=-1, 
dtype=torch.float32)  # use fp32 for numerical stability\n        attn_weights = attn_weights_fp32.type_as(attn_weights)\n\n        if key_padding_mask is not None:\n            # softmax sometimes inserts NaN if all positions are masked, replace them with 0\n            attn_weights = torch.masked_fill(attn_weights, key_padding_mask.unsqueeze(-1).unsqueeze(-1), 0.0)\n\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)\n        v = v.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        attn = None\n        if extra_attention_mask is not None:\n            selected_attn_probs = attn_probs.narrow(-1, 0, max_num_extra_indices_per_batch)\n            selected_v = v.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_v[selection_padding_mask_nonzeros] = v[extra_attention_mask_nonzeros]\n            # use `matmul` because `einsum` crashes sometimes with fp16\n            # attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v))\n            attn = torch.matmul(selected_attn_probs.transpose(1, 2), selected_v.transpose(1, 2)).transpose(1, 2)\n            attn_probs = attn_probs.narrow(\n                -1, max_num_extra_indices_per_batch, attn_probs.size(-1) - max_num_extra_indices_per_batch\n            ).contiguous()\n        if attn is None:\n            attn = self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n        else:\n            attn += self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n\n        assert attn.size() == (batch_size, seqlen, self.num_heads, self.head_dim), \"Unexpected size\"\n        attn = attn.transpose(0, 1).reshape(seqlen, batch_size, embed_dim).contiguous()\n\n        # For this case, we'll just recompute the attention for these indices\n        # and overwrite the attn tensor.\n        # TODO: remove the redundant computation\n        if extra_attention_mask is not None:\n            selected_hidden_states = hidden_states.new_zeros(max_num_extra_indices_per_batch, batch_size, embed_dim)\n            selected_hidden_states[selection_padding_mask_nonzeros[::-1]] = hidden_states[\n                extra_attention_mask_nonzeros[::-1]\n            ]\n\n            q = self.query_global(selected_hidden_states)\n            k = self.key_global(hidden_states)\n            v = self.value_global(hidden_states)\n            q /= math.sqrt(self.head_dim)\n\n            q = (\n                q.contiguous()\n                .view(max_num_extra_indices_per_batch, batch_size * self.num_heads, self.head_dim)\n                .transpose(0, 1)\n            )  # (batch_size * self.num_heads, max_num_extra_indices_per_batch, head_dim)\n            k = (\n                k.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            v = (\n                v.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            attn_weights = torch.bmm(q, k.transpose(1, 2))\n            assert list(attn_weights.size()) == [batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen]\n\n            attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights[selection_padding_mask_zeros[0], :, selection_padding_mask_zeros[1], :] 
= -10000.0\n            if key_padding_mask is not None:\n                attn_weights = attn_weights.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -10000.0,)\n            attn_weights = attn_weights.view(batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights_float = F.softmax(\n                attn_weights, dim=-1, dtype=torch.float32\n            )  # use fp32 for numerical stability\n            attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)\n            selected_attn = torch.bmm(attn_probs, v)\n            assert list(selected_attn.size()) == [\n                batch_size * self.num_heads,\n                max_num_extra_indices_per_batch,\n                self.head_dim,\n            ]\n\n            selected_attn_4d = selected_attn.view(\n                batch_size, self.num_heads, max_num_extra_indices_per_batch, self.head_dim\n            )\n            nonzero_selected_attn = selected_attn_4d[\n                selection_padding_mask_nonzeros[0], :, selection_padding_mask_nonzeros[1]\n            ]\n            attn[extra_attention_mask_nonzeros[::-1]] = nonzero_selected_attn.view(\n                len(selection_padding_mask_nonzeros[0]), -1\n            )\n\n        context_layer = attn.transpose(0, 1)\n        if self.output_attentions:\n            if extra_attention_mask is not None:\n                # With global attention, return global attention probabilities only\n                # batch_size x num_heads x max_num_global_attention_tokens x sequence_length\n                # which is the attention weights from tokens with global attention to all tokens\n                # It doesn't not return local attention\n                # In case of variable number of global attantion in the rows of a batch,\n                # attn_weights are padded with -10000.0 attention scores\n                attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            else:\n                # without global attention, return local attention probabilities\n                # batch_size x num_heads x sequence_length x window_size\n                # which is the attention weights of every token attending to its neighbours\n                attn_weights = attn_weights.permute(0, 2, 1, 3)\n        outputs = (context_layer, attn_weights) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nLONGFORMER_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.LongformerConfig`): Model configuration class with all the parameters of the\n            model. 
Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nLONGFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.LonmgformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n\n        global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to decide the attention given on each token, local attention or global attenion.\n            Tokens with global attention attends to all other tokens, and all other tokens attend to them. This is important for\n            task-specific finetuning because it makes the model more flexible at representing the task. For example,\n            for classification, the <s> token should be given global attention. For QA, all question tokens should also have\n            global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details.\n            Mask values selected in ``[0, 1]``:\n            ``0`` for local attention (a sliding window attention),\n            ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).\n\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Longformer Model outputting raw hidden-states without any specific head on top.\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel` to provide the ability to process\n    long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer`_by\n    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention combines a local (sliding window)\n    and global attention to extend to long documents without the O(n^2) increase in memory and compute.\n\n    The selfattention module `LongformerSelfAttention` implemented here supports the combination of local and\n    global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive\n    and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream\n    tasks. Future release will add support for autoregressive attention, but the support for dilated attention\n    requires a custom CUDA kernel to be memory and compute efficient.\n\n    .. _`Longformer: the Long-Document Transformer`:\n        https://arxiv.org/abs/2004.05150\n\n    \"\"\"\n\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if isinstance(config.attention_window, int):\n            assert config.attention_window % 2 == 0, \"`config.attention_window` has to be an even value\"\n            assert config.attention_window > 0, \"`config.attention_window` has to be positive\"\n            config.attention_window = [config.attention_window] * config.num_hidden_layers  # one value per layer\n        else:\n            assert len(config.attention_window) == config.num_hidden_layers, (\n                \"`len(config.attention_window)` should equal `config.num_hidden_layers`. \"\n                f\"Expected {config.num_hidden_layers}, given {len(config.attention_window)}\"\n            )\n\n        for i, layer in enumerate(self.encoder.layer):\n            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`\n            layer.attention.self = LongformerSelfAttention(config, layer_id=i)\n\n        self.init_weights()\n\n    def _pad_to_window_size(\n        self,\n        input_ids: torch.Tensor,\n        attention_mask: torch.Tensor,\n        token_type_ids: torch.Tensor,\n        position_ids: torch.Tensor,\n        inputs_embeds: torch.Tensor,\n        attention_window: int,\n        pad_token_id: int,\n    ):\n        \"\"\"A helper function to pad tokens and mask to work with implementation of Longformer selfattention.\"\"\"\n\n        assert attention_window % 2 == 0, f\"`attention_window` should be an even value. 
Given {attention_window}\"\n        input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape\n        batch_size, seqlen = input_shape[:2]\n\n        padding_len = (attention_window - seqlen % attention_window) % attention_window\n        if padding_len > 0:\n            logger.info(\n                \"Input ids are automatically padded from {} to {} to be a multiple of `config.attention_window`: {}\".format(\n                    seqlen, seqlen + padding_len, attention_window\n                )\n            )\n            if input_ids is not None:\n                input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)\n            if attention_mask is not None:\n                attention_mask = F.pad(\n                    attention_mask, (0, padding_len), value=False\n                )  # no attention on the padding tokens\n            if token_type_ids is not None:\n                token_type_ids = F.pad(token_type_ids, (0, padding_len), value=0)  # pad with token_type_id = 0\n            if position_ids is not None:\n                # pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings\n                position_ids = F.pad(position_ids, (0, padding_len), value=pad_token_id)\n            if inputs_embeds is not None:\n                input_ids_padding = inputs_embeds.new_full(\n                    (batch_size, padding_len), self.config.pad_token_id, dtype=torch.long,\n                )\n                inputs_embeds_padding = self.embeddings(input_ids_padding)\n                inputs_embeds = torch.cat([inputs_embeds, inputs_embeds_padding], dim=-2)\n\n        return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        
import torch\n        from transformers1 import LongformerModel, LongformerTokenizer\n\n        model = LongformerModel.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention\n        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention\n        attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,\n                                            # classification: the <s> token\n                                            # QA: question tokens\n                                            # LM: potentially on the beginning of sentences and paragraphs\n        sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)\n        \"\"\"\n\n        # padding\n        attention_window = (\n            self.config.attention_window\n            if isinstance(self.config.attention_window, int)\n            else max(self.config.attention_window)\n        )\n\n        # merge `global_attention_mask` and `attention_mask`\n        if global_attention_mask is not None:\n            # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)\n            # (global_attention_mask + 1) => 1 for local attention, 2 for global attention\n            # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention\n            if attention_mask is not None:\n                attention_mask = attention_mask * (global_attention_mask + 1)\n            else:\n                # simply use `global_attention_mask` as `attention_mask`\n                # if no `attention_mask` is given\n                attention_mask = global_attention_mask + 1\n\n        padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            attention_window=attention_window,\n            pad_token_id=self.config.pad_token_id,\n        )\n\n        # embed\n        output = super().forward(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=None,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n        )\n\n        # undo padding\n        if padding_len > 0:\n            # `output` has the following tensors: sequence_output, pooled_output, (hidden_states), (attentions)\n            # `sequence_output`: unpad because the calling function is expecting a length == input_ids.size(1)\n            # `pooled_output`: independent of the sequence length\n            # `hidden_states`: mainly used for debugging and analysis, so keep the padding\n            # `attentions`: mainly used for debugging and analysis, so keep the padding\n            output = output[0][:, :-padding_len], *output[1:]\n\n        return 
output\n\n\n@add_start_docstrings(\"\"\"Longformer Model with a `language modeling` head on top. \"\"\", LONGFORMER_START_DOCSTRING)\nclass LongformerForMaskedLM(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import LongformerForMaskedLM, LongformerTokenizer\n\n        model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! 
'] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        attention_mask = None  # default is local attention everywhere, which is a good choice for MaskedLM\n                               # check ``LongformerModel.forward`` for more details how to set `attention_mask`\n        loss, prediction_scores = model(input_ids, attention_mask=attention_mask, masked_lm_labels=input_ids)\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForSequenceClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.classifier = LongformerClassificationHead(config)\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of 
each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForSequenceClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on CLS token...\")\n            global_attention_mask = torch.zeros_like(input_ids)\n            # global attention on cls token\n            global_attention_mask[:, 0] = 1\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\nclass LongformerClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, hidden_states, **kwargs):\n        hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        hidden_states = torch.tanh(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        output = self.out_proj(hidden_states)\n        return output\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForQuestionAnswering(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForQuestionAnswering\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n        model = 
LongformerForQuestionAnswering.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text, return_tensors=\"pt\")\n        input_ids = encoding[\"input_ids\"]\n\n        # default is local attention everywhere\n        # the forward method will automatically set global attention on question tokens\n        attention_mask = encoding[\"attention_mask\"]\n\n        start_scores, end_scores = model(input_ids, attention_mask=attention_mask)\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())\n\n        answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]\n        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens)) # remove space prepending space token\n\n        \"\"\"\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on question tokens...\")\n            # put global attention on all tokens until `config.sep_token_id` is reached\n            global_attention_mask = _compute_global_attention_mask(input_ids, self.config.sep_token_id)\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForTokenClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForTokenClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForTokenClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = 
self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForMultipleChoice(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        labels=None,\n        position_ids=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForMultipleChoice\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForMultipleChoice.from_pretrained('allenai/longformer-base-4096')\n        # context = \"The dog is cute\" | choice = \"the dog\" / \"the cat\"\n        choices = [(\"The dog is cute\", \"the dog\"), (\"The dog is cute\", \"the cat\")]\n        input_ids = torch.tensor([tokenizer.encode(s[0], s[1], add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        # global attention is automatically put on \"the dog\" and \"the cat\"\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on multiple choice...\")\n            # put global attention on all tokens after `config.sep_token_id`\n            global_attention_mask = torch.stack(\n                [\n                    _compute_global_attention_mask(input_ids[:, i], self.config.sep_token_id, before_sep_token=False)\n                    for i in range(num_choices)\n                ],\n                dim=1,\n            )\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_global_attention_mask = (\n            global_attention_mask.view(-1, global_attention_mask.size(-1))\n            if global_attention_mask is not None\n            else None\n        )\n\n        outputs = self.longformer(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            global_attention_mask=flat_global_attention_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + 
outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 Marian Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MarianMTModel model, ported from the Marian C++ repo.\"\"\"\n\n\nfrom .modeling_bart import BartForConditionalGeneration\n\n\nMARIAN_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Marian models at https://huggingface.co/models?search=Helsinki-NLP\n]\n\n\nclass MarianMTModel(BartForConditionalGeneration):\n    r\"\"\"\n    Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.\n    Model API is identical to BartForConditionalGeneration.\n    Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__\n\n    Examples::\n\n        from transformers1 import MarianTokenizer, MarianMTModel\n        from typing import List\n        src = 'fr'  # source language\n        trg = 'en'  # target language\n        sample_text = \"où est l'arrêt de bus ?\"\n        mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'\n\n        model = MarianMTModel.from_pretrained(mname)\n        tok = MarianTokenizer.from_pretrained(mname)\n        batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference\n        gen = model.generate(**batch)  # for forward pass: model(**batch)\n        words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns \"Where is the the bus stop ?\"\n\n    \"\"\"\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        logits[:, self.config.pad_token_id] = float(\"-inf\")\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MMBT model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .file_utils import add_start_docstrings\nfrom .modeling_utils import ModuleUtilsMixin\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModalEmbeddings(nn.Module):\n    \"\"\"Generic Modal Embeddings which takes in an encoder, and a transformer embedding.\n    \"\"\"\n\n    def __init__(self, config, encoder, embeddings):\n        super().__init__()\n        self.config = config\n        self.encoder = encoder\n        self.proj_embeddings = nn.Linear(config.modal_hidden_size, config.hidden_size)\n        self.position_embeddings = embeddings.position_embeddings\n        self.token_type_embeddings = embeddings.token_type_embeddings\n        self.word_embeddings = embeddings.word_embeddings\n        self.LayerNorm = embeddings.LayerNorm\n        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)\n\n    def forward(self, input_modal, start_token=None, end_token=None, position_ids=None, token_type_ids=None):\n        token_embeddings = self.proj_embeddings(self.encoder(input_modal))\n        seq_length = token_embeddings.size(1)\n\n        if start_token is not None:\n            start_token_embeds = self.word_embeddings(start_token)\n            seq_length += 1\n            token_embeddings = torch.cat([start_token_embeds.unsqueeze(1), token_embeddings], dim=1)\n\n        if end_token is not None:\n            end_token_embeds = self.word_embeddings(end_token)\n            seq_length += 1\n            token_embeddings = torch.cat([token_embeddings, end_token_embeds.unsqueeze(1)], dim=1)\n\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_modal.device)\n            position_ids = position_ids.unsqueeze(0).expand(input_modal.size(0), seq_length)\n\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(\n                (input_modal.size(0), seq_length), dtype=torch.long, device=input_modal.device\n            )\n\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = token_embeddings + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nMMBT_START_DOCSTRING = r\"\"\"    MMBT model was proposed in\n    `Supervised Multimodal Bitransformers for Classifying Images and Text`_\n    by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.\n    It's a supervised multimodal bitransformer model that fuses information from text and other image encoders,\n    and obtain state-of-the-art performance on various multimodal classification benchmark 
tasks.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Supervised Multimodal Bitransformers for Classifying Images and Text`:\n        https://github.com/facebookresearch/mmbt\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.MMBTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n        transformer (:class: `~nn.Module`): A text transformer that is used by MMBT.\n            It should have embeddings, encoder, and pooler attributes.\n        encoder (:class: `~nn.Module`): Encoder for the second modality.\n            It should take in a batch of modal inputs and return k, n dimension embeddings.\n\"\"\"\n\nMMBT_INPUTS_DOCSTRING = r\"\"\"    Inputs:\n        **input_modal**: ``torch.FloatTensor`` of shape ``(batch_size, ***)``:\n            The other modality data. It will be the shape that the encoder for that type expects.\n            e.g. With an Image Encoder, the shape would be (batch_size, channels, height, width)\n        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of input sequence tokens in the vocabulary.\n            It does not expect [CLS] token to be added as it's appended to the end of other modality embeddings.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        **modal_start_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional start token to be added to Other Modality Embedding. [CLS] Most commonly used for Classification tasks.\n        **modal_end_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional end token to be added to Other Modality Embedding. 
[SEP] Most commonly used.\n        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Segment token indices to indicate different portions of the inputs.\n        **modal_token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Segment token indices to indicate different portions of the non-text modality.\n            The embeddings from these tokens will be summed with the respective token embeddings for the non-text modality.\n        **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings.\n        **modal_position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings for the non-text modality.\n        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        **inputs_embeds**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, embedding_dim)``:\n            Optionally, instead of passing ``input_ids`` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        **encoder_hidden_states**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``:\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model\n            is configured as a decoder.\n        **encoder_attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on the padding token indices of the encoder input. 
This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare MMBT Model outputting raw hidden-states without any specific head on top.\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\"\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``\n                Sequence of hidden-states at the output of the last layer of the model.\n            **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Bert pretraining. This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            mmbt = MMBTModel(config, transformer, encoder)\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.config = config\n        self.transformer = transformer\n        self.modal_encoder = ModalEmbeddings(config, encoder, transformer.embeddings)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_txt_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_txt_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        modal_embeddings = self.modal_encoder(\n            input_modal,\n            start_token=modal_start_tokens,\n            end_token=modal_end_tokens,\n            position_ids=modal_position_ids,\n            token_type_ids=modal_token_type_ids,\n        )\n\n        input_modal_shape = modal_embeddings.size()[:-1]\n\n        if token_type_ids is None:\n            token_type_ids = torch.ones(input_txt_shape, dtype=torch.long, device=device)\n\n        txt_embeddings = self.transformer.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        embedding_output = torch.cat([modal_embeddings, txt_embeddings], 1)\n\n        input_shape = embedding_output.size()[:-1]\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        else:\n            attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device, dtype=torch.long), attention_mask], dim=1\n            )\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(input_shape, device=device)\n        else:\n            encoder_attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device), encoder_attention_mask], dim=1\n            )\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, self.device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        encoder_outputs = self.transformer.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.transformer.pooler(sequence_output)\n\n        outputs = (sequence_output, 
pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\n    \"\"\"MMBT Model with a sequence classification/regression head on top (a linear layer on top of\n                      the pooled output)\"\"\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTForClassification(nn.Module):\n    r\"\"\"\n            **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in ``[0, ..., config.num_labels - 1]``.\n                If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n                If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n                Classification (or regression if config.num_labels==1) loss.\n            **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            model = MMBTForClassification(config, transformer, encoder)\n            outputs = model(input_modal, input_ids, labels=labels)\n            loss, logits = outputs[:2]\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.num_labels = config.num_labels\n\n        self.mmbt = MMBTModel(config, transformer, encoder)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n\n        outputs = self.mmbt(\n            input_modal=input_modal,\n            input_ids=input_ids,\n            modal_start_tokens=modal_start_tokens,\n            modal_end_tokens=modal_end_tokens,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            modal_token_type_ids=modal_token_type_ids,\n            position_ids=position_ids,\n            modal_position_ids=modal_position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT model.\"\"\"\n\n\nimport json\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu_new, swish\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path):\n    \"\"\" Load tf pre-trained weights in a pytorch model (from NumPy arrays here)\n    \"\"\"\n    import re\n    import numpy as np\n\n    if \".ckpt\" in openai_checkpoint_folder_path:\n        openai_checkpoint_folder_path = os.path.dirname(openai_checkpoint_folder_path)\n\n    logger.info(\"Loading weights from {}\".format(openai_checkpoint_folder_path))\n\n    with open(openai_checkpoint_folder_path + \"/parameters_names.json\", \"r\", encoding=\"utf-8\") as names_handle:\n        names = json.load(names_handle)\n    with open(openai_checkpoint_folder_path + \"/params_shapes.json\", \"r\", encoding=\"utf-8\") as shapes_handle:\n        shapes = json.load(shapes_handle)\n    offsets = np.cumsum([np.prod(shape) for shape in shapes])\n    init_params = [np.load(openai_checkpoint_folder_path + \"/params_{}.npy\".format(n)) for n in range(10)]\n    init_params = np.split(np.concatenate(init_params, 0), offsets)[:-1]\n    init_params = [param.reshape(shape) for param, shape in zip(init_params, shapes)]\n\n    # This was used when we had a single embedding matrix for positions and tokens\n    # init_params[0] = np.concatenate([init_params[1], init_params[0]], 0)\n    # model init_params[1]\n    init_params = [arr.squeeze() for arr in init_params]\n\n    try:\n        assert model.tokens_embed.weight.shape == init_params[1].shape\n        assert model.positions_embed.weight.shape == init_params[0].shape\n    except AssertionError as e:\n        e.args += (model.tokens_embed.weight.shape, init_params[1].shape)\n        e.args += (model.positions_embed.weight.shape, init_params[0].shape)\n        raise\n\n    model.tokens_embed.weight.data = torch.from_numpy(init_params[1])\n    model.positions_embed.weight.data = torch.from_numpy(init_params[0])\n    names.pop(0)\n    # Pop position and token embedding arrays\n    init_params.pop(0)\n    init_params.pop(0)\n\n    for name, array in zip(names, init_params):  # names[1:n_transfer], init_params[1:n_transfer]):\n        name = name[6:]  # skip \"model/\"\n        assert name[-2:] == \":0\"\n        name = name[:-2]\n        
name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"w\":\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nACT_FNS = {\"relu\": nn.ReLU, \"swish\": swish, \"gelu\": gelu_new}\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\"bias\", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.output_attentions = config.output_attentions\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / math.sqrt(v.size(-1))\n        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implem method: mask_attn_weights\n        # XD: self.b may be larger than w, so we need to crop it\n        b = self.bias[:, :, : w.size(-2), : w.size(-1)]\n        w = w * b + -1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the 
attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)\n        else:\n            return x.permute(0, 2, 1, 3)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT_FNS[config.afn]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        attn_outputs = self.attn(x, attention_mask=attention_mask, head_mask=head_mask)\n        a = attn_outputs[0]\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n)\n        h = self.ln_2(n + m)\n\n        outputs = [h] + attn_outputs[1:]\n        return outputs\n\n\nclass OpenAIGPTPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    load_tf_weights = load_tf_weights_in_openai_gpt\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif 
isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.OpenAIGPTTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputting raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd)\n        self.positions_embed = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def set_input_embeddings(self, new_embeddings):\n        self.tokens_embed = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the 
self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            # Code is different from when we had a single embedding matrice from position and token embeddings\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(input_shape[-1], dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids)\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))\n            token_type_embeds = self.tokens_embed(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        all_attentions = ()\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(hidden_states.view(*output_shape),)\n\n            outputs = block(hidden_states, attention_mask, head_mask[i])\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n        outputs = (hidden_states.view(*output_shape),)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = transformer_outputs[0]\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        config.num_labels = 1\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. 
(see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTDoubleHeadsModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})  # Add a [CLS] to the vocabulary (we should train it also!)\n        model.resize_token_embeddings(len(tokenizer))\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        mc_token_ids = torch.tensor([input_ids.size(-1)-1, input_ids.size(-1)-1]).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = 
transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch REFORMER model. \"\"\"\n\nimport logging\nimport sys\nfrom collections import namedtuple\nfrom functools import reduce\nfrom operator import mul\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.autograd.function import Function\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu, gelu_fast, gelu_new, swish\nfrom .configuration_reformer import ReformerConfig\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, apply_chunking_to_forward\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/reformer-crime-and-punishment\",\n    \"google/reformer-enwik8\",\n    # See all Reformer models at https://huggingface.co/models?filter=reformer\n]\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\n    \"gelu\": gelu,\n    \"relu\": torch.nn.functional.relu,\n    \"swish\": swish,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n    \"mish\": mish,\n}\n\n\n# Define named tuples for nn.Modules here\nLSHSelfAttentionOutput = namedtuple(\"LSHSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nLocalSelfAttentionOutput = namedtuple(\"LocalSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\"])\nAttentionOutput = namedtuple(\"AttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nReformerOutput = namedtuple(\"ReformerOutput\", [\"hidden_states\", \"attn_output\", \"attention_probs\", \"buckets\"])\nReformerBackwardOutput = namedtuple(\n    \"ReformerBackwardOutput\", [\"attn_output\", \"hidden_states\", \"grad_attn_output\", \"grad_hidden_states\"]\n)\nReformerEncoderOutput = namedtuple(\"ReformerEncoderOutput\", [\"hidden_states\", \"all_hidden_states\", \"all_attentions\"])\n\n\ndef _get_least_common_mult_chunk_len(config):\n    attn_types = config.attn_layers\n    attn_types_set = set(attn_types)\n    if len(attn_types_set) == 1 and attn_types[0] == \"lsh\":\n        return config.lsh_attn_chunk_length\n    elif len(attn_types_set) == 1 and attn_types[0] == \"local\":\n        return config.local_attn_chunk_length\n    elif len(attn_types_set) == 2 and attn_types_set == set([\"lsh\", \"local\"]):\n        return np.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length)\n    else:\n        raise NotImplementedError(\n            \"Only attn layer types 'lsh' and 'local' exist, but `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                config.attn_layers\n            )\n        )\n\n\nclass AxialPositionEmbeddings(nn.Module):\n    \"\"\"Constructs axial position embeddings. 
Useful for very long input\n    sequences to save memory and time.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.axial_pos_shape = config.axial_pos_shape\n        self.axial_pos_embds_dim = config.axial_pos_embds_dim\n        self.dropout = config.hidden_dropout_prob\n\n        self.least_common_mult_chunk_length = _get_least_common_mult_chunk_len(config)\n        self.weights = nn.ParameterList()\n\n        assert (\n            sum(self.axial_pos_embds_dim) == config.hidden_size\n        ), \"Make sure that config.axial_pos_embds factors: {} sum to config.hidden_size: {}\".format(\n            self.axial_pos_embds_dim, config.hidden_size\n        )\n\n        # create weights\n        for axis, axial_pos_embd_dim in enumerate(self.axial_pos_embds_dim):\n            # create expanded shapes\n            ax_shape = [1] * len(self.axial_pos_shape)\n            ax_shape[axis] = self.axial_pos_shape[axis]\n            ax_shape = tuple(ax_shape) + (axial_pos_embd_dim,)\n\n            # create tensor and init\n            self.weights.append(nn.Parameter(torch.ones(ax_shape, dtype=torch.float32)))\n\n    def forward(self, position_ids):\n        # broadcast weights to correct shape\n        batch_size = position_ids.shape[0]\n        sequence_length = position_ids.shape[1]\n\n        broadcasted_weights = [\n            weight.expand((batch_size,) + self.axial_pos_shape + weight.shape[-1:]) for weight in self.weights\n        ]\n\n        if self.training is True:\n            assert (\n                reduce(mul, self.axial_pos_shape) == sequence_length\n            ), \"If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.\".format(\n                self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)\n            )\n            if self.dropout > 0:\n                weights = torch.cat(broadcasted_weights, dim=-1)\n                # permute weights so that 2D correctly drops dims 1 and 2\n                transposed_weights = weights.transpose(2, 1)\n                # drop entire matrix of last two dims (prev dims 1 and 2)\n                dropped_transposed_weights = nn.functional.dropout2d(\n                    transposed_weights, p=self.dropout, training=self.training\n                )\n                dropped_weights = dropped_transposed_weights.transpose(2, 1)\n\n                position_encodings = torch.reshape(dropped_weights, (batch_size, sequence_length, -1))\n\n            else:\n                position_encodings = torch.cat(\n                    [torch.reshape(weight, (batch_size, sequence_length, -1)) for weight in broadcasted_weights],\n                    dim=-1,\n                )\n\n        else:\n            assert (\n                reduce(mul, self.axial_pos_shape) >= sequence_length\n            ), \"Make sure that config.axial_pos_shape factors: {} multiply at least to max(sequence_length, least_common_mult_chunk_length): max({}, {})\".format(\n                self.axial_pos_shape, sequence_length, self.least_common_mult_chunk_length,\n            )\n\n            # reshape axial encodings and use only until sequence_length\n            position_encodings = torch.cat(broadcasted_weights, dim=-1)\n            position_encodings = position_encodings.view(batch_size, -1, position_encodings.shape[-1])[\n                
:, :sequence_length\n            ]\n\n        return position_encodings\n\n\nclass PositionEmbeddings(nn.Module):\n    \"\"\"Constructs conventional position embeddings of shape `[max_pos_embeddings, hidden_size]`.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n        self.embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n\n    def forward(self, position_ids):\n        position_embeddings = self.embedding(position_ids)\n        position_embeddings = nn.functional.dropout(position_embeddings, p=self.dropout, training=self.training)\n        return position_embeddings\n\n\nclass ReformerEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.max_position_embeddings = config.max_position_embeddings\n        self.dropout = config.hidden_dropout_prob\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)\n        self.position_embeddings = (\n            AxialPositionEmbeddings(config) if config.axial_pos_embds else PositionEmbeddings(config)\n        )\n\n    def forward(self, input_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n            device = input_ids.device\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n            device = inputs_embeds.device\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n\n        assert (\n            position_ids.shape[-1] <= self.max_position_embeddings\n        ), \"Sequence Length: {} has to be larger equal than config.max_position_embeddings: {}\".format(\n            position_ids.shape[-1], self.max_position_embeddings\n        )\n\n        # dropout\n        embeddings = nn.functional.dropout(inputs_embeds, p=self.dropout, training=self.training)\n\n        # add positional embeddings\n        position_embeddings = self.position_embeddings(position_ids)\n        embeddings = embeddings + position_embeddings\n        return embeddings\n\n\nclass EfficientAttentionMixin:\n    \"\"\"\n    A few utilities for nn.Modules in Reformer, to be used as a mixin.\n    \"\"\"\n\n    def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after):\n        \"\"\" Used to implement attention between consecutive chunks.\n\n            Args:\n                vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...]\n                num_chunks_before: chunks before current chunk to include in attention\n                num_chunks_after: chunks after current chunk to include in attention\n\n            Returns:\n                tensor of shape [num_chunks, N * chunk_length, ...], where\n                N = (1 + num_chunks_before + num_chunks_after).\n        \"\"\"\n        if num_chunks_before == 0 and num_chunks_after == 0:\n            return vectors\n\n        slices = []\n        for i in range(-num_chunks_before, num_chunks_after + 1):\n            if i == 0:\n                slices.append(vectors)\n            else:\n                slices.append(torch.cat([vectors[:, :, i:, ...], 
vectors[:, :, :i, ...]], dim=2))\n        return torch.cat(slices, dim=3)\n\n    def _split_hidden_size_dim(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            splits hidden_size dim into attn_head_size and num_attn_heads\n        \"\"\"\n        new_x_shape = x.size()[:-1] + (num_attn_heads, attn_head_size)\n        x = x.view(*new_x_shape)\n        return x.transpose(2, 1)\n\n    def _merge_hidden_size_dims(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            merges attn_head_size dim and num_attn_heads dim into hidden_size\n        \"\"\"\n        x = x.permute(0, 2, 1, 3)\n        return torch.reshape(x, (x.size()[0], -1, num_attn_heads * attn_head_size))\n\n    def _split_seq_length_dim_to(self, vectors, dim_factor_1, dim_factor_2, num_attn_heads, attn_head_size=None):\n        \"\"\"\n            splits sequence length dim of vectors into `dim_factor_1` and `dim_factor_2` dims\n        \"\"\"\n        batch_size = vectors.shape[0]\n        split_dim_shape = (batch_size, num_attn_heads, dim_factor_1, dim_factor_2)\n\n        if len(vectors.shape) == 4:\n            return torch.reshape(vectors, split_dim_shape + (attn_head_size,))\n        elif len(vectors.shape) == 3:\n            return torch.reshape(vectors, split_dim_shape)\n        else:\n            raise ValueError(\"Input vector rank should be one of [3, 4], but is: {}\".format(len(vectors.shape)))\n\n\nclass LSHSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n\n        self.chunk_length = config.lsh_attn_chunk_length\n        self.num_hashes = config.num_hashes\n        self.num_buckets = config.num_buckets\n        self.num_chunks_before = config.lsh_num_chunks_before\n        self.num_chunks_after = config.lsh_num_chunks_after\n        self.hash_seed = config.hash_seed\n        self.is_decoder = config.is_decoder\n        self.max_position_embeddings = config.max_position_embeddings\n\n        self.dropout = config.lsh_attention_probs_dropout_prob\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        # projection matrices\n        self.query_key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        # save mask value here. 
Need fp32 and fp16 mask values\n        self.register_buffer(\"self_mask_value_float16\", torch.tensor(-1e3))\n        self.register_buffer(\"self_mask_value_float32\", torch.tensor(-1e5))\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n        **kwargs\n    ):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # num hashes can optionally be overwritten by user\n        num_hashes = num_hashes if num_hashes is not None else self.num_hashes\n\n        # project hidden_states to query_key and value\n        query_key_vectors = self.query_key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # free memory\n        del hidden_states\n\n        query_key_vectors = self._split_hidden_size_dim(\n            query_key_vectors, self.num_attention_heads, self.attention_head_size\n        )\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of value_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        # set `num_buckets` on the fly, recommended way to do it\n        if self.num_buckets is None:\n            self._set_num_buckets(sequence_length)\n\n        # use cached buckets for backprop only\n        if buckets is None:\n            # hash query key vectors into buckets\n            buckets = self._hash_vectors(query_key_vectors, num_hashes)\n\n        assert (\n            int(buckets.shape[-1]) == num_hashes * sequence_length\n        ), \"last dim of buckets is {}, but should be {}\".format(buckets.shape[-1], num_hashes * sequence_length)\n\n        sorted_bucket_idx, undo_sorted_bucket_idx = self._get_sorted_bucket_idx_and_undo_sorted_bucket_idx(\n            sequence_length, buckets, num_hashes\n        )\n\n        # make sure bucket idx is not longer then sequence length\n        sorted_bucket_idx = sorted_bucket_idx % sequence_length\n\n        # cluster query key value vectors according to hashed buckets\n        query_key_vectors = self._gather_by_expansion(query_key_vectors, sorted_bucket_idx, num_hashes)\n        value_vectors = self._gather_by_expansion(value_vectors, sorted_bucket_idx, num_hashes)\n\n        query_key_vectors = self._split_seq_length_dim_to(\n            query_key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # 
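shared-QK attention: the keys are not a separate projection but the\n        # length-normalized query_key vectors (see _len_and_dim_norm)\n        # 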
scale key vectors\n        key_vectors = self._len_and_dim_norm(query_key_vectors)\n\n        # get attention probs\n        out_vectors, logits, attention_probs = self._attend(\n            query_vectors=query_key_vectors,\n            key_vectors=key_vectors,\n            value_vectors=value_vectors,\n            sorted_bucket_idx=sorted_bucket_idx,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n        )\n        # free memory\n        del query_key_vectors, key_vectors, value_vectors\n\n        # sort clusters back to correct ordering\n        out_vectors, logits = ReverseSort.apply(\n            out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, self.num_hashes\n        )\n\n        # sum up all hash rounds\n        if num_hashes > 1:\n            out_vectors = self._split_seq_length_dim_to(\n                out_vectors, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            )\n            logits = self._split_seq_length_dim_to(\n                logits, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            ).unsqueeze(-1)\n\n            probs_vectors = torch.exp(logits - torch.logsumexp(logits, dim=2, keepdim=True))\n            out_vectors = torch.sum(out_vectors * probs_vectors, dim=2)\n            # free memory\n            del probs_vectors\n\n        # free memory\n        del logits\n\n        assert out_vectors.shape == (\n            batch_size,\n            self.num_attention_heads,\n            sequence_length,\n            self.attention_head_size,\n        ), \"out_vectors have be of shape `[batch_size, config.num_attention_heads, sequence_length, config.attention_head_size]`.\"\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LSHSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs, buckets=buckets)\n\n    def _hash_vectors(self, vectors, num_hashes):\n        batch_size = vectors.shape[0]\n\n        # See https://arxiv.org/pdf/1509.02897.pdf\n        # We sample a different random rotation for each round of hashing to\n        # decrease the probability of hash misses.\n        if isinstance(self.num_buckets, int):\n            assert (\n                self.num_buckets % 2 == 0\n            ), \"There should be an even number of bucktes, but `self.num_bucktes`: {}\".format(self.num_buckets)\n            rotation_size = self.num_buckets\n            num_buckets = self.num_buckets\n        else:\n            # Factorize the hash if self.num_buckets is a list or tuple\n            rotation_size, num_buckets = 0, 1\n            for bucket_factor in self.num_buckets:\n                assert bucket_factor % 2 == 0, \"The number of buckets should be even, but `num_bucket`: {}\".format(\n                    bucket_factor\n                )\n                rotation_size = rotation_size + bucket_factor\n                num_buckets = num_buckets * bucket_factor\n\n        # remove gradient\n        vectors = vectors.detach()\n\n        if self.hash_seed is not None:\n            # for determinism\n            torch.manual_seed(self.hash_seed)\n\n        rotations_shape = (self.num_attention_heads, vectors.shape[-1], num_hashes, rotation_size // 2)\n        # create a random self.attention_head_size x num_hashes x num_buckets/2\n        random_rotations = 
torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)\n\n        # Output dim: Batch_Size x Num_Attn_Heads x Num_Hashes x Seq_Len x Num_Buckets/2\n        rotated_vectors = torch.einsum(\"bmtd,mdhr->bmhtr\", vectors, random_rotations)\n\n        if isinstance(self.num_buckets, int) or len(self.num_buckets) == 1:\n            rotated_vectors = torch.cat([rotated_vectors, -rotated_vectors], dim=-1)\n            buckets = torch.argmax(rotated_vectors, dim=-1)\n        else:\n            # Get the buckets for them and combine.\n            buckets, cur_sum, cur_product = None, 0, 1\n            for bucket_factor in self.num_buckets:\n                rotated_vectors_factor = rotated_vectors[..., cur_sum : cur_sum + (bucket_factor // 2)]\n                cur_sum = cur_sum + bucket_factor // 2\n                rotated_vectors_factor = torch.cat([rotated_vectors_factor, -rotated_vectors_factor], dim=-1)\n\n                if buckets is None:\n                    buckets = torch.argmax(rotated_vectors_factor, dim=-1)\n                else:\n                    buckets = buckets + (cur_product * torch.argmax(rotated_vectors_factor, dim=-1))\n\n                cur_product = cur_product * bucket_factor\n\n        # buckets is now (Batch_size x Num_Attn_Heads x Num_Hashes x Seq_Len).\n        # Next we add offsets so that bucket numbers from different hashing rounds don't overlap.\n        offsets = torch.arange(num_hashes, device=vectors.device)\n        offsets = (offsets * num_buckets).view((1, 1, -1, 1))\n\n        # expand to batch size and num attention heads\n        offsets = offsets.expand((batch_size, self.num_attention_heads) + offsets.shape[-2:])\n        offset_buckets = (buckets + offsets).flatten(start_dim=2, end_dim=3)\n\n        return offset_buckets\n\n    def _get_sorted_bucket_idx_and_undo_sorted_bucket_idx(self, sequence_length, buckets, num_hashes):\n        # no gradients are needed\n        with torch.no_grad():\n            batch_size = buckets.shape[0]\n\n            # arange and expand\n            orig_indices = torch.arange(num_hashes * sequence_length, device=buckets.device).view(1, 1, -1)\n            orig_indices = orig_indices.expand(batch_size, self.num_attention_heads, orig_indices.shape[-1])\n\n            # scale buckets\n            scaled_buckets = sequence_length * buckets + (orig_indices % sequence_length)\n\n            # remove gradient\n            scaled_buckets = scaled_buckets.detach()\n\n            # Hash-based sort\n            sorted_bucket_idx = torch.argsort(scaled_buckets, dim=-1)\n\n            # create simple indices to scatter to, to have undo sort\n            indices = (\n                torch.arange(sorted_bucket_idx.shape[-1], device=buckets.device)\n                .view(1, 1, -1)\n                .expand(sorted_bucket_idx.shape)\n            )\n\n            # get undo sort\n            undo_sorted_bucket_idx = sorted_bucket_idx.new(*sorted_bucket_idx.size())\n            undo_sorted_bucket_idx.scatter_(-1, sorted_bucket_idx, indices)\n\n        return sorted_bucket_idx, undo_sorted_bucket_idx\n\n    def _set_num_buckets(self, sequence_length):\n        # `num_buckets` should be set to 2 * sequence_length // chunk_length as recommended in paper\n        num_buckets_pow_2 = (2 * (sequence_length // self.chunk_length)).bit_length() - 1\n        # make sure buckets are power of 2\n        num_buckets = 2 ** num_buckets_pow_2\n\n        # factorize `num_buckets` if `num_buckets` becomes too large\n        num_buckets_limit = 
2 * max(\n            int((self.max_position_embeddings // self.chunk_length) ** (0.5)), self.chunk_length,\n        )\n        if num_buckets > num_buckets_limit:\n            num_buckets = [2 ** (num_buckets_pow_2 // 2), 2 ** (num_buckets_pow_2 - num_buckets_pow_2 // 2)]\n\n        logger.warning(\"config.num_buckets is not set. Setting config.num_buckets to {}...\".format(num_buckets))\n\n        # set num buckets in config to be properly saved\n        self.config.num_buckets = num_buckets\n        self.num_buckets = num_buckets\n\n    def _attend(\n        self, query_vectors, key_vectors, value_vectors, sorted_bucket_idx, attention_mask, head_mask,\n    ):\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n\n        # get logits and dots\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        query_bucket_idx = self._split_seq_length_dim_to(\n            sorted_bucket_idx, -1, self.chunk_length, self.num_attention_heads\n        )\n        key_value_bucket_idx = self._look_adjacent(query_bucket_idx, self.num_chunks_before, self.num_chunks_after)\n\n        # get correct mask values depending on precision\n        if query_key_dots.dtype == torch.float16:\n            self_mask_value = self.self_mask_value_float16.half()\n            mask_value = self.mask_value_float16.half()\n        else:\n            self_mask_value = self.self_mask_value_float32\n            mask_value = self.mask_value_float32\n\n        mask = self._compute_attn_mask(query_bucket_idx, key_value_bucket_idx, attention_mask)\n\n        if mask is not None:\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # Self mask is ALWAYS applied.\n        # From the reformer paper (https://arxiv.org/pdf/2001.04451.pdf):\n        # \" While attention to the future is not allowed, typical implementations of the\n        # Transformer do allow a position to attend to itself.\n        # Such behavior is undesirable in a shared-QK formulation because the dot-product\n        # of a query vector with itself will almost always be greater than the dot product of a\n        # query vector with a vector at another position. We therefore modify the masking\n        # to forbid a token from attending to itself, except in situations\n        # where a token has no other valid attention targets (e.g. 
the first token in a sequence) \"\n\n        self_mask = torch.ne(query_bucket_idx.unsqueeze(-1), key_value_bucket_idx.unsqueeze(-2)).to(\n            query_bucket_idx.device\n        )\n\n        # apply self_mask\n        query_key_dots = torch.where(self_mask, query_key_dots, self_mask_value)\n\n        # free memory\n        del self_mask\n\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        # dots shape is `[batch_size, num_attn_heads, num_hashes * seq_len // chunk_length, chunk_length, chunk_length * (1 + num_chunks_before + num_chunks_after)]`\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del query_key_dots\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        logits = logits.flatten(start_dim=2, end_dim=3).squeeze(-1)\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        return out_vectors, logits, attention_probs\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask):\n        mask = None\n\n        # Causal mask\n        if self.is_decoder:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask: chunk, look up correct mask value from key_value_bucket_idx\n        # IMPORTANT: official trax code does not use a mask for LSH Atttention. Not sure why.\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, None, :]\n            # expand attn_mask to fit with key_value_bucket_idx shape\n            attention_mask = attention_mask.expand(query_indices.shape[:-1] + (-1,))\n            key_attn_mask = torch.gather(attention_mask, -1, key_indices)\n            query_attn_mask = torch.gather(attention_mask, -1, query_indices)\n            # expand to query_key_dots shape: duplicate along query axis since key sorting is the same for each query position in chunk\n            attn_mask = query_attn_mask.unsqueeze(-1) * key_attn_mask.unsqueeze(-2)\n            # free memory\n            del query_attn_mask, key_attn_mask, attention_mask\n\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n\n        return mask\n\n    def _len_and_dim_norm(self, vectors):\n        \"\"\"\n            length and attention head size dim normalization\n        \"\"\"\n        vectors = self._len_norm(vectors)\n        vectors = vectors * torch.rsqrt(\n            torch.tensor(self.attention_head_size, device=vectors.device, dtype=vectors.dtype)\n        )\n        return vectors\n\n    def _len_norm(self, x, epsilon=1e-6):\n        \"\"\"\n            length normalization\n        \"\"\"\n        variance = torch.mean(x ** 2, -1, keepdim=True)\n        norm_x = x * torch.rsqrt(variance + epsilon)\n        return norm_x\n\n    def _gather_by_expansion(self, vectors, idxs, num_hashes):\n        \"\"\"\n            expand dims of idxs and vectors for all hashes and gather\n        \"\"\"\n        expanded_idxs = idxs.unsqueeze(-1).expand(-1, 
-1, -1, self.attention_head_size)\n        vectors = vectors.repeat(1, 1, num_hashes, 1)\n        return torch.gather(vectors, 2, expanded_idxs)\n\n\nclass ReverseSort(Function):\n    \"\"\"\n        After chunked attention is applied which sorted clusters,\n        original ordering has to be restored.\n        Since customized backward function is used for Reformer,\n        the gradients of the output vectors have to be explicitely\n        sorted here.\n    \"\"\"\n\n    @staticmethod\n    def forward(ctx, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, num_hashes):\n        # save sorted_bucket_idx for backprop\n        with torch.no_grad():\n            ctx.sorted_bucket_idx = sorted_bucket_idx\n            ctx.num_hashes = num_hashes\n\n            # undo sort to have correct order for next layer\n            expanded_undo_sort_indices = undo_sorted_bucket_idx.unsqueeze(-1).expand(out_vectors.shape)\n            out_vectors = torch.gather(out_vectors, 2, expanded_undo_sort_indices)\n            logits = torch.gather(logits, 2, undo_sorted_bucket_idx)\n        return out_vectors, logits\n\n    @staticmethod\n    def backward(ctx, grad_out_vectors, grad_logits):\n        # get parameters saved in ctx\n        sorted_bucket_idx = ctx.sorted_bucket_idx\n        num_hashes = ctx.num_hashes\n\n        # get real gradient shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes\n        grad_logits_shape = grad_logits.shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes x ChunkLen\n        grad_out_vectors_shape = grad_out_vectors.shape\n\n        # split gradient vectors and sorted bucket idxs by concatenated chunk dimension to gather correct indices\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen\n        grad_logits = grad_logits.view((grad_logits_shape[:2] + (num_hashes, -1)))\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen x ChunkLen\n        grad_out_vectors = grad_out_vectors.view(\n            (grad_out_vectors_shape[:2] + (num_hashes, -1) + grad_out_vectors_shape[-1:])\n        )\n\n        # reshape and expand\n        sorted_bucket_idx = torch.reshape(sorted_bucket_idx, (sorted_bucket_idx.shape[:2] + (num_hashes, -1)))\n        expanded_sort_indices = sorted_bucket_idx.unsqueeze(-1).expand(grad_out_vectors.shape)\n        # reverse sort of forward\n        grad_out_vectors = torch.gather(grad_out_vectors, 3, expanded_sort_indices)\n        grad_logits = torch.gather(grad_logits, 3, sorted_bucket_idx)\n\n        # reshape into correct shape\n        grad_logits = torch.reshape(grad_logits, grad_logits_shape)\n        grad_out_vectors = torch.reshape(grad_out_vectors, grad_out_vectors_shape)\n\n        # return grad and `None` fillers for last 3 forward args\n        return grad_out_vectors, grad_logits, None, None, None\n\n\nclass LocalSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n\n        self.num_attention_heads = config.num_attention_heads\n        self.chunk_length = config.local_attn_chunk_length\n        self.num_chunks_before = config.local_num_chunks_before\n        self.num_chunks_after = config.local_num_chunks_after\n        self.is_decoder = config.is_decoder\n        self.pad_token_id = config.pad_token_id\n\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        
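# unlike LSHSelfAttention, local attention keeps separate query and key projections\n        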
# projection matrices\n        self.query = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        self.dropout = config.local_attention_probs_dropout_prob\n\n        # save mask value here\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None, do_output_attentions=False, **kwargs):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # project hidden_states to query, key and value\n        query_vectors = self.query(hidden_states)\n        key_vectors = self.key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # split last dim into `config.num_attention_heads` and `config.attention_head_size`\n        query_vectors = self._split_hidden_size_dim(query_vectors, self.num_attention_heads, self.attention_head_size)\n        key_vectors = self._split_hidden_size_dim(key_vectors, self.num_attention_heads, self.attention_head_size)\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # normalize key vectors\n        key_vectors = key_vectors / torch.sqrt(\n            torch.tensor(self.attention_head_size, device=key_vectors.device, dtype=key_vectors.dtype)\n        )\n\n        # chunk vectors\n        # B x Num_Attn_Head x Seq_Len // chunk_len x chunk_len  x  attn_head_size\n        query_vectors = self._split_seq_length_dim_to(\n            query_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        key_vectors = self._split_seq_length_dim_to(\n            key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        # chunk indices\n        indices = torch.arange(sequence_length, device=query_vectors.device).repeat(\n            batch_size, self.num_attention_heads, 1\n        )\n        query_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, self.num_attention_heads)\n        key_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, 
self.num_attention_heads)\n\n        # append chunks before and after\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n        key_indices = self._look_adjacent(key_indices, self.num_chunks_before, self.num_chunks_after)\n\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        mask = self._compute_attn_mask(query_indices, key_indices, attention_mask, query_key_dots.shape)\n\n        if mask is not None:\n            # get mask tensor depending on half precision or not\n            if query_key_dots.dtype == torch.float16:\n                mask_value = self.mask_value_float16.half()\n            else:\n                mask_value = self.mask_value_float32\n\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # softmax\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del logits\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        assert out_vectors.shape == (batch_size, self.num_attention_heads, sequence_length, self.attention_head_size,)\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LocalSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs)\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask, query_key_dots_shape):\n        mask = None\n\n        # chunk attention mask and look before and after\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, :]\n            attention_mask = self._split_seq_length_dim_to(attention_mask, -1, self.chunk_length, 1)\n            attention_mask_key = self._look_adjacent(attention_mask, self.num_chunks_before, self.num_chunks_after)\n\n        # Causal mask\n        if self.is_decoder is True:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask\n        if attention_mask is not None:\n            # create attn_mask\n            attn_mask = (attention_mask.unsqueeze(-1) * attention_mask_key.unsqueeze(-2)).expand(query_key_dots_shape)\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n        return mask\n\n\nclass ReformerSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        all_head_size = config.num_attention_heads * config.attention_head_size\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = 
nn.Linear(all_head_size, config.hidden_size, bias=False)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ReformerAttention(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.layer_id = layer_id\n        self.attn_layers = config.attn_layers\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        if len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"lsh\":\n            self.self_attention = LSHSelfAttention(config)\n        elif len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"local\":\n            self.self_attention = LocalSelfAttention(config)\n        elif len(set(self.attn_layers)) == 2 and set(self.attn_layers) == set([\"lsh\", \"local\"]):\n            # get correct attn layers\n            if self.attn_layers[self.layer_id] == \"lsh\":\n                self.self_attention = LSHSelfAttention(config)\n            else:\n                self.self_attention = LocalSelfAttention(config)\n        else:\n            raise NotImplementedError(\n                \"Only attn layer types 'lsh' and 'local' exist, but got `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                    self.attn_layers\n                )\n            )\n        self.output = ReformerSelfOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n    ):\n        hidden_states = self.layer_norm(hidden_states)\n\n        # use cached buckets for backprob if buckets not None for LSHSelfAttention\n        self_attention_outputs = self.self_attention(\n            hidden_states=hidden_states,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_attentions=do_output_attentions,\n            buckets=buckets,\n        )\n        attention_output = self.output(self_attention_outputs.hidden_states)\n\n        # add buckets if necessary\n        if hasattr(self_attention_outputs, \"buckets\"):\n            buckets = self_attention_outputs.buckets\n        else:\n            buckets = None\n\n        return AttentionOutput(\n            hidden_states=attention_output, attention_probs=self_attention_outputs.attention_probs, buckets=buckets,\n        )\n\n\nclass ReformerFeedForwardDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        if isinstance(config.hidden_act, str):\n            self.act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.act_fn = config.hidden_act\n\n        self.dense = nn.Linear(config.hidden_size, config.feed_forward_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        hidden_states = self.act_fn(hidden_states)\n        return hidden_states\n\n\nclass ReformerFeedForwardOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = nn.Linear(config.feed_forward_size, 
config.hidden_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ChunkReformerFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense = ReformerFeedForwardDense(config)\n        self.output = ReformerFeedForwardOutput(config)\n\n    def forward(self, attention_output):\n        return apply_chunking_to_forward(\n            self.chunk_size_feed_forward, self.seq_len_dim, self.forward_chunk, attention_output,\n        )\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.layer_norm(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        return self.output(hidden_states)\n\n\nclass ReformerLayer(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.attention = ReformerAttention(config, layer_id)\n        # dropout requires to have the same\n        # seed for forward and backward pass\n        self.attention_seed = None\n        self.feed_forward_seed = None\n\n        self.feed_forward = ChunkReformerFeedForward(config)\n\n    def _init_attention_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            attention layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.attention_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.attention_seed)\n        else:\n            # CPU\n            self.attention_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.attention_seed)\n\n    def _init_feed_forward_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            feed forward layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.feed_forward_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.feed_forward_seed)\n        else:\n            # CPU\n            self.feed_forward_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.feed_forward_seed)\n\n    def forward(\n        self,\n        prev_attn_output,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n    ):\n        with torch.no_grad():\n            # every forward pass we sample a different seed\n            # for dropout and save for forward fn in backward pass\n            # to have correct dropout\n            self._init_attention_seed()\n            attn_outputs = self.attention(\n                
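# the attention block is f(X_2) in the reversible-residual equations noted below\n                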
hidden_states=hidden_states,\n                head_mask=head_mask,\n                attention_mask=attention_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = attn_outputs.hidden_states\n\n            # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n            # Y_1 = X_1 + f(X_2)\n            attn_output = prev_attn_output + attn_output\n\n            # free memory\n            del prev_attn_output\n\n            # every forward pass we sample a different seed\n            # for dropout and save seed for forward fn in backward\n            # to have correct dropout\n            self._init_feed_forward_seed()\n            # Y_2 = X_2 + g(Y_1)\n            hidden_states = hidden_states + self.feed_forward(attn_output)\n\n        return ReformerOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            attention_probs=attn_outputs.attention_probs,\n            buckets=attn_outputs.buckets,\n        )\n\n    def backward_pass(\n        self,\n        next_attn_output,\n        hidden_states,\n        grad_attn_output,\n        grad_hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        buckets=None,\n    ):\n        # Implements the backward pass for reversible ResNets.\n        # A good blog post on how this works can be found here:\n        # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n        # This code is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n\n        with torch.enable_grad():\n            next_attn_output.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.feed_forward_seed)\n            # g(Y_1)\n            res_hidden_states = self.feed_forward(next_attn_output)\n            res_hidden_states.backward(grad_hidden_states, retain_graph=True)\n\n        with torch.no_grad():\n            # X_2 = Y_2 - g(Y_1)\n            hidden_states = hidden_states - res_hidden_states\n            del res_hidden_states\n\n            grad_attn_output = grad_attn_output + next_attn_output.grad\n            next_attn_output.grad = None\n\n        with torch.enable_grad():\n            hidden_states.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.attention_seed)\n            # f(X_2)\n            # use cached buckets for backprob if buckets not None for LSHSelfAttention\n            output = self.attention(\n                hidden_states=hidden_states, head_mask=head_mask, attention_mask=attention_mask, buckets=buckets,\n            ).hidden_states\n            output.backward(grad_attn_output, retain_graph=True)\n\n        with torch.no_grad():\n            # X_1 = Y_1 - f(X_2)\n            attn_output = next_attn_output - output\n            del output, next_attn_output\n\n            grad_hidden_states = grad_hidden_states + hidden_states.grad\n            hidden_states.grad = None\n            hidden_states = hidden_states.detach()\n\n        return ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n\nclass _ReversibleFunction(Function):\n    \"\"\"\n    To 
prevent PyTorch from performing the usual backpropagation,\n    a customized backward function is implemented here. This way\n    it is made sure that no memory expensive activations are\n    saved during the forward pass.\n    This function is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n    \"\"\"\n\n    @staticmethod\n    def forward(\n        ctx,\n        hidden_states,\n        layers,\n        attention_mask,\n        head_mask,\n        num_hashes,\n        all_hidden_states,\n        all_attentions,\n        do_output_hidden_states,\n        do_output_attentions,\n    ):\n        all_buckets = ()\n\n        # split duplicated tensor\n        hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)\n\n        for layer, layer_head_mask in zip(layers, head_mask):\n            if do_output_hidden_states is True:\n                all_hidden_states.append(hidden_states)\n\n            layer_outputs = layer(\n                prev_attn_output=attn_output,\n                hidden_states=hidden_states,\n                attention_mask=attention_mask,\n                head_mask=layer_head_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = layer_outputs.attn_output\n            hidden_states = layer_outputs.hidden_states\n            all_buckets = all_buckets + (layer_outputs.buckets,)\n\n            if do_output_attentions:\n                all_attentions.append(layer_outputs.attention_probs)\n\n        # Add last layer\n        if do_output_hidden_states is True:\n            all_hidden_states.append(hidden_states)\n\n        # attach params to ctx for backward\n        ctx.save_for_backward(attn_output.detach(), hidden_states.detach())\n        ctx.layers = layers\n        ctx.all_buckets = all_buckets\n        ctx.head_mask = head_mask\n        ctx.attention_mask = attention_mask\n\n        # Concatenate 2 RevNet outputs\n        return torch.cat([attn_output, hidden_states], dim=-1)\n\n    @staticmethod\n    def backward(ctx, grad_hidden_states):\n        grad_attn_output, grad_hidden_states = torch.chunk(grad_hidden_states, 2, dim=-1)\n\n        # retrieve params from ctx for backward\n        attn_output, hidden_states = ctx.saved_tensors\n\n        # create tuple\n        output = ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n        # free memory\n        del grad_attn_output, grad_hidden_states, attn_output, hidden_states\n\n        layers = ctx.layers\n        all_buckets = ctx.all_buckets\n        head_mask = ctx.head_mask\n        attention_mask = ctx.attention_mask\n\n        for idx, layer in enumerate(layers[::-1]):\n            # pop last buckets from stack\n            buckets = all_buckets[-1]\n            all_buckets = all_buckets[:-1]\n\n            # backprop\n            output = layer.backward_pass(\n                next_attn_output=output.attn_output,\n                hidden_states=output.hidden_states,\n                grad_attn_output=output.grad_attn_output,\n                grad_hidden_states=output.grad_hidden_states,\n                head_mask=head_mask[len(layers) - idx - 1],\n                attention_mask=attention_mask,\n                buckets=buckets,\n            )\n\n        assert all_buckets == (), \"buckets have 
to be empty after backpropagation\"\n        grad_hidden_states = torch.cat([output.grad_attn_output, output.grad_hidden_states], dim=-1)\n\n        # num of return vars has to match num of forward() args\n        # return gradient for hidden_states arg and None for other args\n        return grad_hidden_states, None, None, None, None, None, None, None, None\n\n\nclass ReformerEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.layers = nn.ModuleList([ReformerLayer(config, i) for i in range(config.num_hidden_layers)])\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.layer_norm = nn.LayerNorm(2 * config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        # hidden_states and attention lists to be filled if wished\n        all_hidden_states = []\n        all_attentions = []\n\n        # concat same tensor for reversible ResNet\n        hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)\n        hidden_states = _ReversibleFunction.apply(\n            hidden_states,\n            self.layers,\n            attention_mask,\n            head_mask,\n            num_hashes,\n            all_hidden_states,\n            all_attentions,\n            do_output_hidden_states,\n            do_output_attentions,\n        )\n\n        # Apply layer norm to concatenated hidden states\n        hidden_states = self.layer_norm(hidden_states)\n\n        # Apply dropout\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n\n        return ReformerEncoderOutput(\n            hidden_states=hidden_states, all_hidden_states=all_hidden_states, all_attentions=all_attentions\n        )\n\n\nclass ReformerOnlyLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.seq_len_dim = 1\n        self.chunk_size_lm_head = config.chunk_size_lm_head\n        self.decoder = nn.Linear(2 * config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass ReformerPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ReformerConfig\n    base_model_prefix = \"reformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"input_ids\": input_ids,\n            \"attention_mask\": input_mask,\n        }\n        
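# minimal inputs used by the PreTrainedModel base class (e.g. when tracing)\n        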
return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, AxialPositionEmbeddings):\n            for weight in module.weights:\n                torch.nn.init.normal_(weight, std=self.config.axial_norm_std)\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nREFORMER_START_DOCSTRING = r\"\"\"\n    Reformer was proposed in\n    `Reformer: The Efficient Transformer`_\n    by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.\n\n    .. _`Reformer: The Efficient Transformer`:\n        https://arxiv.org/abs/2001.04451\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ReformerConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nREFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            During training the input_ids sequence_length has to be a multiple of the relevant model's\n            chunk lengths (lsh's, local's or both). During evaluation, the indices are automatically\n            padded to be a multiple of the chunk length.\n\n            Indices can be obtained using :class:`transformers1.ReformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        num_hashes (:obj:`int`, `optional`, defaults to :obj:`None`):\n            `num_hashes` is the number of hashing rounds that should be performed during\n            bucketing. Setting `num_hashes` overwrites the default `num_hashes` defined\n            in `config.num_hashes`.\n            For more information, see `num_hashes` in :class:`transformers1.ReformerConfig`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Reformer Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    REFORMER_START_DOCSTRING,\n)\nclass ReformerModel(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        assert (\n            self.config.num_hidden_layers > 0\n        ), \"`config.attn_layers` is empty. Select at least one attn layer form ['lsh', 'local']\"\n\n        self.embeddings = ReformerEmbeddings(config)\n        self.encoder = ReformerEncoder(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding 
outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModel, ReformerTokenizer\n        import torch\n\n        tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModel.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n\n        # TODO(PVP): delete when PR to change output_attentions is made\n        do_output_attentions = self.config.output_attentions\n        do_output_hidden_states = self.config.output_hidden_states\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()  # noqa: F841\n            device = input_ids.device\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]  # noqa: F841\n            device = inputs_embeds.device\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        assert (\n            len(input_shape) == 2\n        ), \"`input_ids` have be of shape `[batch_size, sequence_length]`, but got shape: {}\".format(input_shape)\n\n        # prepare head mask\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers, is_attention_chunked=True)\n\n        # original sequence length for padding\n        orig_sequence_length = input_shape[-1]\n\n        # if needs padding\n        least_common_mult_chunk_length = _get_least_common_mult_chunk_len(self.config)\n        must_pad_to_match_chunk_length = input_shape[-1] % least_common_mult_chunk_length != 0\n\n        if must_pad_to_match_chunk_length:\n            padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length\n\n            if self.training is True:\n                raise ValueError(\n                    \"If training, sequence Length {} has to be a multiple of least common multiple chunk_length {}. 
Please consider padding the input to a length of {}.\".format(\n                        input_shape[-1], least_common_mult_chunk_length, input_shape[-1] + padding_length\n                    )\n                )\n\n            # pad input\n            input_ids, inputs_embeds, attention_mask, position_ids, input_shape = self._pad_to_mult_of_chunk_length(\n                input_ids,\n                inputs_embeds=inputs_embeds,\n                attention_mask=attention_mask,\n                position_ids=position_ids,\n                input_shape=input_shape,\n                padding_length=padding_length,\n                padded_seq_length=least_common_mult_chunk_length,\n                device=device,\n            )\n\n        embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, inputs_embeds=inputs_embeds)\n\n        encoder_outputs = self.encoder(\n            hidden_states=embedding_output,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n        sequence_output = encoder_outputs.hidden_states\n\n        # if padding was applied\n        if must_pad_to_match_chunk_length:\n            sequence_output = sequence_output[:, :orig_sequence_length]\n\n        outputs = (sequence_output,)\n        # TODO(PVP): Replace by named tuple after namedtuples are introduced in the library.\n        if do_output_hidden_states is True:\n            outputs = outputs + (encoder_outputs.all_hidden_states,)\n        if do_output_attentions is True:\n            outputs = outputs + (encoder_outputs.all_attentions,)\n        return outputs\n\n    def _pad_to_mult_of_chunk_length(\n        self,\n        input_ids,\n        inputs_embeds=None,\n        attention_mask=None,\n        position_ids=None,\n        input_shape=None,\n        padding_length=None,\n        padded_seq_length=None,\n        device=None,\n    ):\n        logger.info(\n            \"Input ids are automatically padded from {} to {} to be a multiple of `config.chunk_length`: {}\".format(\n                input_shape[-1], input_shape[-1] + padding_length, padded_seq_length\n            )\n        )\n\n        padded_input_ids = torch.full(\n            (input_shape[0], padding_length), self.config.pad_token_id, device=device, dtype=torch.long,\n        )\n\n        # Extend `attention_mask`\n        if attention_mask is not None:\n            attention_mask = torch.cat(\n                [\n                    attention_mask,\n                    torch.zeros(input_shape[0], padding_length, device=device, dtype=attention_mask.dtype,),\n                ],\n                dim=-1,\n            )\n        else:\n            attention_mask = torch.cat(\n                [\n                    torch.ones(input_shape, device=device, dtype=torch.uint8),\n                    torch.zeros((input_shape[0], padding_length), device=device, dtype=torch.uint8),\n                ],\n                dim=-1,\n            )\n\n        # Extend `input_ids` with padding to match least common multiple chunk_length\n        if input_ids is not None:\n            input_ids = torch.cat([input_ids, padded_input_ids], dim=-1)\n            input_shape = input_ids.size()\n\n            # Pad position ids if given\n            if position_ids is not None:\n                padded_position_ids = torch.arange(input_shape[-1], padded_seq_length, 
dtype=torch.long, device=device)\n                padded_position_ids = position_ids.unsqueeze(0).expand(input_shape[0], padding_length)\n                position_ids = torch.cat([position_ids, padded_position_ids], dim=-1)\n\n        # Extend `inputs_embeds` with padding to match least common multiple chunk_length\n        if inputs_embeds is not None:\n            padded_inputs_embeds = self.embeddings(padded_input_ids, position_ids)\n            inputs_embeds = torch.cat([inputs_embeds, padded_inputs_embeds], dim=-2)\n            input_shape = inputs_embeds.size()\n        return input_ids, inputs_embeds, attention_mask, position_ids, input_shape\n\n\n@add_start_docstrings(\"\"\"Reformer Model with a `language modeling` head on top. \"\"\", REFORMER_START_DOCSTRING)\nclass ReformerModelWithLMHead(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.reformer = ReformerModel(config)\n        self.lm_head = ReformerOnlyLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def tie_weights(self):\n        # word embeddings are not tied in Reformer\n        pass\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        position_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        labels=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModelWithLMHead, ReformerTokenizer\n        import torch\n\n   
     tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModelWithLMHead.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n\n        loss, prediction_scores = outputs[:2]\n        \"\"\"\n\n        reformer_outputs = self.reformer(\n            input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n\n        sequence_output = reformer_outputs[0]\n        logits = self.lm_head(sequence_output)\n        outputs = (logits,) + reformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n        return outputs  # (lm_loss), lm_logits, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # TODO(PVP): Add smart caching\n        inputs_dict = {\"input_ids\": input_ids}\n\n        if \"num_hashes\" in kwargs:\n            inputs_dict[\"num_hashes\"] = kwargs[\"num_hashes\"]\n\n        return inputs_dict\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertLayerNorm, BertModel, BertPreTrainedModel, gelu\nfrom .modeling_utils import create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    \"roberta-base-openai-detector\",\n    \"roberta-large-openai-detector\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass RobertaEmbeddings(BertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.padding_idx = config.pad_token_id\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=self.padding_idx)\n        self.position_embeddings = nn.Embedding(\n            config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx\n        )\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. Any padded tokens remain padded.\n                position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx).to(input_ids.device)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super().forward(\n            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds\n        )\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. 
We cannot infer which are padded so just generate\n        sequential position ids.\n\n        :param torch.Tensor inputs_embeds:\n        :return torch.Tensor:\n        \"\"\"\n        input_shape = inputs_embeds.size()[:-1]\n        sequence_length = input_shape[1]\n\n        position_ids = torch.arange(\n            self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device\n        )\n        return position_ids.unsqueeze(0).expand(input_shape)\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaModel(BertModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.BertModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = RobertaEmbeddings(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. \"\"\", ROBERTA_START_DOCSTRING)\nclass RobertaForMaskedLM(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary 
token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMaskedLM\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\nclass RobertaLMHead(nn.Module):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, features, **kwargs):\n        x = self.dense(features)\n        x = gelu(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x)\n\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForSequenceClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.classifier = RobertaClassificationHead(config)\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForSequenceClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = 
(logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForMultipleChoice(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        labels=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMultipleChoice\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMultipleChoice.from_pretrained('roberta-base')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        outputs = self.roberta(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            head_mask=head_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForTokenClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForTokenClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are 
here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\nclass RobertaClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForQuestionAnswering(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape 
:obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint roberta-large is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import RobertaTokenizer, RobertaForQuestionAnswering\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForQuestionAnswering.from_pretrained('roberta-base')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_ids = tokenizer.encode(question, text)\n        start_scores, end_scores = model(torch.tensor([input_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, 
end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 Mesh TensorFlow authors, T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch T5 model. \"\"\"\n\n\nimport copy\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n####################################################\n# This dict contrains shortcut names and associated url\n# for the pretrained weights provided with the models\n####################################################\nT5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n\n####################################################\n# This is a conversion method from TF 1.0 to PyTorch\n# More details: https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28\n####################################################\ndef load_tf_weights_in_t5(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        tf_weights[name] = array\n\n    for txt_name in names:\n        name = txt_name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        if \"_slot_\" in name[-1]:\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        pointer = model\n        array = tf_weights[txt_name]\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] in [\"kernel\", \"scale\", \"embedding\"]:\n                pointer = getattr(pointer, \"weight\")\n            # elif scope_names[0] == 'scale':\n            #     pointer = getattr(pointer, 'weight')\n            # elif scope_names[0] == 'output_bias' or scope_names[0] == 'beta':\n            #     pointer = getattr(pointer, 'bias')\n            # elif scope_names[0] == 'squad':\n            #     pointer = getattr(pointer, 'classifier')\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if scope_names[0] not in [\"kernel\", \"scale\", \"embedding\"]:\n            pointer = getattr(pointer, \"weight\")\n        if scope_names[0] != \"embedding\":\n            logger.info(\"Transposing numpy weight of shape {} for {}\".format(array.shape, name))\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array.astype(np.float32))\n        tf_weights.pop(txt_name, None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    # logger.info(\"Weights not copied to PyTorch model: {}\".format(', '.join(tf_weights.keys())))\n    return model\n\n\n####################################################\n# PyTorch Models are constructed by sub-classing\n# - torch.nn.Module for the layers and\n# - PreTrainedModel for the models (it-self a sub-class of 
torch.nn.Module)\n####################################################\n\n\nclass T5LayerNorm(nn.Module):\n    def __init__(self, hidden_size, eps=1e-6):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__()\n        self.weight = nn.Parameter(torch.ones(hidden_size))\n        self.variance_epsilon = eps\n\n    def forward(self, x):\n        # layer norm should always be calculated in float32\n        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)\n        x = x / torch.sqrt(variance + self.variance_epsilon)\n\n        if self.weight.dtype == torch.float16:\n            x = x.to(torch.float16)\n        return self.weight * x\n\n\nclass T5DenseReluDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)\n        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        h = self.wi(hidden_states)\n        h = F.relu(h)\n        h = self.dropout(h)\n        h = self.wo(h)\n        return h\n\n\nclass T5LayerFF(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.DenseReluDense = T5DenseReluDense(config)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x)\n        layer_output = hidden_states + self.dropout(y)\n        return layer_output\n\n\nclass T5Attention(nn.Module):\n    def __init__(self, config: T5Config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.dropout = config.dropout_rate\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, self.d_kv)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q = prune_linear_layer(self.q, index)\n        self.k = prune_linear_layer(self.k, index)\n        self.v = prune_linear_layer(self.v, index)\n        self.o = prune_linear_layer(self.o, index, dim=1)\n      
  # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.inner_dim = self.d_kv * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += (n < 0).to(torch.long) * num_buckets  # mtf.to_int32(mtf.less(n, 0)) * num_buckets\n            n = torch.abs(n)\n        else:\n            n = torch.max(n, torch.zeros_like(n))\n        # now n is in the range [0, inf)\n\n        # half of the buckets are for exact increments in positions\n        max_exact = num_buckets // 2\n        is_small = n < max_exact\n\n        # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance\n        val_if_large = max_exact + (\n            torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)\n        ).to(torch.long)\n        val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))\n\n        ret += torch.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = torch.arange(qlen, dtype=torch.long)[:, None]\n        memory_position = torch.arange(klen, dtype=torch.long)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position,  # shape (qlen, klen)\n            bidirectional=not self.is_decoder,\n            num_buckets=self.relative_attention_num_buckets,\n        )\n        rp_bucket = rp_bucket.to(self.relative_attention_bias.weight.device)\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def forward(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        
position_bias=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = input.size()\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + past_key_value_state[0].shape[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = kv.size(1)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, self.d_kv).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.inner_dim)\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = torch.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  
# (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass T5LayerSelfAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.SelfAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5LayerCrossAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.EncDecAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        query_length=None,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            query_length=query_length,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5Block(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.layer = nn.ModuleList()\n        self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n        if self.is_decoder:\n            self.layer.append(T5LayerCrossAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n\n        self.layer.append(T5LayerFF(config))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n\n  
      if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. Need to inject it here\n            if present_key_value_state is not None:\n                query_length = present_key_value_state[0].shape[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass T5PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    load_tf_weights = load_tf_weights_in_t5\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = 
torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"decoder_input_ids\": input_ids,\n            \"input_ids\": input_ids,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        factor = self.config.initializer_factor  # Used for testing weights initialization\n        if isinstance(module, T5LayerNorm):\n            module.weight.data.fill_(factor * 1.0)\n        elif isinstance(module, (T5Model, T5ForConditionalGeneration)):\n            # Mesh TensorFlow embeddings initialization\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624\n            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)\n        elif isinstance(module, T5DenseReluDense):\n            # Mesh TensorFlow FF initialization\n            # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56\n            # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89\n            module.wi.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))\n            if hasattr(module.wi, \"bias\") and module.wi.bias is not None:\n                module.wi.bias.data.zero_()\n            module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))\n            if hasattr(module.wo, \"bias\") and module.wo.bias is not None:\n                module.wo.bias.data.zero_()\n        elif isinstance(module, T5Attention):\n            # Mesh TensorFlow attention initialization to avoid scaling before softmax\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136\n            d_model = self.config.d_model\n            d_kv = self.config.d_kv\n            n_heads = self.config.num_heads\n            module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * d_kv) ** -0.5))\n            module.k.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.v.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * d_kv) ** -0.5))\n            if module.has_relative_attention_bias:\n                module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))\n\n    def _shift_right(self, input_ids):\n        decoder_start_token_id = self.config.decoder_start_token_id\n        pad_token_id = self.config.pad_token_id\n\n        assert (\n            decoder_start_token_id is not None\n        ), \"self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. 
See T5 docs for more information\"\n\n        # shift inputs to the right\n        shifted_input_ids = input_ids.new_zeros(input_ids.shape)\n        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()\n        shifted_input_ids[..., 0] = decoder_start_token_id\n\n        assert pad_token_id is not None, \"self.model.config.pad_token_id has to be defined.\"\n        # replace possible -100 values in lm_labels by `pad_token_id`\n        shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)\n\n        assert torch.all(shifted_input_ids >= 0).item(), \"Verify that `lm_labels` has only positive values and -100\"\n\n        return shifted_input_ids\n\n\nclass T5Stack(T5PreTrainedModel):\n    def __init__(self, config, embed_tokens=None):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.block = nn.ModuleList(\n            [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]\n        )\n        self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embed_tokens = new_embeddings\n\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            if self.is_decoder:\n                raise ValueError(\"You have to specify either decoder_input_ids or decoder_inputs_embeds\")\n            else:\n                raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to initialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(input_ids)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_states\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = past_key_value_states[0][0].shape[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is 
not None:\n            encoder_seq_length = encoder_hidden_states.shape[1]\n            encoder_attention_mask = torch.ones(\n                batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long\n            )\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_layers)\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)  # We keep only self-attention weights for now\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as 
a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (presents,) (all hidden states), (all attentions)\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            To know more on how to prepare :obj:`input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all `decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `decoder_past_key_value_states`).\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass T5Model(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = 
copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n            from transformers1 import T5Tokenizer, T5Model\n\n            tokenizer = T5Tokenizer.from_pretrained('t5-small')\n            model = T5Model.from_pretrained('t5-small')\n        
    input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass T5ForConditionalGeneration(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.model_dim = config.d_model\n\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        lm_labels=None,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            If `past_key_value_states` is used only the last prediction_scores of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output 
of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, T5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model.generate(input_ids)\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:\n            # get decoder inputs from shifting lm labels to the right\n            decoder_input_ids = self._shift_right(lm_labels)\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            assert lm_labels is None, \"Decoder should not use cached key value states when training.\"\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0]\n        # Rescale output before projecting on vocab\n        # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586\n        sequence_output = sequence_output * (self.model_dim ** 
-0.5)\n        lm_logits = self.lm_head(sequence_output)\n\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]  # Add hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = CrossEntropyLoss(ignore_index=-100)\n            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))\n            # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666\n            decoder_outputs = (loss,) + decoder_outputs\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"decoder_input_ids\": input_ids,\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (\n                    layer_past_state.index_select(0, beam_idx),\n                )\n\n            assert reordered_layer_past_states[0].shape == layer_past_states[0].shape\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 ALBERT model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertSelfAttention\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\nclass TFAlbertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.config.vocab_size, self.config.embedding_size],\n                initializer=get_initializer(self.config.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, embedding_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n        x = tf.reshape(inputs, [-1, self.config.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n        return tf.reshape(logits, [batch_size, length, self.config.vocab_size])\n\n\nclass TFAlbertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n           
     \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFAlbertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFAlbertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, 
kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFAlbertAttention(TFBertSelfAttention):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.hidden_size = config.hidden_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(input_tensor)[0]\n        mixed_query_layer = self.query(input_tensor)\n        mixed_key_layer = self.key(input_tensor)\n        mixed_value_layer = self.value(input_tensor)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        self_outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n\n        hidden_states = self_outputs[0]\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        attention_output = self.LayerNorm(hidden_states + input_tensor)\n\n        # add attentions if we output them\n        outputs = (attention_output,) 
+ self_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFAlbertAttention(config, name=\"attention\")\n\n        self.ffn = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn\"\n        )\n\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.ffn_output = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn_output\"\n        )\n        self.full_layer_layer_norm = tf.keras.layers.LayerNormalization(\n            epsilon=config.layer_norm_eps, name=\"full_layer_layer_norm\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        ffn_output = self.ffn(attention_outputs[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_outputs[0])\n\n        # add attentions if we output them\n        outputs = (hidden_states,) + attention_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayerGroup(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = [\n            TFAlbertLayer(config, name=\"albert_layers_._{}\".format(i)) for i in range(config.inner_group_num)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer([hidden_states, attention_mask, head_mask[layer_index]], training=training)\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        # last-layer hidden state, (layer hidden states), (layer attentions)\n        return outputs\n\n\nclass TFAlbertTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            
name=\"embedding_hidden_mapping_in\",\n        )\n        self.albert_layer_groups = [\n            TFAlbertLayerGroup(config, name=\"albert_layer_groups_._{}\".format(i))\n            for i in range(config.num_hidden_groups)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                [\n                    hidden_states,\n                    attention_mask,\n                    head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n                ],\n                training=training,\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n\n        # last-layer hidden state, (all hidden states), (all attentions)\n        return outputs\n\n\nclass TFAlbertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n\nclass TFAlbertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        self.dense = tf.keras.layers.Dense(\n            config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        self.decoder_bias = self.add_weight(\n            shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"decoder/bias\"\n        )\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states, mode=\"linear\") + 
self.decoder_bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFAlbertMainLayer(tf.keras.layers.Layer):\n    config_class = AlbertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFAlbertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFAlbertTransformer(config, name=\"encoder\")\n        self.pooler = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"pooler\",\n        )\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # 
Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output[:, 0])\n\n        # add hidden_states and attentions if they are here\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]\n        # sequence_output, pooled_output, (hidden_states), (attentions)\n        return outputs\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:\n        https://arxiv.org/abs/1909.11942\n\n    .. _`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Albert Model transformer outputing raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertModel(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n        Returns:\n            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n            last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n                Sequence of hidden-states at the output of the last layer of the model.\n            pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Albert pretraining. 
This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n                tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            import tensorflow as tf\n            from transformers1 import AlbertTokenizer, TFAlbertModel\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = TFAlbertModel.from_pretrained('albert-base-v2')\n            input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n            outputs = model(input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top for pre-training:\n    a `masked language modeling` head and a `sentence order prediction` (classification) head. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForPreTraining(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n        self.sop_classifier = TFAlbertSOPHead(config, name=\"sop_classifier\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n    Examples::\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForPreTraining\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, sop_scores = outputs[:2]\n        \"\"\"\n\n        outputs = self.albert(inputs, **kwargs)\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output, training=kwargs.get(\"training\", False))\n        outputs = (prediction_scores, sop_scores) + outputs[2:]\n        return outputs\n\n\nclass TFAlbertSOPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\",\n        )\n\n    def call(self, pooled_output, training: bool):\n        dropout_pooled_output = self.dropout(pooled_output, training=training)\n        logits = 
self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\"\"\"Albert Model with a `language modeling` head on top. \"\"\", ALBERT_START_DOCSTRING)\nclass TFAlbertForMaskedLM(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMaskedLM\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.predictions(sequence_output, training=kwargs.get(\"training\", False))\n\n        # Add hidden states and attention if they are here\n        outputs = (prediction_scores,) + outputs[2:]\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForSequenceClassification(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`)\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForSequenceClassification\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForQuestionAnswering\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForMultipleChoice(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMultipleChoice\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMultipleChoice.from_pretrained('albert-base-v2')\n\n        example1 = [\"This is a context\", \"Is it a context? Yes\"]\n        example2 = [\"This is a context\", \"Is it a context? 
No\"]\n        encoding = tokenizer.batch_encode_plus([example1, example2], return_tensors='tf', truncation_strategy=\"only_first\", pad_to_max_length=True, max_length=128)\n        outputs = model(encoding[\"input_ids\"][None, :])\n        logits = outputs[0]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.albert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLNetConfig,\n)\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_tf_albert import (\n    TFAlbertForMaskedLM,\n    TFAlbertForMultipleChoice,\n    TFAlbertForPreTraining,\n    TFAlbertForQuestionAnswering,\n    TFAlbertForSequenceClassification,\n    TFAlbertModel,\n)\nfrom .modeling_tf_bert import (\n    TFBertForMaskedLM,\n    TFBertForMultipleChoice,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFBertForTokenClassification,\n    TFBertModel,\n)\nfrom .modeling_tf_ctrl import TFCTRLLMHeadModel, TFCTRLModel\nfrom .modeling_tf_distilbert import (\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFDistilBertForSequenceClassification,\n    TFDistilBertForTokenClassification,\n    TFDistilBertModel,\n)\nfrom .modeling_tf_gpt2 import TFGPT2LMHeadModel, TFGPT2Model\nfrom .modeling_tf_openai import TFOpenAIGPTLMHeadModel, TFOpenAIGPTModel\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForQuestionAnswering,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\nfrom .modeling_tf_t5 import TFT5ForConditionalGeneration, TFT5Model\nfrom .modeling_tf_transfo_xl import TFTransfoXLLMHeadModel, TFTransfoXLModel\nfrom .modeling_tf_xlm import (\n    TFXLMForQuestionAnsweringSimple,\n    TFXLMForSequenceClassification,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n)\nfrom .modeling_tf_xlnet import (\n    TFXLNetForQuestionAnsweringSimple,\n    TFXLNetForSequenceClassification,\n    TFXLNetForTokenClassification,\n    TFXLNetLMHeadModel,\n    TFXLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_MODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5Model),\n        (DistilBertConfig, TFDistilBertModel),\n        (AlbertConfig, TFAlbertModel),\n        (RobertaConfig, TFRobertaModel),\n        (BertConfig, TFBertModel),\n        (OpenAIGPTConfig, TFOpenAIGPTModel),\n        (GPT2Config, TFGPT2Model),\n        (TransfoXLConfig, TFTransfoXLModel),\n        (XLNetConfig, TFXLNetModel),\n        (XLMConfig, TFXLMModel),\n        (CTRLConfig, TFCTRLModel),\n    ]\n)\n\nTF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForPreTraining),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForPreTraining),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n     
   (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForMaskedLM),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForMaskedLM),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n        (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForSequenceClassification),\n        (AlbertConfig, TFAlbertForSequenceClassification),\n        (RobertaConfig, TFRobertaForSequenceClassification),\n        (BertConfig, TFBertForSequenceClassification),\n        (XLNetConfig, TFXLNetForSequenceClassification),\n        (XLMConfig, TFXLMForSequenceClassification),\n    ]\n)\n\nTF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [(BertConfig, TFBertForMultipleChoice), (AlbertConfig, TFAlbertForMultipleChoice)]\n)\n\nTF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForQuestionAnswering),\n        (AlbertConfig, TFAlbertForQuestionAnswering),\n        (RobertaConfig, TFRobertaForQuestionAnswering),\n        (BertConfig, TFBertForQuestionAnswering),\n        (XLNetConfig, TFXLNetForQuestionAnsweringSimple),\n        (XLMConfig, TFXLMForQuestionAnsweringSimple),\n    ]\n)\n\nTF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForTokenClassification),\n        (RobertaConfig, TFRobertaForTokenClassification),\n        (BertConfig, TFBertForTokenClassification),\n        (XLNetConfig, TFXLNetForTokenClassification),\n    ]\n)\n\n\nclass TFAutoModel(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModel is designed to be instantiated \"\n            \"using the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: TFDistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: TFRobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: TFBertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: TFOpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: TFGPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: TFCTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TFTransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: TFXLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: TFXLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFTFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. 
Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModel.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForPreTraining(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForPreTraining is designed to be instantiated \"\n            \"using the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.TFDistilBertModelForMaskedLM` (DistilBERT model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.TFRobertaModelForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.TFGPT2ModelLMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.TFCTRLModelLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.TFT5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.TFDistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.TFAlbertForPreTraining` (ALBERT model)\n            - `roberta`: :class:`~transformers1.TFRobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.TFGPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet 
model)\n            - `xlm`: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.TFCTRLLMHeadModel` (Salesforce CTRL model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model.\n                (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or\n                automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the\n                  underlying model's ``__init__`` method (we assume all relevant updates to the configuration have\n                  already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class\n                  initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of\n                  ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute\n                  with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration\n                  attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelWithLMHead(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelWithLMHead is designed to be instantiated \"\n            \"using the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: OpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: GPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: CTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights 
saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelWithLMHead.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFAlbertForMultipleChoice (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `albert` configuration class: AlbertModel (Albert model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForMulitpleChoice.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the multiple choice model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFRobertaForMultiple (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). 
In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForMultipleChoice.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForMultipleChoice.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForSequenceClassification(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the 
base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a 
`PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForSequenceClassification.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForQuestionAnswering(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def 
from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `albert` configuration class: AlbertModel (ALBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path 
to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForTokenClassification:\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBert model)\n                    - isInstance of `roberta` configuration class: RobertaModel (Roberta model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the token classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `bert`: BertForTokenClassification (Bert model)\n            - `xlnet`: XLNetForTokenClassification (XLNet model)\n            - `distilbert`: DistilBertForTokenClassification (DistilBert model)\n            - `roberta`: RobertaForTokenClassification (Roberta model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. 
Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 BERT model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_bert import BertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n    \"gelu_new\": tf.keras.layers.Activation(gelu_new),\n}\n\n\nclass TFBertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.hidden_size = config.hidden_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.hidden_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = 
self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFBertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = tf.matmul(\n            query_layer, key_layer, transpose_b=True\n        )  # (batch size, num_heads, seq_len_q, seq_len_k)\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)  # scale attention_scores\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() 
function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFBertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.self_attention = TFBertSelfAttention(config, name=\"self\")\n        self.dense_output = TFBertSelfOutput(config, name=\"output\")\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        self_outputs = self.self_attention([input_tensor, attention_mask, head_mask], training=training)\n        attention_output = self.dense_output([self_outputs[0], input_tensor], training=training)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertIntermediate(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass TFBertOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n      
  self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFBertAttention(config, name=\"attention\")\n        self.intermediate = TFBertIntermediate(config, name=\"intermediate\")\n        self.bert_output = TFBertOutput(config, name=\"output\")\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        attention_output = attention_outputs[0]\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.bert_output([intermediate_output, attention_output], training=training)\n        outputs = (layer_output,) + attention_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertEncoder(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = [TFBertLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.num_hidden_layers)]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # outputs, (hidden states), (attentions)\n\n\nclass TFBertPooler(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n\n    def call(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        return pooled_output\n\n\nclass TFBertPredictionHeadTransform(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass TFBertLMPredictionHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.transform = TFBertPredictionHeadTransform(config, name=\"transform\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\nclass TFBertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.predictions = TFBertLMPredictionHead(config, input_embeddings, name=\"predictions\")\n\n    def call(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass TFBertNSPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.seq_relationship = tf.keras.layers.Dense(\n            2, kernel_initializer=get_initializer(config.initializer_range), name=\"seq_relationship\"\n        )\n\n    def call(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\n@keras_serializable\nclass TFBertMainLayer(tf.keras.layers.Layer):\n    config_class = BertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFBertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.pooler = TFBertPooler(config, name=\"pooler\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        
training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, 
head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\nclass TFBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    base_model_prefix = \"bert\"\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertModel(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. 
This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertModel\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertModel.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training:\n    a `masked language modeling` head and a `next sentence prediction (classification)` head. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForPreTraining(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForPreTraining\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForPreTraining.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. 
\"\"\", BERT_START_DOCSTRING)\nclass TFBertForMaskedLM(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMaskedLM\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMaskedLM.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass TFBertForNextSentencePrediction(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        seq_relationship_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`)\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForNextSentencePrediction\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='tf')\n\n        logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]\n        assert logits[0][0] < logits[0][1] # the next sentence was random\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForSequenceClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForSequenceClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForMultipleChoice(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMultipleChoice\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='tf', pad_to_max_length=True)\n\n        # linear classifier on the output is not yet trained\n        outputs = model(encoding['input_ids'][None, :])\n        logits = outputs[0]\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = 
inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.bert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForTokenClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForTokenClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForTokenClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForQuestionAnswering(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForQuestionAnswering\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :], token_type_ids=tf.constant(token_type_ids)[None, :])\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(tf.squeeze(start_scores)) : tf.math.argmax(tf.squeeze(end_scores))+1])\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CamemBERT model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model_size))\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(np.arange(position)[:, np.newaxis], np.arange(d_model_size)[np.newaxis, :], d_model_size)\n\n    sines = np.sin(angle_rads[:, 0::2])\n    cosines = np.cos(angle_rads[:, 1::2])\n\n    # pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)\n    pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = tf.matmul(q, k, transpose_b=True)\n\n    dk = tf.cast(shape_list(k)[-1], tf.float32)\n    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n\n    if mask is not None:\n        scaled_attention_logits += mask * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = tf.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n    def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = tf.keras.layers.Dense(d_model_size, name=\"Wq\")\n        self.Wk = tf.keras.layers.Dense(d_model_size, name=\"Wk\")\n        self.Wv = tf.keras.layers.Dense(d_model_size, name=\"Wv\")\n\n        self.dense = tf.keras.layers.Dense(d_model_size, name=\"dense\")\n\n    def split_into_heads(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n        return 
tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        v, k, q, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        batch_size = shape_list(q)[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            k = tf.concat((past_key, k), axis=-2)\n            v = tf.concat((past_value, v), axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack((k, v), axis=0)\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model_size))\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff, name=\"\"):\n    return tf.keras.Sequential(\n        [tf.keras.layers.Dense(dff, activation=\"relu\", name=\"0\"), tf.keras.layers.Dense(d_model_size, name=\"2\")],\n        name=\"ffn\",\n    )\n\n\nclass TFEncoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.multi_head_attention = TFMultiHeadAttention(\n            d_model_size, num_heads, output_attentions, name=\"multi_head_attention\"\n        )\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff, name=\"ffn\")\n\n        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm1\")\n        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm2\")\n\n        self.dropout1 = tf.keras.layers.Dropout(rate)\n        self.dropout2 = tf.keras.layers.Dropout(rate)\n\n    def call(self, inputs, training=False):\n        x, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            [normed, normed, normed, mask, layer_past, attention_mask, head_mask, use_cache], training=training\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output, training=training)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output, training=training)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\n@keras_serializable\nclass TFCTRLMainLayer(tf.keras.layers.Layer):\n    config_class = CTRLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        
self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)\n\n        self.w = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"w\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [\n            TFEncoderLayer(\n                config.n_embd,\n                config.n_head,\n                config.dff,\n                config.resid_pdrop,\n                config.layer_norm_epsilon,\n                config.output_attentions,\n                name=\"h_._{}\".format(i),\n            )\n            for i in range(config.n_layer)\n        ]\n        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"layernorm\")\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # If using past key value states, only the last tokens\n        # should be given as an input\n        if past is not None:\n            if input_ids is not None:\n                input_ids = input_ids[:, -1:]\n            if inputs_embeds is not None:\n                inputs_embeds = inputs_embeds[:, -1:]\n            if token_type_ids is not None:\n                token_type_ids = token_type_ids[:, -1:]\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids 
and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n            position_ids = tf.tile(position_ids, [input_shape[0], 1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_layers\n\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.w(token_type_ids, mode=\"embedding\")\n            token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n        else:\n            token_type_embeds = 0\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids, mode=\"embedding\")\n        seq_len = input_shape[-1]\n        mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)\n\n        inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n\n        pos_embeds = tf.gather(self.pos_encoding, position_ids)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(tf.reshape(hidden_states, output_shape),)\n            outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n            hidden_states, present = outputs[:2]\n\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\nclass TFCTRLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should 
be passed as input_ids (see `past`).\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). 
Defaults to `True`.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLModel.from_pretrained('ctrl')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFCTRLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def 
call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLLMHeadModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n        self.lm_head = TFCTRLLMHead(config, self.transformer.w, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = tf.constant([tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)])\n        outputs = model(input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 DistilBERT model\n\"\"\"\n\n\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFEmbeddings(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dim = config.dim\n        self.initializer_range = config.initializer_range\n        self.word_embeddings = TFSharedEmbeddings(\n            config.vocab_size, config.dim, initializer_range=config.initializer_range, name=\"word_embeddings\"\n        )  # padding_idx=0)\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.dim,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"position_embeddings\",\n        )\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create 
and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\", shape=[self.vocab_size, self.dim], initializer=get_initializer(self.initializer_range)\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, inputs_embeds=None, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, inputs_embeds=inputs_embeds, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, inputs_embeds=None, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: tf.Tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: tf.Tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        if not isinstance(inputs, (tuple, list)):\n            input_ids = inputs\n            position_ids = None\n        else:\n            input_ids, position_ids = inputs\n\n        if input_ids is not None:\n            seq_length = shape_list(input_ids)[1]\n        else:\n            seq_length = shape_list(inputs_embeds)[1]\n\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = inputs_embeds + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings, training=training)  # (bs, max_seq_length, dim)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.dim])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFMultiHeadSelfAttention(tf.keras.layers.Layer):\n    
def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"q_lin\"\n        )\n        self.k_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"k_lin\"\n        )\n        self.v_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"v_lin\"\n        )\n        self.out_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"out_lin\"\n        )\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        query: tf.Tensor(bs, seq_length, dim)\n        key: tf.Tensor(bs, seq_length, dim)\n        value: tf.Tensor(bs, seq_length, dim)\n        mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: tf.Tensor(bs, seq_length, dim)\n            Contextualized layer. Optional: only if `output_attentions=True`\n        \"\"\"\n        query, key, value, mask, head_mask = inputs\n        bs, q_length, dim = shape_list(query)\n        k_length = shape_list(key)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshape = [bs, 1, 1, k_length]\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, q_length, k_length)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))            # (bs, n_heads, q_length, k_length)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, 
weights)\n        else:\n            return (context,)\n\n\nclass TFFFN(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.lin1 = tf.keras.layers.Dense(\n            config.hidden_dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin1\"\n        )\n        self.lin2 = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin2\"\n        )\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = (\n            tf.keras.layers.Activation(gelu) if config.activation == \"gelu\" else tf.keras.activations.relu\n        )\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFTransformerBlock(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.hidden_dim = config.hidden_dim\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.activation = config.activation\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = TFMultiHeadSelfAttention(config, name=\"attention\")\n        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"sa_layer_norm\")\n\n        self.ffn = TFFFN(config, name=\"ffn\")\n        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"output_layer_norm\")\n\n    def call(self, inputs, training=False):  # removed: src_enc=None, src_len=None\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n        attn_mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: tf.Tensor(bs, seq_length, dim)\n            The output of the transformer block contextualization.\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        # Self-Attention\n        sa_output = self.attention([x, x, x, attn_mask, head_mask], training=training)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            # assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output, training=training)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass TFTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        
self.output_hidden_states = config.output_hidden_states\n\n        self.layer = [TFTransformerBlock(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layers)]\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: tf.Tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: tf.Tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[tf.Tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[tf.Tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module([hidden_state, attn_mask, head_mask[i]], training=training)\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass TFDistilBertMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFEmbeddings(config, name=\"embeddings\")  # Embeddings\n        self.transformer = TFTransformer(config, name=\"transformer\")  # Encoder\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def call(self, inputs, attention_mask=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = 
inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.ones(input_shape)  # (bs, seq_length)\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n\n        embedding_output = self.embeddings(input_ids, inputs_embeds=inputs_embeds)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer([embedding_output, attention_mask, head_mask], training=training)\n\n        return tfmr_output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass TFDistilBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    base_model_prefix = \"distilbert\"\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputing raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertModel(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")  # Embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertModel\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertModel.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n        return outputs\n\n\nclass TFDistilBertLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, 
input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.vocab_size = config.vocab_size\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.vocab_transform = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"vocab_transform\"\n        )\n        self.act = tf.keras.layers.Activation(gelu)\n        self.vocab_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"vocab_layer_norm\")\n        self.vocab_projector = TFDistilBertLMHead(config, self.distilbert.embeddings, name=\"vocab_projector\")\n\n    def get_output_embeddings(self):\n        return self.vocab_projector.input_embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForMaskedLM\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n    
    prediction_scores = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)\n\n        outputs = (prediction_logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.pre_classifier = tf.keras.layers.Dense(\n            config.dim,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"relu\",\n            name=\"pre_classifier\",\n        )\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForSequenceClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        
hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForTokenClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive 
question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n        assert config.num_labels == 2\n        self.dropout = tf.keras.layers.Dropout(config.qa_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForQuestionAnswering\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n        hidden_states = self.dropout(hidden_states, training=kwargs.get(\"training\", False))  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_electra.py",
    "content": "import logging\n\nimport tensorflow as tf\n\nfrom transformers import ElectraConfig\n\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel\nfrom .modeling_tf_utils import get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\nclass TFElectraEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.embedding_size = config.embedding_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.embedding_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFElectraDiscriminatorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dense = tf.keras.layers.Dense(config.hidden_size, name=\"dense\")\n        self.dense_prediction = tf.keras.layers.Dense(1, name=\"dense_prediction\")\n        self.config = config\n\n    def call(self, 
discriminator_hidden_states, training=False):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = ACT2FN[self.config.hidden_act](hidden_states)\n        logits = tf.squeeze(self.dense_prediction(hidden_states))\n\n        return logits\n\n\nclass TFElectraGeneratorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dense = tf.keras.layers.Dense(config.embedding_size, name=\"dense\")\n\n    def call(self, generator_hidden_states, training=False):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = ACT2FN[\"gelu\"](hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass TFElectraPreTrainedModel(TFBertPreTrainedModel):\n\n    config_class = ElectraConfig\n    base_model_prefix = \"electra\"\n\n    def get_extended_attention_mask(self, attention_mask, input_shape):\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask):\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.config.num_hidden_layers\n\n        return head_mask\n\n\nclass TFElectraMainLayer(TFElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFElectraEmbeddings(config, name=\"embeddings\")\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name=\"embeddings_project\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.config = config\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        
position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)\n        head_mask = self.get_head_mask(head_mask)\n\n        hidden_states = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states, training=training)\n\n        hidden_states = self.encoder([hidden_states, extended_attention_mask, head_mask], training=training)\n\n        return hidden_states\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraModel(TFElectraPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraModel\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraModel.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, 
:]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.electra(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a binary classification head on top as used during pre-training for identifying generated\ntokens.\n\nEven though both the discriminator and generator may be loaded into this model, the discriminator is\nthe only model of the two to have the correct classification head to be used for this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForPreTraining(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name=\"discriminator_predictions\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForPreTraining\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        logits = self.discriminator_predictions(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\nclass 
TFElectraMaskedLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states, training=False):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a language modeling head on top.\n\nEven though both the discriminator and generator may be loaded into this model, the generator is\nthe only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForMaskedLM(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.vocab_size = config.vocab_size\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.generator_predictions = TFElectraGeneratorPredictions(config, name=\"generator_predictions\")\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n        self.generator_lm_head = TFElectraMaskedLMHead(config, self.electra.embeddings, name=\"generator_lm_head\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForMaskedLM\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n        model = 
TFElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        generator_sequence_output = generator_hidden_states[0]\n        prediction_scores = self.generator_predictions(generator_sequence_output, training=training)\n        prediction_scores = self.generator_lm_head(prediction_scores, training=training)\n        output = (prediction_scores,)\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a token classification head on top.\n\nBoth the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForTokenClassification(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(config.num_labels, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForTokenClassification\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, 
position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output, training=training)\n        logits = self.classifier(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Flaubert model.\n\"\"\"\n\nimport logging\nimport random\n\nimport tensorflow as tf\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_xlm import (\n    TFXLMForSequenceClassification,\n    TFXLMMainLayer,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n    get_masks,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertModel(TFXLMModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\nclass TFFlaubertMainLayer(TFXLMMainLayer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and 
inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n     
   attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](\n                    [tensor_normalized, attn_mask, None, cache, head_mask[i]], training=training\n                )\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertWithLMHeadModel(TFXLMWithLMHeadModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertForSequenceClassification(TFXLMForSequenceClassification):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT-2 model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n 
           w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            key = tf.concat([past_key, key], axis=-2)\n            value = tf.concat([past_value, value], axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack([key, value], axis=0)\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.attn = 
TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        a = self.ln_1(x)\n        output_attn = self.attn([a, layer_past, attention_mask, head_mask, use_cache], training=training)\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n        x = x + a\n\n        m = self.ln_2(x)\n        m = self.mlp(m, training=training)\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\n@keras_serializable\nclass TFGPT2MainLayer(tf.keras.layers.Layer):\n    config_class = GPT2Config\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.wte = TFSharedEmbeddings(\n            config.vocab_size, config.hidden_size, initializer_range=config.initializer_range, name=\"wte\"\n        )\n        self.wpe = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"wpe\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n        self.ln_f = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_f\")\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", 
position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids, mode=\"embedding\")\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.wte(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + 
[shape_list(hidden_states)[-1]]\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n\n            hidden_states, present = outputs[:2]\n            presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, presents, (all hidden_states), (attentions)\n\n\nclass TFGPT2PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    base_model_prefix = \"transformer\"\n\n\nGPT2_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputing raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2Model(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2Model\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2Model.from_pretrained('gpt2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n    \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2LMHeadModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2DoubleHeadsModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        use_cache=True,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = tf.constant(encoded_choices)[None, :]  # Batch size: 1, number of choices: 2\n        mc_token_ids = tf.constant([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            mc_token_ids = inputs[7] if len(inputs) > 7 else mc_token_ids\n            use_cache = inputs[8] if len(inputs) > 8 else use_cache\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            past,\n            flat_attention_mask,\n            flat_token_type_ids,\n            
flat_position_ids,\n            head_mask,\n            inputs_embeds,\n            use_cache,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.math.sigmoid(x)\n\n\nACT_FNS = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = 
tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n            w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.attn = TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        output_attn = self.attn([x, attention_mask, head_mask], training=training)\n        a = output_attn[0]  # output_attn: a, (attentions)\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n, training=training)\n        h = self.ln_2(n + m)\n\n        outputs = 
[h] + output_attn[1:]\n        return outputs  # x, (attentions)\n\n\nclass TFOpenAIGPTMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.tokens_embed = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"tokens_embed\"\n        )\n        self.positions_embed = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"positions_embed\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            position_ids = tf.range(input_shape[-1], dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor 
mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids, mode=\"embedding\")\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.tokens_embed(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n\n        all_attentions = []\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions.append(outputs[1])\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last hidden state, (all hidden_states), (attentions)\n\n\nclass TFOpenAIGPTPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    base_model_prefix = \"transformer\"\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputing raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", 
add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTLMHeadModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTLMHeadModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTDoubleHeadsModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTDoubleHeadsModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = tf.constant([tokenizer.encode(s) for s in choices])[None, :]  # Batch size 1, 2 choices\n        mc_token_ids = tf.constant([input_ids.size(-1), input_ids.size(-1)])[None, :]  # Batch size 1\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            mc_token_ids = inputs[6] if len(inputs) > 6 else mc_token_ids\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = 
self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_pytorch_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch - TF 2.0 general utilities.\"\"\"\n\n\nimport logging\nimport os\nimport re\n\nimport numpy\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=\"\"):\n    \"\"\" Convert a TF 2.0 model variable name in a pytorch model weight name.\n\n        Conventions for TF2.0 scopes -> PyTorch attribute names conversions:\n            - '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n            - '_._' is replaced by a new level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n\n        return tuple with:\n            - pytorch model weight name\n            - transpose: boolean indicating weither TF2.0 and PyTorch weights matrices are transposed with regards to each other\n    \"\"\"\n    tf_name = tf_name.replace(\":0\", \"\")  # device ids\n    tf_name = re.sub(\n        r\"/[^/]*___([^/]*)/\", r\"/\\1/\", tf_name\n    )  # '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n    tf_name = tf_name.replace(\n        \"_._\", \"/\"\n    )  # '_._' is replaced by a level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n    tf_name = re.sub(r\"//+\", \"/\", tf_name)  # Remove empty levels at the end\n    tf_name = tf_name.split(\"/\")  # Convert from TF2.0 '/' separators to PyTorch '.' separators\n    tf_name = tf_name[1:]  # Remove level zero\n\n    # When should we transpose the weights\n    transpose = bool(tf_name[-1] == \"kernel\" or \"emb_projs\" in tf_name or \"out_projs\" in tf_name)\n\n    # Convert standard TF2.0 names in PyTorch names\n    if tf_name[-1] == \"kernel\" or tf_name[-1] == \"embeddings\" or tf_name[-1] == \"gamma\":\n        tf_name[-1] = \"weight\"\n    if tf_name[-1] == \"beta\":\n        tf_name[-1] = \"bias\"\n\n    # Remove prefix if needed\n    tf_name = \".\".join(tf_name)\n    if start_prefix_to_remove:\n        tf_name = tf_name.replace(start_prefix_to_remove, \"\", 1)\n\n    return tf_name, transpose\n\n\n#####################\n# PyTorch => TF 2.0 #\n#####################\n\n\ndef load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    pt_path = os.path.abspath(pytorch_checkpoint_path)\n    logger.info(\"Loading PyTorch weights from {}\".format(pt_path))\n\n    pt_state_dict = torch.load(pt_path, map_location=\"cpu\")\n    logger.info(\"PyTorch checkpoint contains {:,} parameters\".format(sum(t.numel() for t in pt_state_dict.values())))\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    pt_state_dict = pt_model.state_dict()\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch state_dict in a TF 2.0 model.\n    \"\"\"\n    try:\n        import torch  # noqa: F401\n        import tensorflow as tf  # noqa: F401\n        from tensorflow.python.keras import backend as K\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    # Adapt state dict - TODO remove this and update the AWS weights files instead\n    # Convert old format to new format if needed from a PyTorch state_dict\n    old_keys = []\n    new_keys = []\n    for key in pt_state_dict.keys():\n        new_key = None\n        if \"gamma\" in key:\n            new_key = key.replace(\"gamma\", \"weight\")\n        if \"beta\" in key:\n            new_key = key.replace(\"beta\", \"bias\")\n        if new_key:\n            old_keys.append(key)\n            new_keys.append(new_key)\n    for old_key, new_key in zip(old_keys, new_keys):\n        pt_state_dict[new_key] = pt_state_dict.pop(old_key)\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(tf_model.base_model_prefix) for s in pt_state_dict.keys()):\n        start_prefix_to_remove = tf_model.base_model_prefix + \".\"\n\n    symbolic_weights = tf_model.trainable_weights + tf_model.non_trainable_weights\n    tf_loaded_numel = 0\n    weight_value_tuples = []\n    all_pytorch_weights = set(list(pt_state_dict.keys()))\n    for symbolic_weight in symbolic_weights:\n        sw_name = symbolic_weight.name\n        name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            sw_name, start_prefix_to_remove=start_prefix_to_remove\n        )\n\n        # Find associated numpy array in pytorch model state dict\n        if name not in pt_state_dict:\n            if allow_missing_keys:\n                continue\n\n            raise AttributeError(\"{} not found in PyTorch model\".format(name))\n\n        array = pt_state_dict[name].numpy()\n\n        if 
transpose:\n            array = numpy.transpose(array)\n\n        if len(symbolic_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(symbolic_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(symbolic_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (symbolic_weight.shape, array.shape)\n            raise e\n\n        tf_loaded_numel += array.size\n        # logger.warning(\"Initialize TF weight {}\".format(symbolic_weight.name))\n\n        weight_value_tuples.append((symbolic_weight, array))\n        all_pytorch_weights.discard(name)\n\n    K.batch_set_value(weight_value_tuples)\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure restore ops are run\n\n    logger.info(\"Loaded {:,} parameters in the TF 2.0 model.\".format(tf_loaded_numel))\n\n    logger.info(\"Weights or buffers not loaded from PyTorch model: {}\".format(all_pytorch_weights))\n\n    return tf_model\n\n\n#####################\n# TF 2.0 => PyTorch #\n#####################\n\n\ndef load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 HDF5 checkpoint in a PyTorch model\n        We use HDF5 to easily do transfer learning\n        (see https://github.com/tensorflow/tensorflow/blob/ee16fcac960ae660e0e4496658a366e2f745e1f0/tensorflow/python/keras/engine/network.py#L1352-L1357).\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    import transformers\n\n    logger.info(\"Loading TensorFlow weights from {}\".format(tf_checkpoint_path))\n\n    # Instantiate and load the associated TF 2.0 model\n    tf_model_class_name = \"TF\" + pt_model.__class__.__name__  # Add \"TF\" at the beggining\n    tf_model_class = getattr(transformers, tf_model_class_name)\n    tf_model = tf_model_class(pt_model.config)\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    tf_model.load_weights(tf_checkpoint_path, by_name=True)\n\n    return load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 model in a pytorch model\n    \"\"\"\n    weights = tf_model.weights\n\n    return load_tf2_weights_in_pytorch_model(pt_model, weights, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_weights_in_pytorch_model(pt_model, tf_weights, allow_missing_keys=False):\n    \"\"\" Load TF2.0 symbolic weights in a PyTorch model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    new_pt_params_dict = {}\n    current_pt_params_dict = dict(pt_model.named_parameters())\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(pt_model.base_model_prefix) for s in current_pt_params_dict.keys()):\n        start_prefix_to_remove = pt_model.base_model_prefix + \".\"\n\n    # Build a map from potential PyTorch weight names to TF 2.0 Variables\n    tf_weights_map = {}\n    for tf_weight in tf_weights:\n        pt_name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            tf_weight.name, start_prefix_to_remove=start_prefix_to_remove\n        )\n        tf_weights_map[pt_name] = (tf_weight.numpy(), transpose)\n\n    all_tf_weights = set(list(tf_weights_map.keys()))\n    loaded_pt_weights_data_ptr = {}\n    missing_keys_pt = []\n    for pt_weight_name, pt_weight in current_pt_params_dict.items():\n        # Handle PyTorch shared weight ()not duplicated in TF 2.0\n        if pt_weight.data_ptr() in loaded_pt_weights_data_ptr:\n            new_pt_params_dict[pt_weight_name] = loaded_pt_weights_data_ptr[pt_weight.data_ptr()]\n            continue\n\n        # Find associated numpy array in pytorch model state dict\n        if pt_weight_name not in tf_weights_map:\n            if allow_missing_keys:\n                missing_keys_pt.append(pt_weight_name)\n                continue\n\n            raise AttributeError(\"{} not found in TF 2.0 model\".format(pt_weight_name))\n\n        array, transpose = tf_weights_map[pt_weight_name]\n\n        if transpose:\n            array = numpy.transpose(array)\n\n        if len(pt_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(pt_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(pt_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (pt_weight.shape, array.shape)\n            raise e\n\n        # logger.warning(\"Initialize PyTorch weight {}\".format(pt_weight_name))\n\n        new_pt_params_dict[pt_weight_name] = torch.from_numpy(array)\n        loaded_pt_weights_data_ptr[pt_weight.data_ptr()] = torch.from_numpy(array)\n        all_tf_weights.discard(pt_weight_name)\n\n    missing_keys, unexpected_keys = pt_model.load_state_dict(new_pt_params_dict, strict=False)\n    missing_keys += missing_keys_pt\n\n    if len(missing_keys) > 0:\n        logger.info(\n            \"Weights of {} not initialized from TF 2.0 model: {}\".format(pt_model.__class__.__name__, missing_keys)\n        )\n    if len(unexpected_keys) > 0:\n        logger.info(\n            \"Weights from TF 2.0 model not used in {}: {}\".format(pt_model.__class__.__name__, unexpected_keys)\n        )\n\n    logger.info(\"Weights or buffers not loaded from TF 2.0 model: {}\".format(all_tf_weights))\n\n    return pt_model\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass TFRobertaEmbeddings(TFBertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.padding_idx = 1\n\n    def create_position_ids_from_input_ids(self, x):\n        \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n        padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n        `utils.make_positions`.\n        :param tf.Tensor x:\n        :return tf.Tensor:\n        \"\"\"\n        mask = tf.cast(tf.math.not_equal(x, self.padding_idx), dtype=tf.int32)\n        incremental_indicies = tf.math.cumsum(mask, axis=1) * mask\n        return incremental_indicies + self.padding_idx\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. We cannot infer which are padded so just generate\n        sequential position ids.\n        :param tf.Tensor inputs_embeds:\n        :return tf.Tensor:\n        \"\"\"\n        seq_length = shape_list(inputs_embeds)[1]\n\n        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]\n        return position_ids\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. 
Any padded tokens remain padded.\n                position_ids = self.create_position_ids_from_input_ids(input_ids)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super()._embedding([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n\nclass TFRobertaMainLayer(TFBertMainLayer):\n    \"\"\"\n    Same as TFBertMainLayer but uses TFRobertaEmbeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFRobertaEmbeddings(config, name=\"embeddings\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n\nclass TFRobertaPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputing raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaModel(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaModel\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaModel.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n        return outputs\n\n\nclass TFRobertaLMHead(tf.keras.layers.Layer):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.act = tf.keras.layers.Activation(gelu)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, features):\n        x = self.dense(features)\n        x = self.act(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x, mode=\"linear\") + self.bias\n\n        return x\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. 
\"\"\", ROBERTA_START_DOCSTRING)\nclass TFRobertaForMaskedLM(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.lm_head = TFRobertaLMHead(config, self.roberta.embeddings, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForMaskedLM\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\nclass TFRobertaClassificationHead(tf.keras.layers.Layer):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.out_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"out_proj\"\n        )\n\n    def call(self, features, training=False):\n        x = features[:, 0, :]  # take <s> token (equiv. 
to [CLS])\n        x = self.dropout(x, training=training)\n        x = self.dense(x)\n        x = self.dropout(x, training=training)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForSequenceClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.classifier = TFRobertaClassificationHead(config, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForSequenceClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (logits,) + outputs[2:]\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForTokenClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForTokenClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint roberta-base is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForQuestionAnswering\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForQuestionAnswering.from_pretrained('roberta-base')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 T5 model. \"\"\"\n\n\nimport copy\nimport itertools\nimport logging\nimport math\n\nimport tensorflow as tf\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_T5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n####################################################\n# TF 2.0 Models are constructed using Keras imperative API by sub-classing\n# - tf.keras.layers.Layer for the layers and\n# - TFPreTrainedModel for the models (it-self a sub-class of tf.keras.Model)\n####################################################\n\n\nclass TFT5LayerNorm(tf.keras.layers.Layer):\n    def __init__(self, epsilon=1e-6, **kwargs):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__(**kwargs)\n        self.variance_epsilon = epsilon\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        self.weight = self.add_weight(\"weight\", shape=(input_shape[-1],), initializer=\"ones\")\n        super().build(input_shape)\n\n    def call(self, x):\n        variance = tf.math.reduce_mean(tf.math.square(x), axis=-1, keepdims=True)\n        x = x * tf.math.rsqrt(variance + self.variance_epsilon)\n        return self.weight * x\n\n\nclass TFT5DenseReluDense(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.wi = tf.keras.layers.Dense(config.d_ff, use_bias=False, name=\"wi\")\n        self.wo = tf.keras.layers.Dense(config.d_model, use_bias=False, name=\"wo\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n        self.act = tf.keras.activations.relu\n\n    def call(self, hidden_states, training=False):\n        h = self.wi(hidden_states)\n        h = self.act(h)\n        h = self.dropout(h, training=training)\n        h = self.wo(h)\n        return h\n\n\nclass TFT5LayerFF(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.DenseReluDense = TFT5DenseReluDense(config, name=\"DenseReluDense\")\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(self, hidden_states, training=False):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x, training=training)\n        layer_output = 
hidden_states + self.dropout(y, training=training)\n        return layer_output\n\n\nclass TFT5Attention(tf.keras.layers.Layer):\n    NEW_ID = itertools.count()\n\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_id = next(TFT5Attention.NEW_ID)\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"q\")\n        self.k = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"k\")\n        self.v = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"v\")\n        self.o = tf.keras.layers.Dense(self.d_model, use_bias=False, name=\"o\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = tf.keras.layers.Embedding(\n                self.relative_attention_num_buckets, self.n_heads, name=\"relative_attention_bias\",\n            )\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  
This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += tf.dtypes.cast(tf.math.less(n, 0), tf.int32) * num_buckets\n            n = tf.math.abs(n)\n        else:\n            n = tf.math.maximum(n, 0)\n        # now n is in the range [0, inf)\n        max_exact = num_buckets // 2\n        is_small = tf.math.less(n, max_exact)\n        val_if_large = max_exact + tf.dtypes.cast(\n            tf.math.log(tf.dtypes.cast(n, tf.float32) / max_exact)\n            / math.log(max_distance / max_exact)\n            * (num_buckets - max_exact),\n            tf.int32,\n        )\n        val_if_large = tf.math.minimum(val_if_large, num_buckets - 1)\n        ret += tf.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = tf.range(qlen)[:, None]\n        memory_position = tf.range(klen)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position, bidirectional=not self.is_decoder, num_buckets=self.relative_attention_num_buckets,\n        )\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = tf.expand_dims(tf.transpose(values, [2, 0, 1]), axis=0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def call(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        position_bias=None,\n        cache=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = shape_list(input)\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. 
Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + shape_list(past_key_value_state[0])[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = shape_list(kv)[1]\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, self.d_kv)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.inner_dim))\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = tf.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass TFT5LayerSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, 
**kwargs):\n        super().__init__(**kwargs)\n        self.SelfAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"SelfAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5LayerCrossAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.EncDecAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"EncDecAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            query_length=query_length,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5Block(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.is_decoder = config.is_decoder\n        self.layer = []\n        self.layer.append(\n            TFT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._0\",)\n        )\n        if self.is_decoder:\n            self.layer.append(\n                TFT5LayerCrossAttention(\n                    config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._1\",\n                )\n            )\n\n        self.layer.append(TFT5LayerFF(config, name=\"layer_._{}\".format(len(self.layer))))\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        
encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. 
Need to inject it here\n            if present_key_value_state is not None:\n                query_length = shape_list(present_key_value_state[0])[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n                training=training,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states, training=training)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass _NoLayerEmbedTokens(object):\n    \"\"\"\n     this class wraps a the TFSharedEmbeddingTokens layer into a python 'no-keras-layer'\n     class to avoid problem with weight restoring. Also it makes sure that the layer is\n     called from the correct scope to avoid problem with saving/storing the correct weights\n    \"\"\"\n\n    def __init__(self, layer, abs_scope_name=None):\n        self._layer = layer\n        self._abs_scope_name = abs_scope_name\n\n    def call(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer.call(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer.call(inputs, mode)\n\n    def __call__(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer(inputs, mode)\n\n\n####################################################\n# The full model without a specific pretrained or finetuning head is\n# provided as a tf.keras.layers.Layer usually called \"TFT5MainLayer\"\n####################################################\nclass TFT5MainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, embed_tokens=None, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.config = config\n        self.num_hidden_layers = config.num_layers\n\n        self.block = [\n            TFT5Block(config, has_relative_attention_bias=bool(i == 0), name=\"block_._{}\".format(i),)\n            for i in range(config.num_layers)\n        ]\n        self.final_layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"final_layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_embed_tokens(self, embed_tokens):\n        self.embed_tokens = embed_tokens\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError  # Not implemented yet in the library fr TF 2.0 models\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError  # Not implemented yet in the library fr TF 2.0 models\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if inputs is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both inputs and inputs_embeds at the same time\")\n        elif inputs is not None:\n            input_shape = shape_list(inputs)\n            inputs = tf.reshape(inputs, (-1, input_shape[-1]))\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either inputs or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to intialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(inputs)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_sates\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = shape_list(past_key_value_states[0][0])[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = tf.fill((batch_size, mask_seq_length), 1)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:\n            encoder_seq_length = shape_list(encoder_hidden_states)[1]\n            encoder_attention_mask = tf.fill((batch_size, encoder_seq_length), 1)\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n        num_dims_attention_mask = len(shape_list(attention_mask))\n        if 
num_dims_attention_mask == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif num_dims_attention_mask == 2:\n            # Provided a padding mask of dimensions [batch_size, mask_seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            if self.is_decoder:\n                seq_ids = tf.range(mask_seq_length)\n                causal_mask = tf.less_equal(\n                    tf.tile(seq_ids[None, None, :], (batch_size, mask_seq_length, 1)), seq_ids[None, :, None],\n                )\n                causal_mask = tf.cast(causal_mask, dtype=tf.float32)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n                if past_key_value_states[0] is not None:\n                    extended_attention_mask = extended_attention_mask[:, :, -1:, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n        # extended_attention_mask = tf.math.equal(extended_attention_mask,\n        #                                         tf.transpose(extended_attention_mask, perm=(-1, -2)))\n\n        extended_attention_mask = (1.0 - extended_attention_mask) * -1e9\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            # If a 2D ou 3D attention mask is provided for the cross-attention\n            # we need to make broadcastabe to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n            encoder_attention_mask = tf.cast(encoder_attention_mask, dtype=tf.float32)\n            num_dims_encoder_attention_mask = len(shape_list(encoder_attention_mask))\n            if num_dims_encoder_attention_mask == 3:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n            if num_dims_encoder_attention_mask == 2:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n\n            # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion\n            # Cf. 
https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n            # encoder_extended_attention_mask = tf.math.equal(encoder_extended_attention_mask,\n            #                                         tf.transpose(encoder_extended_attention_mask, perm=(-1, -2)))\n\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds, training=training)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n                training=training,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if 
use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n####################################################\n# TFT5PreTrainedModel is a sub-class of tf.keras.Model\n# which take care of loading and saving pretrained weights\n# and various common utilities.\n# Here you just need to specify a few (self-explanatory)\n# pointers for your model.\n####################################################\nclass TFT5PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        inputs = tf.constant(DUMMY_INPUTS)\n        input_mask = tf.constant(DUMMY_MASK)\n        dummy_inputs = {\n            \"inputs\": inputs,\n            \"decoder_input_ids\": inputs,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a tf.keras.Model `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. 
_`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    Note on the model inputs:\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is usefull when using `tf.keras.Model.fit()` method which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :\n\n        - a single Tensor with inputs only and nothing else: `model(inputs_ids)\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n            `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associaed to the input names given in the docstring:\n            `model({'inputs': inputs, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        inputs are usually used as a `dict` (see T5 description above for more information) containing all the following.\n\n        inputs (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on\n            the right or the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            To know more on how to prepare :obj:`inputs` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n        attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `inputs` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is 
**masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass TFT5Model(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5Model\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5Model.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my 
dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass TFT5ForConditionalGeneration(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.model_dim = config.d_model\n\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"tf\") 
 # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        prediction_scores = outputs[0]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        model.generate(inputs)\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0] * (self.model_dim ** -0.5)\n        embed_tokens = self.get_output_embeddings()\n        lm_logits = embed_tokens(sequence_output, mode=\"linear\")\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, inputs, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"inputs\": None,  # inputs don't have to be defined, but still need to be passed to make 
Keras.layer.__call__ happy\n            \"decoder_input_ids\": inputs,  # inputs are the decoder_input_ids\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (tf.gather(layer_past_state, beam_idx),)\n\n            assert shape_list(reordered_layer_past_states[0]) == shape_list(layer_past_states[0])\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Transformer XL model.\n\"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\nclass TFPositionalEmbedding(tf.keras.layers.Layer):\n    def __init__(self, demb, **kwargs):\n        super().__init__(**kwargs)\n\n        self.inv_freq = 1 / (10000 ** (tf.range(0, demb, 2.0) / demb))\n\n    def call(self, pos_seq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,j->ij\", pos_seq, self.inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], -1)\n\n        if bsz is not None:\n            return tf.tile(pos_emb[:, None, :], [1, bsz, 1])\n        else:\n            return pos_emb[:, None, :]\n\n\nclass TFPositionwiseFF(tf.keras.layers.Layer):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5, init_std=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.layer_1 = tf.keras.layers.Dense(\n            d_inner, kernel_initializer=get_initializer(init_std), activation=tf.nn.relu, name=\"CoreNet_._0\"\n        )\n        self.drop_1 = tf.keras.layers.Dropout(dropout)\n        self.layer_2 = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(init_std), name=\"CoreNet_._3\")\n        self.drop_2 = tf.keras.layers.Dropout(dropout)\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.pre_lnorm = pre_lnorm\n\n    def call(self, inp, training=False):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.layer_norm(inp)\n            core_out = self.layer_1(core_out)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = self.drop_2(core_out, training=training)\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.layer_1(inp)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = 
self.drop_2(core_out, training=training)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass TFRelPartialLearnableMultiHeadAttn(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = tf.keras.layers.Dense(\n            3 * n_head * d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"qkv_net\"\n        )\n\n        self.drop = tf.keras.layers.Dropout(dropout)\n        self.dropatt = tf.keras.layers.Dropout(dropatt)\n        self.o_net = tf.keras.layers.Dense(\n            d_model, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"o_net\"\n        )\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is not None and r_w_bias is not None:  # Biases are shared\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n        else:\n            self.r_r_bias = None\n            self.r_w_bias = None\n\n        self.r_net = tf.keras.layers.Dense(\n            self.n_head * self.d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"r_net\"\n        )\n\n    def build(self, input_shape):\n        if self.r_r_bias is None or self.r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n        super().build(input_shape)\n\n    def _rel_shift(self, x):\n        x_size = shape_list(x)\n\n        x = tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])\n        x = tf.reshape(x, [x_size[1] + 1, x_size[0], x_size[2], x_size[3]])\n        x = tf.slice(x, [1, 0, 0, 0], [-1, -1, -1, -1])\n        x = tf.reshape(x, x_size)\n\n        return x\n\n    def call(self, inputs, training=False):\n        w, r, attn_mask, mems, head_mask = inputs\n        qlen, rlen, bsz = shape_list(w)[0], shape_list(r)[0], shape_list(w)[1]\n\n        if mems is not None:\n            cat = tf.concat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, axis=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, 
axis=-1)\n\n        klen = shape_list(w_head_k)[0]\n\n        w_head_q = tf.reshape(w_head_q, (qlen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_k = tf.reshape(w_head_k, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_v = tf.reshape(w_head_v, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n\n        r_head_k = tf.reshape(r_head_k, (rlen, self.n_head, self.d_head))  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = tf.einsum(\"ibnd,jbnd->ijbn\", rw_head_q, w_head_k)  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = tf.einsum(\"ibnd,jnd->ijbn\", rr_head_q, r_head_k)  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score = attn_score * self.scale\n\n        # compute attention probability\n        if attn_mask is not None:\n            attn_mask_t = attn_mask[:, :, None, None]\n            attn_score = attn_score * (1 - attn_mask_t) - 1e30 * attn_mask_t\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n        attn_prob = self.dropatt(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, w_head_v)\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec_sizes = shape_list(attn_vec)\n        attn_vec = tf.reshape(attn_vec, (attn_vec_sizes[0], attn_vec_sizes[1], self.n_head * self.d_head))\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out, training=training)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass TFRelPartialLearnableDecoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        d_inner,\n        dropout,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        dropatt=0.0,\n        pre_lnorm=False,\n        r_w_bias=None,\n        r_r_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.dec_attn = TFRelPartialLearnableMultiHeadAttn(\n            n_head,\n            d_model,\n            d_head,\n            dropout,\n            tgt_len=tgt_len,\n            ext_len=ext_len,\n            mem_len=mem_len,\n            dropatt=dropatt,\n            pre_lnorm=pre_lnorm,\n            r_w_bias=r_w_bias,\n            r_r_bias=r_r_bias,\n            init_std=init_std,\n            output_attentions=output_attentions,\n            layer_norm_epsilon=layer_norm_epsilon,\n            name=\"dec_attn\",\n        )\n        self.pos_ff = TFPositionwiseFF(\n            d_model,\n            d_inner,\n            dropout,\n            pre_lnorm=pre_lnorm,\n            init_std=init_std,\n            layer_norm_epsilon=layer_norm_epsilon,\n            
name=\"pos_ff\",\n        )\n\n    def call(self, inputs, training=False):\n        dec_inp, r, dec_attn_mask, mems, head_mask = inputs\n        attn_outputs = self.dec_attn([dec_inp, r, dec_attn_mask, mems, head_mask], training=training)\n        ff_output = self.pos_ff(attn_outputs[0], training=training)\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass TFAdaptiveEmbedding(tf.keras.layers.Layer):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, init_std=0.02, sample_softmax=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.init_std = init_std\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = []\n        self.emb_projs = []\n        if div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(\n                    tf.keras.layers.Embedding(\n                        r_idx - l_idx,\n                        d_emb_i,\n                        embeddings_initializer=get_initializer(init_std),\n                        name=\"emb_layers_._{}\".format(i),\n                    )\n                )\n\n    def build(self, input_shape):\n        for i in range(len(self.cutoffs)):\n            d_emb_i = self.d_embed // (self.div_val ** i)\n            self.emb_projs.append(\n                self.add_weight(\n                    shape=(d_emb_i, self.d_proj),\n                    initializer=get_initializer(self.init_std),\n                    trainable=True,\n                    name=\"emb_projs_._{}\".format(i),\n                )\n            )\n        super().build(input_shape)\n\n    def call(self, inp):\n        if self.div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            inp_flat = tf.reshape(inp, (-1,))\n            emb_flat = tf.zeros([shape_list(inp_flat)[0], self.d_proj])\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n\n                inp_i = tf.boolean_mask(inp_flat, mask_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = tf.einsum(\"id,de->ie\", emb_i, self.emb_projs[i])\n\n                mask_idx = tf.cast(tf.where(mask_i), dtype=tf.int64)\n                emb_flat += tf.scatter_nd(mask_idx, emb_i, tf.cast(shape_list(emb_flat), dtype=tf.int64))\n\n            embed_shape = shape_list(inp) + [self.d_proj]\n            embed = tf.reshape(emb_flat, embed_shape)\n\n        embed *= self.emb_scale\n\n        return embed\n\n\n@keras_serializable\nclass TFTransfoXLMainLayer(tf.keras.layers.Layer):\n    config_class = TransfoXLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.untie_r = config.untie_r\n\n        self.word_emb = TFAdaptiveEmbedding(\n            config.vocab_size,\n            config.d_embed,\n            config.d_model,\n            config.cutoffs,\n            div_val=config.div_val,\n            init_std=config.init_std,\n            name=\"word_emb\",\n        )\n\n        self.drop = tf.keras.layers.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        self.layers = []\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    TFRelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if self.untie_r else self.r_w_bias,\n                        r_r_bias=None if self.untie_r else self.r_r_bias,\n                        output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                        init_std=config.init_std,\n                        name=\"layers_._{}\".format(i),\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = TFPositionalEmbedding(self.d_model, name=\"pos_emb\")\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n    def build(self, input_shape):\n        if not self.untie_r:\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n        super().build(input_shape)\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        return self.word_emb\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        raise NotImplementedError\n\n    def init_mems(self, bsz):\n        if self.mem_len 
> 0:\n            mems = []\n            for i in range(self.n_layer):\n                empty = tf.zeros([self.mem_len, bsz, self.d_model])\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        new_mems = []\n        end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n        beg_idx = max(0, end_idx - self.mem_len)\n        for i in range(len(hids)):\n\n            cat = tf.concat([mems[i], hids[i]], axis=0)\n            tf.stop_gradient(cat)\n            new_mems.append(cat[beg_idx:end_idx])\n\n        return new_mems\n\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = shape_list(mems[0])[0] if mems is not None else 0\n        
klen = mlen + qlen\n\n        attn_mask = tf.ones([qlen, qlen])\n        mask_u = tf.linalg.band_part(attn_mask, 0, -1)\n        mask_dia = tf.linalg.band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen])\n        dec_attn_mask = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.linalg.band_part(attn_mask, -1, 0)\n            dec_attn_mask = tf.concat([dec_attn_mask[:, :qlen] + mask_l - mask_dia, dec_attn_mask[:, qlen:]], 1)\n        # ::: PyTorch masking code for reference :::\n        # if self.same_length:\n        #     all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n        #     mask_len = klen - self.mem_len\n        #     if mask_len > 0:\n        #         mask_shift_len = qlen - mask_len\n        #     else:\n        #         mask_shift_len = qlen\n        #     dec_attn_mask = (torch.triu(all_ones, 1+mlen)\n        #             + torch.tril(all_ones, -mask_shift_len))[:, :, None] # -1\n        # else:\n        #     dec_attn_mask = torch.triu(\n        #         word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1+mlen)[:,:,None]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = tf.range(klen - 1, -1, -1.0)\n            if self.clamp_len > 0:\n                pos_seq = tf.minimum(pos_seq, self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb, training=training)\n            pos_emb = self.drop(pos_emb, training=training)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer([core_out, pos_emb, dec_attn_mask, mems_i, head_mask[i]], training=training)\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out, training=training)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [tf.transpose(core_out, perm=(1, 0, 2)), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(tf.transpose(t, perm=(1, 0, 2)) for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs.append(attentions)\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\nclass TFTransfoXLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = TransfoXLConfig\n    base_model_prefix = \"transformer\"\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFTransfoXLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLLMHeadModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n        self.sample_softmax = config.sample_softmax\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = TFAdaptiveSoftmaxMask(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val, name=\"crit\"\n        )\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if len(self.crit.out_layers) > 0:\n            return self.crit.out_layers[-1]\n        return None\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, labels=None, training=False):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLLMHeadModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            labels = inputs[4] if len(inputs) > 4 else labels\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = 
inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            labels = inputs.get(\"labels\", labels)\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            bsz, tgt_len = shape_list(input_ids)[:2]\n        else:\n            bsz, tgt_len = shape_list(inputs_embeds)[:2]\n\n        transformer_outputs = self.transformer([input_ids, mems, head_mask, inputs_embeds], training=training)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit([pred_hid, labels], training=training)\n        outputs = [softmax_output] + outputs\n\n        return outputs  # logits, new_mems, (all hidden states), (all attentions)\n\n    def prepare_inputs_for_generation(self, inputs, past, **model_kwargs):\n        inputs = {\"inputs\": inputs}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" A TF 2.0 Adaptive Softmax for Transformer XL model.\n\"\"\"\n\n\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import shape_list\n\n\nclass TFAdaptiveSoftmaxMask(tf.keras.layers.Layer):\n    def __init__(self, vocab_size, d_embed, d_proj, cutoffs, div_val=1, keep_order=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [vocab_size]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n        self.keep_order = keep_order\n\n        self.out_layers = []\n        self.out_projs = []\n\n    def build(self, input_shape):\n        if self.n_clusters > 0:\n            self.cluster_weight = self.add_weight(\n                shape=(self.n_clusters, self.d_embed), initializer=\"zeros\", trainable=True, name=\"cluster_weight\"\n            )\n            self.cluster_bias = self.add_weight(\n                shape=(self.n_clusters,), initializer=\"zeros\", trainable=True, name=\"cluster_bias\"\n            )\n\n        if self.div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if self.d_proj != self.d_embed:\n                    weight = self.add_weight(\n                        shape=(self.d_embed, self.d_proj),\n                        initializer=\"zeros\",\n                        trainable=True,\n                        name=\"out_projs_._{}\".format(i),\n                    )\n                    self.out_projs.append(weight)\n                else:\n                    self.out_projs.append(None)\n                weight = self.add_weight(\n                    shape=(self.vocab_size, self.d_embed,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(self.vocab_size,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = self.d_embed // (self.div_val ** i)\n\n                weight = self.add_weight(\n                    shape=(d_emb_i, self.d_proj), initializer=\"zeros\", trainable=True, name=\"out_projs_._{}\".format(i)\n                )\n                
self.out_projs.append(weight)\n                weight = self.add_weight(\n                    shape=(r_idx - l_idx, d_emb_i,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(r_idx - l_idx,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        super().build(input_shape)\n\n    @staticmethod\n    def _logit(x, W, b, proj=None):\n        y = x\n        if proj is not None:\n            y = tf.einsum(\"ibd,ed->ibe\", y, proj)\n        return tf.einsum(\"ibd,nd->ibn\", y, W) + b\n\n    @staticmethod\n    def _gather_logprob(logprob, target):\n        lp_size = shape_list(logprob)\n        r = tf.range(lp_size[0])\n        idx = tf.stack([r, target], 1)\n        return tf.gather_nd(logprob, idx)\n\n    def call(self, inputs, return_mean=True, training=False):\n        hidden, target = inputs\n        head_logprob = 0\n        if self.n_clusters == 0:\n            output = self._logit(hidden, self.out_layers[0][0], self.out_layers[0][1], self.out_projs[0])\n            if target is not None:\n                loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=output)\n            out = tf.nn.log_softmax(output, axis=-1)\n        else:\n            hidden_sizes = shape_list(hidden)\n            out = []\n            loss = tf.zeros(hidden_sizes[:2], dtype=tf.float32)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                if target is not None:\n                    mask = (target >= l_idx) & (target < r_idx)\n                    mask_idx = tf.where(mask)\n                    cur_target = tf.boolean_mask(target, mask) - l_idx\n\n                if self.div_val == 1:\n                    cur_W = self.out_layers[0][0][l_idx:r_idx]\n                    cur_b = self.out_layers[0][1][l_idx:r_idx]\n                else:\n                    cur_W = self.out_layers[i][0]\n                    cur_b = self.out_layers[i][1]\n\n                if i == 0:\n                    cur_W = tf.concat([cur_W, self.cluster_weight], 0)\n                    cur_b = tf.concat([cur_b, self.cluster_bias], 0)\n\n                    head_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[0])\n                    head_logprob = tf.nn.log_softmax(head_logit)\n                    out.append(head_logprob[..., : self.cutoffs[0]])\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_logprob = self._gather_logprob(cur_head_logprob, cur_target)\n                else:\n                    tail_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[i])\n                    tail_logprob = tf.nn.log_softmax(tail_logit)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    logprob_i = head_logprob[..., cluster_prob_idx, None] + tail_logprob\n                    out.append(logprob_i)\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_tail_logprob = tf.boolean_mask(tail_logprob, mask)\n               
         cur_logprob = self._gather_logprob(cur_tail_logprob, cur_target)\n                        cur_logprob += cur_head_logprob[:, self.cutoff_ends[1] + i - 1]\n                if target is not None:\n                    loss += tf.scatter_nd(mask_idx, -cur_logprob, tf.cast(shape_list(loss), dtype=tf.int64))\n            out = tf.concat(out, axis=-1)\n\n        if target is not None:\n            if return_mean:\n                loss = tf.reduce_mean(loss)\n            # Add the training-time loss value to the layer using `self.add_loss()`.\n            self.add_loss(loss)\n\n            # Log the loss as a metric (we could log arbitrary metrics,\n            # including different metrics for training and inference.\n            self.add_metric(loss, name=self.name, aggregation=\"mean\" if return_mean else \"\")\n\n        return out\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"TF general model utils.\"\"\"\nimport functools\nimport logging\nimport os\n\nimport h5py\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.python.keras.saving import hdf5_format\n\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import DUMMY_INPUTS, TF2_WEIGHTS_NAME, WEIGHTS_NAME, cached_path, hf_bucket_url, is_remote_url\nfrom .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFModelUtilsMixin:\n    \"\"\"\n    A few utilities for `tf.keras.Model`s, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the model.\n        \"\"\"\n        if only_trainable:\n            return int(sum(np.prod(w.shape.as_list()) for w in self.trainable_variables))\n        else:\n            return self.count_params()\n\n\ndef keras_serializable(cls):\n    \"\"\"\n    Decorate a Keras Layer class to support Keras serialization.\n\n    This is done by:\n    1. adding a `transformers_config` dict to the Keras config dictionary in `get_config` (called by Keras at\n       serialization time\n    2. wrapping `__init__` to accept that `transformers_config` dict (passed by Keras at deserialization time) and\n       convert it to a config object for the actual layer initializer\n    3. 
registering the class as a custom object in Keras (if the Tensorflow version supports this), so that it does\n       not need to be supplied in `custom_objects` in the call to `tf.keras.models.load_model`\n\n    :param cls: a tf.keras.layers.Layers subclass that accepts a `config` argument to its initializer (typically a\n                `TF*MainLayer` class in this project)\n    :return: the same class object, with modifications for Keras deserialization.\n    \"\"\"\n    initializer = cls.__init__\n\n    config_class = getattr(cls, \"config_class\", None)\n    if config_class is None:\n        raise AttributeError(\"Must set `config_class` to use @keras_serializable\")\n\n    @functools.wraps(initializer)\n    def wrapped_init(self, *args, **kwargs):\n        transformers_config = kwargs.pop(\"transformers_config\", None)\n        config = args[0] if args and isinstance(args[0], PretrainedConfig) else kwargs.get(\"config\", None)\n        if config is not None and transformers_config is not None:\n            raise ValueError(\"Must pass either `config` or `transformers_config`, not both\")\n        elif config is not None:\n            # normal layer construction, call with unchanged args (config is already in there)\n            initializer(self, *args, **kwargs)\n        elif transformers_config is not None:\n            # Keras deserialization, convert dict to config\n            config = config_class.from_dict(transformers_config)\n            initializer(self, config, *args, **kwargs)\n        else:\n            raise ValueError(\"Must pass either `config` (PretrainedConfig) or `transformers_config` (dict)\")\n        self._transformers_config = config\n\n    cls.__init__ = wrapped_init\n\n    if not hasattr(cls, \"get_config\"):\n        raise TypeError(\"Only use @keras_serializable on tf.keras.layers.Layer subclasses\")\n    if hasattr(cls.get_config, \"_is_default\"):\n\n        def get_config(self):\n            cfg = super(cls, self).get_config()\n            cfg[\"transformers_config\"] = self._transformers_config.to_dict()\n            return cfg\n\n        cls.get_config = get_config\n\n    cls._keras_serializable = True\n    if hasattr(tf.keras.utils, \"register_keras_serializable\"):\n        cls = tf.keras.utils.register_keras_serializable()(cls)\n    return cls\n\n\nclass TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):\n    r\"\"\" Base class for all TF models.\n\n        :class:`~transformers1.TFPreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top 
of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. \"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def _get_resized_embeddings(self, old_embeddings, new_num_tokens=None):\n        \"\"\" Build a resized Embedding Variable from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``tf.Variable``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        # if new_num_tokens is None:\n        #     return old_embeddings\n\n        # old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        # if old_num_tokens == new_num_tokens:\n        #     return old_embeddings\n\n        # # Build new embeddings\n        # new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        # new_embeddings.to(old_embeddings.weight.device)\n\n        # # initialize all new embeddings (in particular added tokens)\n        # self._init_weights(new_embeddings)\n\n        # # Copy token embeddings from the previous weights\n        # num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        # new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        # return new_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens=None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights 
embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``tf.Variable`` Module of the model.\n\n        Return: ``tf.Variable``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        raise NotImplementedError\n\n    def prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n        \"\"\"\n        raise NotImplementedError\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the :func:`~transformers1.PreTrainedModel.from_pretrained` class method.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Save configuration file\n        self.config.save_pretrained(save_directory)\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, TF2_WEIGHTS_NAME)\n        self.save_weights(output_model_file)\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained TF 2.0 model from a pre-trained model configuration.\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch state_dict save file` (e.g. `./pt_model/pytorch_model.bin`). In this case, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the PyTorch checkpoint in a TensorFlow model using the provided conversion scripts and loading the TensorFlow model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                    - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                    - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            from_pt: (`optional`) boolean, default False:\n                Load the model weights from a PyTorch state_dict save file (see docstring of pretrained_model_name_or_path argument).\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_pt = kwargs.pop(\"from_pt\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_pt` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME], pretrained_model_name_or_path\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(WEIGHTS_NAME if from_pt else TF2_WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n 
                   cache_dir=cache_dir,\n                    force_download=force_download,\n                    resume_download=resume_download,\n                    proxies=proxies,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {TF2_WEIGHTS_NAME}, {WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if from_pt:\n            # Load from a PyTorch checkpoint\n            return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file, allow_missing_keys=True)\n\n        model(model.dummy_inputs, training=False)  # build the network with dummy inputs\n\n        assert os.path.isfile(resolved_archive_file), \"Error retrieving file {}\".format(resolved_archive_file)\n        # 'by_name' allow us to do transfer learning by skipping/adding layers\n        # see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357\n        try:\n            model.load_weights(resolved_archive_file, by_name=True)\n        except OSError:\n            raise OSError(\n                \"Unable to load weights from h5 file. \"\n                \"If you tried to load a TF 2.0 model from a PyTorch checkpoint, please set from_pt=True. 
\"\n            )\n\n        model(model.dummy_inputs, training=False)  # Make sure restore ops are run\n\n        # Check if the models are the same to output loading informations\n        with h5py.File(resolved_archive_file, \"r\") as f:\n            if \"layer_names\" not in f.attrs and \"model_weights\" in f:\n                f = f[\"model_weights\"]\n            hdf5_layer_names = set(hdf5_format.load_attributes_from_hdf5_group(f, \"layer_names\"))\n        model_layer_names = set(layer.name for layer in model.layers)\n        missing_keys = list(model_layer_names - hdf5_layer_names)\n        unexpected_keys = list(hdf5_layer_names - model_layer_names)\n        error_msgs = []\n\n        if len(missing_keys) > 0:\n            logger.info(\n                \"Layers of {} not initialized from pretrained model: {}\".format(model.__class__.__name__, missing_keys)\n            )\n        if len(unexpected_keys) > 0:\n            logger.info(\n                \"Layers from pretrained model not used in {}: {}\".format(model.__class__.__name__, unexpected_keys)\n            )\n        if len(error_msgs) > 0:\n            raise RuntimeError(\n                \"Error(s) in loading weights for {}:\\n\\t{}\".format(model.__class__.__name__, \"\\n\\t\".join(error_msgs))\n            )\n        if output_loading_info:\n            loading_info = {\"missing_keys\": missing_keys, \"unexpected_keys\": unexpected_keys, \"error_msgs\": error_msgs}\n            return model, loading_info\n\n        return model\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        return {\"inputs\": inputs}\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def generate(\n        self,\n        input_ids=None,\n        max_length=None,\n        min_length=None,\n        do_sample=None,\n        early_stopping=None,\n        num_beams=None,\n        temperature=None,\n        top_k=None,\n        top_p=None,\n        repetition_penalty=None,\n        bad_words_ids=None,\n        bos_token_id=None,\n        pad_token_id=None,\n        eos_token_id=None,\n        length_penalty=None,\n        no_repeat_ngram_size=None,\n        num_return_sequences=None,\n        attention_mask=None,\n        decoder_start_token_id=None,\n        use_cache=None,\n    ):\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy or penalized greedy decoding, sampling with top-k or nucleus sampling\n        and beam-search.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. _`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `tf.Tensor` of `dtype=tf.int32` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `tf.Tensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between 1 and infinity. 
Default to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.\n\n            temperature: (`optional`) float\n                The value used to module the next token probabilities. Must be strictely positive. Default to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.\n\n            bos_token_id: (`optional`) int\n                Beginning of sentence token if no prompt is provided. Default to specicic model bos_token_id or None if it does not exist.\n\n            pad_token_id: (`optional`) int\n                Pad token. Defaults to pad_token_id as defined in the models config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to eos_token_id as defined in the models config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Default to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. Default to 1.\n\n            attention_mask (`optional`) obj: `tf.Tensor` with `dtype=tf.int32` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. 
Defaults to `True`.\n\n        Return:\n\n            output: `tf.Tensor` of `dtype=tf.int32` shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be 
generated\n        \"\"\"\n\n        # We cannot generate if the model does not have a LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have a LM Head.\"\n                \"Please use another model class (e.g. `TFOpenAIGPTLMHeadModel`, `TFXLNetLMHeadModel`, `TFGPT2LMHeadModel`, `TFCTRLLMHeadModel`, `TFT5ForConditionalGeneration`, `TFTransfoXLLMHeadModel`)\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = shape_list(input_ids)[0]  # overriden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, \"`max_length` should be a strictely positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictely positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictely positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), 
\"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictely positive.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictely positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = tf.fill((batch_size, 1), bos_token_id)\n        else:\n            assert len(shape_list(input_ids)) == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids.numpy()):\n            attention_mask = tf.cast(tf.math.not_equal(input_ids, pad_token_id), dtype=tf.int32)\n        elif attention_mask is None:\n            attention_mask = tf.ones_like(input_ids)\n\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        cur_len = shape_list(input_ids)[1]\n        vocab_size = self.config.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = shape_list(input_ids)[-1]\n            input_ids = tf.broadcast_to(\n                tf.expand_dims(input_ids, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            attention_mask = tf.broadcast_to(\n                tf.expand_dims(attention_mask, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            input_ids = tf.reshape(\n                input_ids, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = tf.reshape(\n                attention_mask, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n\n            # create empty decoder_input_ids\n            input_ids = tf.ones((effective_batch_size * num_beams, 1), dtype=tf.int32,) * decoder_start_token_id\n            cur_len = 1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = tf.reshape(\n                
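# e.g. batch_size=2 and num_beams * effective_batch_mult=3 gives [0, 0, 0, 1, 1, 1], so each expanded row later gathers the encoder output of its source example\n                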
tf.repeat(tf.expand_dims(tf.range(batch_size), -1), repeats=num_beams * effective_batch_mult, axis=1),\n                shape=(-1,),\n            )\n            # expand encoder_outputs\n            encoder_outputs = (tf.gather(encoder_outputs[0], expanded_batch_idxs, axis=0), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = shape_list(input_ids)[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequence are generated independantly.\n        \"\"\"\n\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = tf.ones_like(input_ids[:, 0])\n        sent_lengths = tf.ones_like(input_ids[:, 0]) * max_length\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n         
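   # autoregressive loop: run the model on the current prefix (reusing cached past states when available), adjust the next-token logits, then append one token per sequence\n         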
   model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [batch_size, vocab_size])\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, eos_token_indices_mask, -float(\"inf\")\n                )\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = tf_top_k_top_p_filtering(next_token_logits, 
top_k=top_k, top_p=top_p)\n                # Sample\n                next_token = tf.squeeze(\n                    tf.random.categorical(next_token_logits, dtype=tf.int32, num_samples=1), axis=1\n                )\n            else:\n                # Greedy decoding\n                next_token = tf.math.argmax(next_token_logits, axis=-1, output_type=tf.int32)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = tf.concat([input_ids, tf.expand_dims(tokens_to_add, -1)], 1)\n            cur_len = cur_len + 1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = tf.math.multiply(\n                    unfinished_sents, tf.cast(eos_in_sents, tf.int32)\n                )\n                sent_lengths = (\n                    sent_lengths * (1 - is_sents_unfinished_and_token_to_add_is_eos)\n                    + cur_len * is_sents_unfinished_and_token_to_add_is_eos\n                )\n\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents -= is_sents_unfinished_and_token_to_add_is_eos\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if tf.math.reduce_max(unfinished_sents) == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        min_sent_length = tf.math.reduce_min(sent_lengths)\n        max_sent_length = tf.math.reduce_max(sent_lengths)\n        if min_sent_length != max_sent_length:\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            padding = tf.ones([batch_size, max_sent_length.numpy()], dtype=tf.int32) * pad_token_id\n\n            # create length masks for tf.where operation\n            broad_casted_sent_lengths = tf.broadcast_to(\n                tf.expand_dims(sent_lengths, -1), [batch_size, max_sent_length]\n            )\n            broad_casted_range = tf.transpose(\n                tf.broadcast_to(tf.expand_dims(tf.range(max_sent_length), -1), [max_sent_length, batch_size])\n            )\n\n            decoded = tf.where(broad_casted_range < broad_casted_sent_lengths, input_ids, padding)\n        else:\n            decoded = input_ids\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        
bos_token_id,\n        pad_token_id,\n        decoder_start_token_id,\n        eos_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores_begin = tf.zeros((batch_size, 1), dtype=tf.float32)\n            beam_scores_end = tf.ones((batch_size, num_beams - 1), dtype=tf.float32) * (-1e9)\n            beam_scores = tf.concat([beam_scores_begin, beam_scores_end], -1)\n        else:\n            beam_scores = tf.zeros((batch_size, num_beams), dtype=tf.float32)\n\n        beam_scores = tf.reshape(beam_scores, (batch_size * num_beams,))\n\n        # cache compute states\n        past = encoder_outputs\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            # Temperature (higher temperature => more likely to sample low probability tokens)\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            #             calculate log softmax score\n            scores = tf.nn.log_softmax(next_token_logits, axis=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask\n                num_batch_hypotheses = batch_size * num_beams\n\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [num_batch_hypotheses, vocab_size])\n\n                scores = set_tensor_by_indices_to_value(scores, eos_token_indices_mask, -float(\"inf\"))\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: 
https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                num_batch_hypotheses = batch_size * num_beams\n                banned_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            assert shape_list(scores) == [batch_size * num_beams, vocab_size]\n\n            if do_sample:\n                _scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # Top-p/top-k filtering\n                _scores = tf_top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                _scores = tf.reshape(_scores, (batch_size, num_beams * vocab_size))\n\n                next_tokens = tf.random.categorical(\n                    _scores, dtype=tf.int32, num_samples=2 * num_beams\n                )  # (batch_size, 2 * num_beams)\n                # Compute next scores\n                next_scores = tf.gather(_scores, next_tokens, batch_dims=1)  # (batch_size, 2 * num_beams)\n\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores_indices = tf.argsort(next_scores, direction=\"DESCENDING\", axis=1)\n                next_scores = tf.gather(next_scores, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n                next_tokens = tf.gather(next_tokens, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n            else:\n                # Add the log prob of the new beams to the log prob of the beginning of the sequence (sum of logs == log of the product)\n                next_scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                
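# flattening to (batch_size, num_beams * vocab_size) means candidate index i encodes beam_id = i // vocab_size and token_id = i % vocab_size; both are recovered below when building next_batch_beam\n                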
next_scores = tf.reshape(\n                    next_scores, (batch_size, num_beams * vocab_size)\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = tf.math.top_k(next_scores, k=2 * num_beams, sorted=True)\n\n            assert shape_list(next_scores) == shape_list(next_tokens) == [batch_size, 2 * num_beams]\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.numpy() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            tf.identity(input_ids[effective_beam_id]), beam_token_score.numpy()\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n                    tf.reduce_max(next_scores[batch_idx]).numpy(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = tf.convert_to_tensor([x[0] for 
x in next_batch_beam], dtype=tf.float32)\n            beam_tokens = tf.convert_to_tensor([x[1] for x in next_batch_beam], dtype=tf.int32)\n            beam_idx = tf.convert_to_tensor([x[2] for x in next_batch_beam], dtype=tf.int32)\n\n            # re-order batch and update current length\n            input_ids = tf.stack([tf.identity(input_ids[x, :]) for x in beam_idx])\n            input_ids = tf.concat([input_ids, tf.expand_dims(beam_tokens, 1)], axis=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            # Add all open beam hypothesis to generated_hyps\n            if done[batch_idx]:\n                continue\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).numpy().item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert tf.reduce_all(\n                    next_scores[batch_idx, :num_beams] == tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].numpy().item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths_list = []\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                best_hyp = sorted_hyps.pop()[1]\n                sent_lengths_list.append(len(best_hyp))\n                best.append(best_hyp)\n        assert output_batch_size == len(best), \"Output batch size {} must match output beam hypotheses {}\".format(\n            output_batch_size, len(best)\n        )\n\n        sent_lengths = tf.convert_to_tensor(sent_lengths_list, dtype=tf.int32)\n\n        # shorter batches are filled with pad_token\n        if tf.reduce_min(sent_lengths).numpy() != tf.reduce_max(sent_lengths).numpy():\n            assert pad_token_id is not None, \"`Pad_token_id` 
has to be defined\"\n            sent_max_len = min(tf.reduce_max(sent_lengths).numpy() + 1, max_length)\n            decoded_list = []\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                assert sent_lengths[i] == shape_list(hypo)[0]\n                # if sent_length is max_len do not pad\n                if sent_lengths[i] == sent_max_len:\n                    decoded_slice = hypo\n                else:\n                    # else pad to sent_max_len\n                    num_pad_tokens = sent_max_len - sent_lengths[i]\n                    padding = pad_token_id * tf.ones((num_pad_tokens,), dtype=tf.int32)\n                    decoded_slice = tf.concat([hypo, padding], axis=-1)\n\n                    # finish sentence with EOS token\n                    if sent_lengths[i] < max_length:\n                        decoded_slice = tf.where(\n                            tf.range(sent_max_len, dtype=tf.int32) == sent_lengths[i],\n                            eos_token_id * tf.ones((sent_max_len,), dtype=tf.int32),\n                            decoded_slice,\n                        )\n                # add to list\n                decoded_list.append(decoded_slice)\n\n            decoded = tf.stack(decoded_list)\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = tf.stack(best)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        return tuple(tf.gather(layer_past, beam_idx, axis=1) for layer_past in past)\n\n\ndef _create_next_token_logits_penalties(input_ids, logits, repetition_penalty):\n    # create logit penalties for already seen input_ids\n    token_penalties = np.ones(shape_list(logits))\n    prev_input_ids = [np.unique(input_id) for input_id in input_ids.numpy()]\n    for i, prev_input_id in enumerate(prev_input_ids):\n        logit_penalized = logits[i].numpy()[prev_input_id]\n        logit_penalties = np.zeros(logit_penalized.shape)\n        # if previous logit score is < 0 then multiply repetition penalty else divide\n        logit_penalties[logit_penalized < 0] = repetition_penalty\n        logit_penalties[logit_penalized > 0] = 1 / repetition_penalty\n        np.put(token_penalties[i], prev_input_id, logit_penalties)\n    return tf.convert_to_tensor(token_penalties, dtype=tf.float32)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):\n    # Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].numpy().tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].numpy().tolist())\n        return 
generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids, bad_words_ids):\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.numpy().tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef tf_top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float(\"Inf\"), min_tokens_to_keep=1):\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    logits_shape = shape_list(logits)\n\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits_shape[-1])  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < tf.math.top_k(logits, k=top_k)[0][..., -1, None]\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n\n    if top_p < 1.0:\n        sorted_indices = tf.argsort(logits, direction=\"DESCENDING\")\n        sorted_logits = tf.gather(\n            logits, sorted_indices, axis=-1, batch_dims=1\n        )  # expects logits to be of dim (batch_size, vocab_size)\n\n        cumulative_probs = tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove = tf.concat(\n                [\n                    tf.zeros_like(sorted_indices_to_remove[:, :min_tokens_to_keep]),\n                    sorted_indices_to_remove[:, min_tokens_to_keep:],\n                ],\n                -1,\n            )\n\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove = tf.roll(sorted_indices_to_remove, 1, axis=-1)\n        sorted_indices_to_remove = tf.concat(\n            [tf.zeros_like(sorted_indices_to_remove[:, :1]), sorted_indices_to_remove[:, 1:]], -1,\n        )\n        # scatter sorted tensors to original indexing\n        indices_to_remove = scatter_values_on_batch_indices(sorted_indices_to_remove, sorted_indices)\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n    return logits\n\n\ndef scatter_values_on_batch_indices(values, batch_indices):\n    shape = shape_list(batch_indices)\n    # broadcast batch dim to shape\n    broad_casted_batch_dims = tf.reshape(tf.broadcast_to(tf.expand_dims(tf.range(shape[0]), axis=-1), shape), [1, -1])\n    # transform batch_indices to pair_indices\n    pair_indices = tf.transpose(tf.concat([broad_casted_batch_dims, tf.reshape(batch_indices, [1, -1])], 0))\n    # scatter values to pair indices\n    return tf.scatter_nd(pair_indices, tf.reshape(values, [-1]), shape)\n\n\ndef set_tensor_by_indices_to_value(tensor, indices, value):\n    # create value_tensor since tensor value assignment is not possible in TF\n    value_tensor = tf.zeros_like(tensor) + value\n    return tf.where(indices, value_tensor, tensor)\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, 
sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass TFConv1D(tf.keras.layers.Layer):\n    def __init__(self, nf, nx, initializer_range=0.02, **kwargs):\n        \"\"\" TFConv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__(**kwargs)\n        self.nf = nf\n        self.nx = nx\n        self.initializer_range = initializer_range\n\n    def build(self, input_shape):\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.nx, self.nf], initializer=get_initializer(self.initializer_range)\n        )\n        self.bias = self.add_weight(\"bias\", shape=[1, self.nf], initializer=tf.zeros_initializer())\n\n    def call(self, x):\n        bz, sl = shape_list(x)[:2]\n\n        x = tf.reshape(x, [-1, self.nx])\n        x = tf.matmul(x, self.weight) + self.bias\n\n        x = tf.reshape(x, [bz, sl, self.nf])\n\n        return x\n\n\nclass TFSharedEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct shared token embeddings.\n    \"\"\"\n\n    def __init__(self, vocab_size, hidden_size, initializer_range=None, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.initializer_range = hidden_size ** -0.5 if initializer_range is None else initializer_range\n\n    def build(self, input_shape):\n        \"\"\"Build shared token embedding layer\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.vocab_size, self.hidden_size], initializer=get_initializer(self.initializer_range)\n        )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\"):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                
shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, input_ids):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        return tf.gather(self.weight, input_ids)\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [..., hidden_size]\n            Returns:\n                float32 tensor with shape [..., vocab_size].\n        \"\"\"\n        first_dims = shape_list(inputs)[:-1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.weight, transpose_b=True)\n\n        return tf.reshape(logits, first_dims + [self.vocab_size])\n\n\nclass TFSequenceSummary(tf.keras.layers.Layer):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config, initializer_range=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.summary_type = config.summary_type if hasattr(config, \"summary_use_proj\") else \"last\"\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. 
https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.has_summary = hasattr(config, \"summary_use_proj\") and config.summary_use_proj\n        if self.has_summary:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = tf.keras.layers.Dense(\n                num_classes, kernel_initializer=get_initializer(initializer_range), name=\"summary\"\n            )\n\n        self.has_activation = hasattr(config, \"summary_activation\") and config.summary_activation == \"tanh\"\n        if self.has_activation:\n            self.activation = tf.keras.activations.tanh\n\n        self.has_first_dropout = hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0\n        if self.has_first_dropout:\n            self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)\n\n        self.has_last_dropout = hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0\n        if self.has_last_dropout:\n            self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)\n\n    def call(self, inputs, training=False):\n        \"\"\" hidden_states: float Tensor in shape [bsz, seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if not isinstance(inputs, (dict, tuple, list)):\n            hidden_states = inputs\n            cls_index = None\n        elif isinstance(inputs, (tuple, list)):\n            hidden_states = inputs[0]\n            cls_index = inputs[1] if len(inputs) > 1 else None\n            assert len(inputs) <= 2, \"Too many inputs.\"\n        else:\n            hidden_states = inputs.get(\"hidden_states\")\n            cls_index = inputs.get(\"cls_index\", None)\n\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = tf.reduce_mean(hidden_states, axis=1)\n        elif self.summary_type == \"cls_index\":\n            hidden_shape = shape_list(hidden_states)  # e.g. 
[batch, num choices, seq length, hidden dims]\n            if cls_index is None:\n                cls_index = tf.fill(\n                    hidden_shape[:-2], hidden_shape[-2] - 1\n                )  # A tensor full of shape [batch] or [batch, num choices] full of sequence length\n            cls_shape = shape_list(cls_index)\n            if len(cls_shape) <= len(hidden_shape) - 2:\n                cls_index = cls_index[..., tf.newaxis]\n            # else:\n            # cls_index = cls_index[..., tf.newaxis]\n            # cls_index = cls_index.expand((-1,) * (cls_index.dim()-1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = tf.gather(hidden_states, cls_index, batch_dims=len(hidden_shape) - 2)\n            output = tf.squeeze(\n                output, axis=len(hidden_shape) - 2\n            )  # shape of output: (batch, num choices, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        if self.has_first_dropout:\n            output = self.first_dropout(output, training=training)\n\n        if self.has_summary:\n            output = self.summary(output)\n\n        if self.has_activation:\n            output = self.activation(output)\n\n        if self.has_last_dropout:\n            output = self.last_dropout(output, training=training)\n\n        return output\n\n\ndef shape_list(x):\n    \"\"\"Deal with dynamic shape in tensorflow cleanly.\"\"\"\n    static = x.shape.as_list()\n    dynamic = tf.shape(x)\n    return [dynamic[i] if s is None else s for i, s in enumerate(static)]\n\n\ndef get_initializer(initializer_range=0.02):\n    \"\"\"Creates a `tf.initializers.truncated_normal` with the given range.\n    Args:\n        initializer_range: float, initializer range for stddev.\n    Returns:\n        TruncatedNormal initializer with stddev = `initializer_range`.\n    \"\"\"\n    return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSequenceSummary, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = tf.constant(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = tf.constant(np.cos(position_enc[:, 1::2]))\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None, dtype=tf.float32):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    bs = shape_list(lengths)[0]\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        # assert lengths.max().item() <= slen\n        alen = tf.range(slen)\n        mask = tf.math.less(alen, lengths[:, tf.newaxis])\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    if causal:\n        attn_mask = tf.less_equal(\n            tf.tile(alen[tf.newaxis, tf.newaxis, :], (bs, slen, 1)), alen[tf.newaxis, :, tf.newaxis]\n        )\n    else:\n        attn_mask = mask\n\n    # sanity check\n    # assert shape_list(mask) == [bs, slen]\n    tf.debugging.assert_equal(shape_list(mask), [bs, slen])\n    assert causal is False or shape_list(attn_mask) == [bs, slen, slen]\n\n    mask = tf.cast(mask, dtype=dtype)\n    attn_mask = tf.cast(attn_mask, dtype=dtype)\n\n    return mask, attn_mask\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config, **kwargs):\n        super().__init__(**kwargs)\n      
  self.layer_id = next(TFMultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"q_lin\")\n        self.k_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"k_lin\")\n        self.v_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"v_lin\")\n        self.out_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"out_lin\")\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        input, mask, kv, cache, head_mask = inputs\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = shape_list(input)\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = shape_list(kv)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if len(shape_list(mask)) == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, qlen, klen)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))                            # (bs, n_heads, qlen, klen)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n       
 if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TFTransformerFFN(tf.keras.layers.Layer):\n    def __init__(self, in_dim, dim_hidden, out_dim, config, **kwargs):\n        super().__init__(**kwargs)\n        self.lin1 = tf.keras.layers.Dense(dim_hidden, kernel_initializer=get_initializer(config.init_std), name=\"lin1\")\n        self.lin2 = tf.keras.layers.Dense(out_dim, kernel_initializer=get_initializer(config.init_std), name=\"lin2\")\n        self.act = tf.keras.layers.Activation(gelu) if config.gelu_activation else tf.keras.activations.relu\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFXLMMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.attention_dropout = tf.keras.layers.Dropout(config.attention_dropout)\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            self.dim,\n            embeddings_initializer=get_initializer(config.embed_init_std),\n            name=\"position_embeddings\",\n        )\n        if config.sinusoidal_embeddings:\n            raise NotImplementedError\n            # create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = tf.keras.layers.Embedding(\n                self.n_langs,\n                self.dim,\n                embeddings_initializer=get_initializer(config.embed_init_std),\n                name=\"lang_embeddings\",\n            )\n        self.embeddings = TFSharedEmbeddings(\n            
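# token embedding table built with TFSharedEmbeddings, whose linear mode lets the same weight matrix double as a tied output projection\n            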
self.n_words, self.dim, initializer_range=config.embed_init_std, name=\"embeddings\"\n        )  # padding_idx=self.pad_index)\n        self.layer_norm_emb = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm_emb\")\n\n        # transformer layers\n        self.attentions = []\n        self.layer_norm1 = []\n        self.ffns = []\n        self.layer_norm2 = []\n        # if self.is_decoder:\n        #     self.layer_norm15 = []\n        #     self.encoder_attn = []\n\n        for i in range(self.n_layers):\n            self.attentions.append(\n                TFMultiHeadAttention(self.n_heads, self.dim, config=config, name=\"attentions_._{}\".format(i))\n            )\n            self.layer_norm1.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm1_._{}\".format(i))\n            )\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(\n                TFTransformerFFN(self.dim, self.hidden_dim, self.dim, config=config, name=\"ffns_._{}\".format(i))\n            )\n            self.layer_norm2.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm2_._{}\".format(i))\n            )\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):  # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = 
inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = 
self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = self.dropout(attn, training=training)\n            tensor = tensor + attn\n            tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass TFXLMPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        # Sometimes XLM has language embeddings so don't forget to build them as well if needed\n        inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFXLMPredLayer(tf.keras.layers.Layer):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n      
  super().__init__(**kwargs)\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        if config.asm is False:\n            self.input_embeddings = input_embeddings\n        else:\n            raise NotImplementedError\n            # self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n            #     in_features=dim,\n            #     n_classes=config.n_words,\n            #     cutoffs=config.asm_cutoffs,\n            #     div_value=config.asm_div_value,\n            #     head_bias=True,  # default is False\n            # )\n\n    def build(self, input_shape):\n        # The output weights are the same as the input embeddings, but there is an output-only bias for each token.\n        self.bias = self.add_weight(shape=(self.n_words,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMWithLMHeadModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.pred_layer = TFXLMPredLayer(config, self.transformer.embeddings, name=\"pred_layer_._proj\")\n\n    def get_output_embeddings(self):\n        return self.pred_layer.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = inputs.shape[0]\n        mask_token = tf.ones((effective_batch_size, 1), dtype=tf.int32) * mask_token_id\n        inputs = tf.concat([inputs, mask_token], axis=1)\n\n        if lang_id is not None:\n            langs = tf.ones_like(inputs) * lang_id\n        else:\n            langs = None\n        return {\"inputs\": inputs, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to 
compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMWithLMHeadModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output)\n        outputs = (outputs,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForSequenceClassification(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(config, initializer_range=config.init_std, name=\"sequence_summary\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForSequenceClassification\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = 
self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.init_std), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForQuestionAnsweringSimple\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0  XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_tf_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef gelu(x):\n    \"\"\" Implementation of the gelu activation function.\n        XLNet is using OpenAI GPT's gelu\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFXLNetRelativeAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n        self.initializer_range = config.initializer_range\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.q = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"q\"\n        )\n        self.k = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"k\"\n        )\n        self.v = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"v\"\n        )\n        self.o = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"o\"\n        )\n        self.r = 
self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"r\"\n        )\n        self.r_r_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n        )\n        self.r_s_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_s_bias\"\n        )\n        self.r_w_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n        )\n        self.seg_embed = self.add_weight(\n            shape=(2, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"seg_embed\"\n        )\n        super().build(input_shape)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def rel_shift(self, x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = shape_list(x)\n\n        x = tf.reshape(x, (x_size[1], x_size[0], x_size[2], x_size[3]))\n        x = x[1:, ...]\n        x = tf.reshape(x, (x_size[0], x_size[1] - 1, x_size[2], x_size[3]))\n        x = x[:, 0:klen, :, :]\n        # x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    def rel_attn_core(self, inputs, training=False):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        q_head, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask, head_mask = inputs\n\n        # content based attention score\n        ac = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift(bd, klen=shape_list(ac)[1])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = tf.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = tf.einsum(\"ijbs,ibns->ijbn\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == tf.float16:\n                attn_score = attn_score - 65500 * attn_mask\n            else:\n                attn_score = attn_score - 1e30 * attn_mask\n\n        # attention probability\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n\n        attn_prob = self.dropout(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # attention output\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, attn_prob\n\n        return attn_vec\n\n    def post_attention(self, inputs, residual=True, training=False):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to `d_model`)\n        h, attn_vec = inputs\n\n        attn_out = tf.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out, training=training)\n\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def call(self, inputs, training=False):\n        (h, g, 
attn_mask_h, attn_mask_g, r, seg_mat, mems, target_mapping, head_mask) = inputs\n\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec_h], training=training)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = tf.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = tf.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = tf.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention([g, attn_vec_g], training=training)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec], training=training)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + 
(attn_prob,)\n        return outputs\n\n\nclass TFXLNetFeedForward(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.layer_1 = tf.keras.layers.Dense(\n            config.d_inner, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_1\"\n        )\n        self.layer_2 = tf.keras.layers.Dense(\n            config.d_model, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_2\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def call(self, inp, training=False):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_2(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass TFXLNetLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.rel_attn = TFXLNetRelativeAttention(config, name=\"rel_attn\")\n        self.ff = TFXLNetFeedForward(config, name=\"ff\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, inputs, training=False):\n        outputs = self.rel_attn(inputs, training=training)\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g, training=training)\n        output_h = self.ff(output_h, training=training)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass TFXLNetLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFXLNetMainLayer(tf.keras.layers.Layer):\n    config_class = XLNetConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n        self.use_bfloat16 = config.use_bfloat16\n        self.initializer_range = config.initializer_range\n\n        self.word_embedding = 
TFSharedEmbeddings(\n            config.vocab_size, config.d_model, initializer_range=config.initializer_range, name=\"word_embedding\"\n        )\n        self.layer = [TFXLNetLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layer)]\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.mask_emb = self.add_weight(\n            shape=(1, 1, self.d_model), initializer=initializer, trainable=True, name=\"mask_emb\"\n        )\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen, dtype=tf.float32):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: TODO Lysandre didn't fill\n            mlen: TODO Lysandre didn't fill\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = tf.ones([qlen, qlen], dtype=dtype)\n        mask_u = tf.matrix_band_part(attn_mask, 0, -1)\n        mask_dia = tf.matrix_band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen], dtype=dtype)\n        ret = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.matrix_band_part(attn_mask, -1, 0)\n            ret = tf.concat([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], 1)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        \"\"\"cache hidden states into memory.\"\"\"\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :]\n\n        return tf.stop_gradient(new_mem)\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], axis=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = tf.tile(pos_emb, [1, bsz, 1])\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None, dtype=None):\n        \"\"\"create relative positional encoding.\"\"\"\n        freq_seq = tf.range(0, self.d_model, 2.0)\n        if dtype is not None and dtype != tf.float32:\n            freq_seq = tf.cast(freq_seq, dtype=dtype)\n        inv_freq = 1 / (10000 ** (freq_seq / self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` 
{}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            bwd_pos_seq = tf.range(-beg, -end, 1.0)\n\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n                bwd_pos_seq = tf.cast(bwd_pos_seq, dtype=dtype)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n                bwd_pos_seq = tf.clip_by_value(bwd_pos_seq, -self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                # With bi_data, the batch size should be divisible by 2.\n                assert bsz % 2 == 0\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = tf.concat([fwd_pos_emb, bwd_pos_emb], axis=1)\n        else:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        return pos_emb\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            mems = inputs[2] if len(inputs) > 2 else mems\n            perm_mask = inputs[3] if len(inputs) > 3 else perm_mask\n            target_mapping = inputs[4] if len(inputs) > 4 else target_mapping\n            token_type_ids = inputs[5] if len(inputs) > 5 else token_type_ids\n            input_mask = inputs[6] if len(inputs) > 6 else input_mask\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            use_cache = inputs[9] if len(inputs) > 9 else use_cache\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            mems = inputs.get(\"mems\", mems)\n            perm_mask = inputs.get(\"perm_mask\", perm_mask)\n            target_mapping = inputs.get(\"target_mapping\", target_mapping)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            input_mask = inputs.get(\"input_mask\", input_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for XLNet 
uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)[:2]\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = tf.transpose(token_type_ids, perm=(1, 0)) if token_type_ids is not None else None\n        input_mask = tf.transpose(input_mask, perm=(1, 0)) if input_mask is not None else None\n        attention_mask = tf.transpose(attention_mask, perm=(1, 0)) if attention_mask is not None else None\n        perm_mask = tf.transpose(perm_mask, perm=(1, 2, 0)) if perm_mask is not None else None\n        target_mapping = tf.transpose(target_mapping, perm=(1, 2, 0)) if target_mapping is not None else None\n\n        mlen = shape_list(mems[0])[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = tf.bfloat16 if self.use_bfloat16 else tf.float32\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        assert input_mask is None or attention_mask is None, (\n            \"You can only use one of input_mask (uses 1 for padding) \"\n            \"or attention_mask (uses 0 for padding, added for compatbility with BERT). 
Please choose one.\"\n        )\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - tf.cast(attention_mask, dtype=dtype_float)\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = tf.zeros([shape_list(data_mask)[0], mlen, bsz], dtype=dtype_float)\n                data_mask = tf.concat([mems_mask, data_mask], axis=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = tf.cast(attn_mask > 0, dtype=dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -tf.eye(qlen, dtype=dtype_float)\n            if mlen > 0:\n                non_tgt_mask = tf.concat([tf.zeros([qlen, mlen], dtype=dtype_float), non_tgt_mask], axis=-1)\n            non_tgt_mask = tf.cast((attn_mask + non_tgt_mask[:, :, None, None]) > 0, dtype=dtype_float)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k, training=training)\n        if target_mapping is not None:\n            word_emb_q = tf.tile(self.mask_emb, [shape_list(target_mapping)[0], bsz, 1])\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q, training=training)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = tf.zeros([mlen, bsz], dtype=tf.int32)\n                cat_ids = tf.concat([mem_pad, token_type_ids], 0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = tf.cast(tf.logical_not(tf.equal(token_type_ids[:, None], cat_ids[None, :])), tf.int32)\n            seg_mat = tf.one_hot(seg_mat, 2, dtype=dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz, dtype=dtype_float)\n        pos_emb = self.dropout(pos_emb, training=training)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n     
   if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            # cache new mems\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                [output_h, output_g, non_tgt_mask, attn_mask, pos_emb, seg_mat, mems[i], target_mapping, head_mask[i]],\n                training=training,\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h, training=training)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (tf.transpose(output, perm=(1, 0, 2)),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(tf.transpose(hs, perm=(1, 0, 2)) for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\nclass TFXLNetPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    base_model_prefix = \"transformer\"\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.XLNetTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetModel.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetLMHeadModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.lm_loss = TFXLNetLMHead(config, self.transformer.word_embedding, name=\"lm_loss\")\n\n    def get_output_embeddings(self):\n        return self.lm_loss.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = inputs.shape[0]\n        dummy_token = tf.zeros((effective_batch_size, 1), dtype=tf.int32)\n        inputs = tf.concat([inputs, dummy_token], axis=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = inputs.shape[1]\n        perm_mask = tf.zeros((effective_batch_size, sequence_length, sequence_length - 1), dtype=tf.float32)\n        perm_mask_seq_end = tf.ones((effective_batch_size, sequence_length, 1), dtype=tf.float32)\n        perm_mask = tf.concat([perm_mask, perm_mask_seq_end], axis=-1)\n\n        # We'll only predict the last token\n        target_mapping = tf.zeros((effective_batch_size, 1, sequence_length - 1), dtype=tf.float32)\n        target_mapping_seq_end = tf.ones((effective_batch_size, 1, 1), dtype=tf.float32)\n        target_mapping = tf.concat([target_mapping, target_mapping_seq_end], axis=-1)\n\n        inputs = {\n            \"inputs\": inputs,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        import numpy as np\n        from transformers1 import XLNetTokenizer, TFXLNetLMHeadModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=True))[None, :]  # We will predict the masked token\n        perm_mask = np.zeros((1, input_ids.shape[1], input_ids.shape[1]))\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = np.zeros((1, 1, input_ids.shape[1]))  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n        outputs = model(input_ids, perm_mask=tf.constant(perm_mask, dtype=tf.float32), target_mapping=tf.constant(target_mapping, dtype=tf.float32))\n\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_state = transformer_outputs[0]\n        logits = self.lm_loss(hidden_state)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"sequence_summary\"\n        )\n        self.logits_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"logits_proj\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForSequenceClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForTokenClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForTokenClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = self.classifier(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForQuestionAnsweringSimple\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = TFXLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n   
     return outputs  # start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n# @add_start_docstrings(\"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n#     the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n#     XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING)\n# class TFXLNetForQuestionAnswering(TFXLNetPreTrainedModel):\n#     r\"\"\"\n#     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n#         **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n#         **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Indices for the top config.start_n_top start token possibilities (beam-search).\n#         **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size,)``\n#             Log probabilities for the ``is_impossible`` label of the answers.\n#         **mems**:\n#             list of ``tf.Tensor`` (one for each layer):\n#             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n#             if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.\n#             See details in the docstring of the `mems` input above.\n#         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n#             list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)\n#             of shape ``(batch_size, sequence_length, hidden_size)``:\n#             Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n#         **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n#             list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n#             Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n#     Examples::\n\n#         # For example purposes. 
Not runnable.\n#         tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n#         model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')\n#         input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n#         start_positions = tf.constant([1])\n#         end_positions = tf.constant([3])\n#         outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n#         loss, start_scores, end_scores = outputs[:2]\n\n#     \"\"\"\n#     def __init__(self, config, *inputs, **kwargs):\n#         super().__init__(config, *inputs, **kwargs)\n#         self.start_n_top = config.start_n_top\n#         self.end_n_top = config.end_n_top\n\n#         self.transformer = TFXLNetMainLayer(config, name='transformer')\n#         self.start_logits = TFPoolerStartLogits(config, name='start_logits')\n#         self.end_logits = TFPoolerEndLogits(config, name='end_logits')\n#         self.answer_class = TFPoolerAnswerClass(config, name='answer_class')\n\n#     def call(self, inputs, training=False):\n#         transformer_outputs = self.transformer(inputs, training=training)\n#         hidden_states = transformer_outputs[0]\n#         start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n#         outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n#         if start_positions is not None and end_positions is not None:\n#             # If we are on multi-GPU, let's remove the dimension added by batch splitting\n#             for x in (start_positions, end_positions, cls_index, is_impossible):\n#                 if x is not None and x.dim() > 1:\n#                     x.squeeze_(-1)\n\n#             # during training, compute the end logits based on the ground truth of the start position\n#             end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n#             loss_fct = CrossEntropyLoss()\n#             start_loss = loss_fct(start_logits, start_positions)\n#             end_loss = loss_fct(end_logits, end_positions)\n#             total_loss = (start_loss + end_loss) / 2\n\n#             if cls_index is not None and is_impossible is not None:\n#                 # Predict answerability from the representation of CLS and START\n#                 cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n#                 loss_fct_cls = nn.BCEWithLogitsLoss()\n#                 cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n#                 # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n#                 total_loss += cls_loss * 0.5\n\n#             outputs = (total_loss,) + outputs\n\n#         else:\n#             # during inference, compute the end logits based on beam search\n#             bsz, slen, hsz = hidden_states.size()\n#             start_log_probs = F.softmax(start_logits, dim=-1) # shape (bsz, slen)\n\n#             start_top_log_probs, start_top_index = torch.topk(start_log_probs, self.start_n_top, dim=-1) # shape (bsz, start_n_top)\n#             start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz) # shape (bsz, start_n_top, hsz)\n#             start_states = torch.gather(hidden_states, -2, start_top_index_exp) # shape (bsz, start_n_top, hsz)\n#             start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1) # shape (bsz, 
slen, start_n_top, hsz)\n\n#             hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(start_states) # shape (bsz, slen, start_n_top, hsz)\n#             p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n#             end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n#             end_log_probs = F.softmax(end_logits, dim=1) # shape (bsz, slen, start_n_top)\n\n#             end_top_log_probs, end_top_index = torch.topk(end_log_probs, self.end_n_top, dim=1) # shape (bsz, end_n_top, start_n_top)\n#             end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n#             end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n#             start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)  # get the representation of START as weighted sum of hidden states\n#             cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)  # Shape (batch size,): one single `cls_logits` for each sample\n\n#             outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n#         # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n#         # or (if labels are provided) (total_loss,)\n#         return outputs\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n    In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py\n\"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_transfo_xl_utilities import ProjectedAdaptiveLogSoftmax\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\ndef build_tf_to_pytorch_map(model, config):\n    \"\"\" A map of modules from TF to PyTorch.\n        This time I use a map to keep the PyTorch model as identical to the original PyTorch model as possible.\n    \"\"\"\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        # We are loading in a TransfoXLLMHeadModel => we will load also the Adaptive Softmax\n        tf_to_pt_map.update(\n            {\n                \"transformer/adaptive_softmax/cutoff_0/cluster_W\": model.crit.cluster_weight,\n                \"transformer/adaptive_softmax/cutoff_0/cluster_b\": model.crit.cluster_bias,\n            }\n        )\n        for i, (out_l, proj_l, tie_proj) in enumerate(\n            zip(model.crit.out_layers, model.crit.out_projs, config.tie_projs)\n        ):\n            layer_str = \"transformer/adaptive_softmax/cutoff_%d/\" % i\n            if config.tie_weight:\n                tf_to_pt_map.update({layer_str + \"b\": out_l.bias})\n            else:\n                raise NotImplementedError\n                # I don't think this is implemented in the TF code\n                tf_to_pt_map.update({layer_str + \"lookup_table\": out_l.weight, layer_str + \"b\": out_l.bias})\n            if not tie_proj:\n                tf_to_pt_map.update({layer_str + \"proj\": proj_l})\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings\n    for i, (embed_l, proj_l) in enumerate(zip(model.word_emb.emb_layers, model.word_emb.emb_projs)):\n        layer_str = \"transformer/adaptive_embed/cutoff_%d/\" % i\n        tf_to_pt_map.update({layer_str + \"lookup_table\": embed_l.weight, layer_str + \"proj_W\": proj_l})\n\n    # Transformer blocks\n    for i, b in enumerate(model.layers):\n        layer_str = \"transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.dec_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": 
b.dec_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.dec_attn.o_net.weight,\n                layer_str + \"rel_attn/qkv/kernel\": b.dec_attn.qkv_net.weight,\n                layer_str + \"rel_attn/r/kernel\": b.dec_attn.r_net.weight,\n                layer_str + \"ff/LayerNorm/gamma\": b.pos_ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.pos_ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.pos_ff.CoreNet[0].weight,\n                layer_str + \"ff/layer_1/bias\": b.pos_ff.CoreNet[0].bias,\n                layer_str + \"ff/layer_2/kernel\": b.pos_ff.CoreNet[3].weight,\n                layer_str + \"ff/layer_2/bias\": b.pos_ff.CoreNet[3].bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        for b in model.layers:\n            r_r_list.append(b.dec_attn.r_r_bias)\n            r_w_list.append(b.dec_attn.r_w_bias)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n    tf_to_pt_map.update({\"transformer/r_r_bias\": r_r_list, \"transformer/r_w_bias\": r_w_list})\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_transfo_xl(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_to_pytorch_map(model, config)\n\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    for name, pointer in tf_to_pt_map.items():\n        assert name in tf_weights\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name or \"proj\" in name:\n            array = np.transpose(array)\n        if (\"r_r_bias\" in name or \"r_w_bias\" in name) and len(pointer) > 1:\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    
logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nclass PositionalEmbedding(nn.Module):\n    def __init__(self, demb):\n        super().__init__()\n\n        self.demb = demb\n\n        inv_freq = 1 / (10000 ** (torch.arange(0.0, demb, 2.0) / demb))\n        self.register_buffer(\"inv_freq\", inv_freq)\n\n    def forward(self, pos_seq, bsz=None):\n        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)\n        pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)\n\n        if bsz is not None:\n            return pos_emb[:, None, :].expand(-1, bsz, -1)\n        else:\n            return pos_emb[:, None, :]\n\n\nclass PositionwiseFF(nn.Module):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5):\n        super().__init__()\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.CoreNet = nn.Sequential(\n            nn.Linear(d_model, d_inner),\n            nn.ReLU(inplace=True),\n            nn.Dropout(dropout),\n            nn.Linear(d_inner, d_model),\n            nn.Dropout(dropout),\n        )\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.pre_lnorm = pre_lnorm\n\n    def forward(self, inp):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.CoreNet(self.layer_norm(inp))\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.CoreNet(inp)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass RelPartialLearnableMultiHeadAttn(nn.Module):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n    ):\n        super().__init__()\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False)\n\n        self.drop = nn.Dropout(dropout)\n        self.dropatt = nn.Dropout(dropatt)\n        self.o_net = nn.Linear(n_head * d_head, d_model, bias=False)\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is None or r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        else:\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n\n        self.r_net = nn.Linear(self.d_model, self.n_head * self.d_head, bias=False)\n\n    def _rel_shift(self, x):\n        zero_pad_shape = (x.size(0), 1) + x.size()[2:]\n        zero_pad = torch.zeros(zero_pad_shape, device=x.device, dtype=x.dtype)\n        x_padded = torch.cat([zero_pad, x], dim=1)\n\n        x_padded_shape = (x.size(1) + 1, x.size(0)) + x.size()[2:]\n        x_padded = 
x_padded.view(*x_padded_shape)\n\n        x = x_padded[1:].view_as(x)\n\n        return x\n\n    def forward(self, w, r, attn_mask=None, mems=None, head_mask=None):\n        qlen, rlen, bsz = w.size(0), r.size(0), w.size(1)\n\n        if mems is not None:\n            cat = torch.cat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n\n        klen = w_head_k.size(0)\n\n        w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n\n        r_head_k = r_head_k.view(rlen, self.n_head, self.d_head)  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = torch.einsum(\"ibnd,jbnd->ijbn\", (rw_head_q, w_head_k))  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = torch.einsum(\"ibnd,jnd->ijbn\", (rr_head_q, r_head_k))  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score.mul_(self.scale)\n\n        # compute attention probability\n        if attn_mask is not None and torch.sum(attn_mask).item():\n            attn_mask = attn_mask == 1  # Switch to bool\n            if attn_mask.dim() == 2:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = (\n                        attn_score.float().masked_fill(attn_mask[None, :, :, None], -65000).type_as(attn_score)\n                    )\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[None, :, :, None], -1e30).type_as(attn_score)\n            elif attn_mask.dim() == 3:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -65000).type_as(attn_score)\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -1e30).type_as(attn_score)\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = F.softmax(attn_score, dim=1)\n        attn_prob = self.dropatt(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = torch.einsum(\"ijbn,jbnd->ibnd\", (attn_prob, w_head_v))\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec = attn_vec.contiguous().view(attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head)\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        
else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass RelPartialLearnableDecoderLayer(nn.Module):\n    def __init__(self, n_head, d_model, d_head, d_inner, dropout, layer_norm_epsilon=1e-5, **kwargs):\n        super().__init__()\n\n        self.dec_attn = RelPartialLearnableMultiHeadAttn(\n            n_head, d_model, d_head, dropout, layer_norm_epsilon=layer_norm_epsilon, **kwargs\n        )\n        self.pos_ff = PositionwiseFF(\n            d_model, d_inner, dropout, pre_lnorm=kwargs.get(\"pre_lnorm\"), layer_norm_epsilon=layer_norm_epsilon\n        )\n\n    def forward(self, dec_inp, r, dec_attn_mask=None, mems=None, head_mask=None):\n\n        attn_outputs = self.dec_attn(dec_inp, r, attn_mask=dec_attn_mask, mems=mems, head_mask=head_mask)\n        ff_output = self.pos_ff(attn_outputs[0])\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass AdaptiveEmbedding(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, sample_softmax=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = nn.ModuleList()\n        self.emb_projs = nn.ParameterList()\n        if div_val == 1:\n            self.emb_layers.append(nn.Embedding(n_token, d_embed, sparse=sample_softmax > 0))\n            if d_proj != d_embed:\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(nn.Embedding(r_idx - l_idx, d_emb_i))\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n    def forward(self, inp):\n        if self.div_val == 1:\n            embed = self.emb_layers[0](inp)\n            if self.d_proj != self.d_embed:\n                embed = F.linear(embed, self.emb_projs[0])\n        else:\n            param = next(self.parameters())\n            inp_flat = inp.view(-1)\n            emb_flat = torch.zeros([inp_flat.size(0), self.d_proj], dtype=param.dtype, device=param.device)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n                indices_i = mask_i.nonzero().squeeze()\n\n                if indices_i.numel() == 0:\n                    continue\n\n                inp_i = inp_flat.index_select(0, indices_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = F.linear(emb_i, self.emb_projs[i])\n\n                emb_flat.index_copy_(0, indices_i, emb_i)\n\n            embed_shape = inp.size() + (self.d_proj,)\n            embed = emb_flat.view(embed_shape)\n\n        embed.mul_(self.emb_scale)\n\n        return embed\n\n\nclass TransfoXLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    
config_class = TransfoXLConfig\n    load_tf_weights = load_tf_weights_in_transfo_xl\n    base_model_prefix = \"transformer\"\n\n    def _init_weight(self, weight):\n        if self.config.init == \"uniform\":\n            nn.init.uniform_(weight, -self.config.init_range, self.config.init_range)\n        elif self.config.init == \"normal\":\n            nn.init.normal_(weight, 0.0, self.config.init_std)\n\n    def _init_bias(self, bias):\n        nn.init.constant_(bias, 0.0)\n\n    def _init_weights(self, m):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        classname = m.__class__.__name__\n        if classname.find(\"Linear\") != -1:\n            if hasattr(m, \"weight\") and m.weight is not None:\n                self._init_weight(m.weight)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        elif classname.find(\"AdaptiveEmbedding\") != -1:\n            if hasattr(m, \"emb_projs\"):\n                for i in range(len(m.emb_projs)):\n                    if m.emb_projs[i] is not None:\n                        nn.init.normal_(m.emb_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"Embedding\") != -1:\n            if hasattr(m, \"weight\"):\n                self._init_weight(m.weight)\n        elif classname.find(\"ProjectedAdaptiveLogSoftmax\") != -1:\n            if hasattr(m, \"cluster_weight\") and m.cluster_weight is not None:\n                self._init_weight(m.cluster_weight)\n            if hasattr(m, \"cluster_bias\") and m.cluster_bias is not None:\n                self._init_bias(m.cluster_bias)\n            if hasattr(m, \"out_projs\"):\n                for i in range(len(m.out_projs)):\n                    if m.out_projs[i] is not None:\n                        nn.init.normal_(m.out_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"LayerNorm\") != -1:\n            if hasattr(m, \"weight\"):\n                nn.init.normal_(m.weight, 1.0, self.config.init_std)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        else:\n            if hasattr(m, \"r_emb\"):\n                self._init_weight(m.r_emb)\n            if hasattr(m, \"r_w_bias\"):\n                self._init_weight(m.r_w_bias)\n            if hasattr(m, \"r_r_bias\"):\n                self._init_weight(m.r_r_bias)\n            if hasattr(m, \"r_bias\"):\n                self._init_bias(m.r_bias)\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            
:func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n\n        self.word_emb = AdaptiveEmbedding(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.drop = nn.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        if not config.untie_r:\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n\n        self.layers = nn.ModuleList()\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    RelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if config.untie_r else self.r_w_bias,\n                        r_r_bias=None if config.untie_r else self.r_r_bias,\n                     
   output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings are not used in our pretrained checkpoints\n            raise NotImplementedError  # Removed them to avoid maintaining dead code\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = PositionalEmbedding(self.d_model)\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_emb = new_embeddings\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        logger.info(\"Head pruning is not implemented for Transformer-XL model\")\n        pass\n\n    def init_mems(self, bsz):\n        if self.mem_len > 0:\n            mems = []\n            param = next(self.parameters())\n            for i in range(self.n_layer):\n                empty = torch.zeros(self.mem_len, bsz, self.config.d_model, dtype=param.dtype, device=param.device)\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        with torch.no_grad():\n            new_mems = []\n            end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n            beg_idx = max(0, end_idx - self.mem_len)\n            for i in range(len(hids)):\n\n                cat = torch.cat([mems[i], hids[i]], dim=0)\n                new_mems.append(cat[beg_idx:end_idx].detach())\n\n        return new_mems\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.size()\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = mems[0].size(0) if mems is not None else 0\n        klen = mlen + qlen\n        if self.same_length:\n            all_ones = 
word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n            mask_len = klen - self.mem_len\n            if mask_len > 0:\n                mask_shift_len = qlen - mask_len\n            else:\n                mask_shift_len = qlen\n            dec_attn_mask = (torch.triu(all_ones, 1 + mlen) + torch.tril(all_ones, -mask_shift_len))[:, :, None]  # -1\n        else:\n            dec_attn_mask = torch.triu(word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1 + mlen)[\n                :, :, None\n            ]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, dtype=word_emb.dtype)\n            if self.clamp_len > 0:\n                pos_seq.clamp_(max=self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb)\n            pos_emb = self.drop(pos_emb)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer(\n                    core_out, pos_emb, dec_attn_mask=dec_attn_mask, mems=mems_i, head_mask=head_mask[i]\n                )\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [core_out.transpose(0, 1).contiguous(), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(t.transpose(0, 1).contiguous() for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs.append(attentions)\n\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLLMHeadModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TransfoXLModel(config)\n        self.sample_softmax = config.sample_softmax\n\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = ProjectedAdaptiveLogSoftmax(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.init_weights()\n\n    def tie_weights(self):\n        \"\"\"\n        Run this to be sure output and input (adaptive) softmax weights are tied\n        \"\"\"\n\n        if self.config.tie_weight:\n            for i in range(len(self.crit.out_layers)):\n                self._tie_or_clone_weights(self.crit.out_layers[i], self.transformer.word_emb.emb_layers[i])\n        if self.config.tie_projs:\n            for i, tie_proj in enumerate(self.config.tie_projs):\n                if tie_proj and self.config.div_val == 1 and self.config.d_model != self.config.d_embed:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[0].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[0]\n                elif tie_proj and self.config.div_val != 1:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[i].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[i]\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(batch_size, sequence_length-1)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLLMHeadModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if input_ids is not None:\n            bsz, tgt_len = input_ids.size(0), input_ids.size(1)\n        elif inputs_embeds is not None:\n            bsz, tgt_len = inputs_embeds.size(0), inputs_embeds.size(1)\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        transformer_outputs = self.transformer(input_ids, mems=mems, head_mask=head_mask, inputs_embeds=inputs_embeds)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit(pred_hid, labels)\n        if labels is None:\n            softmax_output = softmax_output.view(bsz, tgt_len, -1)\n            outputs = [softmax_output] + outputs\n        else:\n            softmax_output = softmax_output.view(bsz, tgt_len - 1)\n            outputs = [softmax_output, None] + outputs\n\n        return outputs  # (loss), logits or None if labels is not None (speed up adaptive softmax), new_mems, (all hidden states), (all attentions)\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if self.sample_softmax > 0:\n            return self.out_layer\n        else:\n            return self.crit.out_layers[-1]\n\n    def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):\n        inputs = {\"input_ids\": input_ids}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Utilities for PyTorch Transformer XL model.\n    Directly adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\n# CUDA_MAJOR = int(torch.version.cuda.split('.')[0])\n# CUDA_MINOR = int(torch.version.cuda.split('.')[1])\n\n\nclass ProjectedAdaptiveLogSoftmax(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, keep_order=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [n_token]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n\n        if self.n_clusters > 0:\n            self.cluster_weight = nn.Parameter(torch.zeros(self.n_clusters, self.d_embed))\n            self.cluster_bias = nn.Parameter(torch.zeros(self.n_clusters))\n\n        self.out_layers = nn.ModuleList()\n        self.out_projs = nn.ParameterList()\n\n        if div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if d_proj != d_embed:\n                    self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n                else:\n                    self.out_projs.append(None)\n\n            self.out_layers.append(nn.Linear(d_embed, n_token))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n\n                self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n                self.out_layers.append(nn.Linear(d_emb_i, r_idx - l_idx))\n\n        self.keep_order = keep_order\n\n    def _compute_logit(self, hidden, weight, bias, proj):\n        if proj is None:\n            logit = F.linear(hidden, weight, bias=bias)\n        else:\n            # if CUDA_MAJOR <= 9 and CUDA_MINOR <= 1:\n            proj_hid = F.linear(hidden, proj.t().contiguous())\n            logit = F.linear(proj_hid, weight, bias=bias)\n            # else:\n            #     logit = torch.einsum('bd,de,ev->bv', (hidden, proj, weight.t()))\n            #     if bias is not None:\n            #         logit = logit + bias\n\n        return logit\n\n    def forward(self, hidden, labels=None, keep_order=False):\n        \"\"\"\n            Params:\n                hidden :: [len*bsz x d_proj]\n                labels :: [len*bsz]\n            Return:\n                if labels is None:\n                    out :: [len*bsz x n_tokens] log probabilities of tokens over 
the vocabulary\n                else:\n                    out :: [(len-1)*bsz] Negative log likelihood\n            We could replace this implementation by the native PyTorch one\n            if their's had an option to set bias on all clusters in the native one.\n            here: https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138\n        \"\"\"\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            hidden = hidden[..., :-1, :].contiguous()\n            labels = labels[..., 1:].contiguous()\n            hidden = hidden.view(-1, hidden.size(-1))\n            labels = labels.view(-1)\n            if hidden.size(0) != labels.size(0):\n                raise RuntimeError(\"Input and labels should have the same size \" \"in the batch dimension.\")\n        else:\n            hidden = hidden.view(-1, hidden.size(-1))\n\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            if labels is not None:\n                out = -F.log_softmax(logit, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)\n            else:\n                out = F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            if labels is None:\n                out = hidden.new_empty((head_logit.size(0), self.n_token))\n            else:\n                out = torch.zeros_like(labels, dtype=hidden.dtype, device=hidden.device)\n\n            offset = 0\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if labels is not None:\n                    mask_i = (labels >= l_idx) & (labels < r_idx)\n                    indices_i = mask_i.nonzero().squeeze()\n\n                    if indices_i.numel() == 0:\n                        continue\n\n                    target_i = labels.index_select(0, indices_i) - l_idx\n                    head_logprob_i = head_logprob.index_select(0, indices_i)\n                    hidden_i = hidden.index_select(0, indices_i)\n                else:\n                    hidden_i = hidden\n\n                if i == 0:\n                    if labels is not None:\n                        logprob_i = head_logprob_i.gather(1, target_i[:, None]).squeeze(1)\n                    else:\n                        out[:, : self.cutoffs[0]] = 
head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden_i, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    if labels is not None:\n                        logprob_i = head_logprob_i[:, cluster_prob_idx] + tail_logprob_i.gather(\n                            1, target_i[:, None]\n                        ).squeeze(1)\n                    else:\n                        logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                        out[:, l_idx:r_idx] = logprob_i\n\n                if labels is not None:\n                    if (hasattr(self, \"keep_order\") and self.keep_order) or keep_order:\n                        out.index_copy_(0, indices_i, -logprob_i)\n                    else:\n                        out[offset : offset + logprob_i.size(0)].copy_(-logprob_i)\n                    offset += logprob_i.size(0)\n\n        return out\n\n    def log_prob(self, hidden):\n        r\"\"\" Computes log probabilities for all :math:`n\\_classes`\n        From: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/adaptive.py\n        Args:\n            hidden (Tensor): a minibatch of examples\n        Returns:\n            log-probabilities of for each class :math:`c`\n            in range :math:`0 <= c <= n\\_classes`, where :math:`n\\_classes` is a\n            parameter passed to ``AdaptiveLogSoftmaxWithLoss`` constructor.\n        Shape:\n            - Input: :math:`(N, in\\_features)`\n            - Output: :math:`(N, n\\_classes)`\n        \"\"\"\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            return F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n\n            out = hidden.new_empty((head_logit.size(0), self.n_token))\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                start_idx, stop_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if i == 0:\n                    out[:, : self.cutoffs[0]] = head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], 
biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n\n                    # use the same cluster index as in forward() and broadcast over the tail vocabulary,\n                    # then write the slice for this cluster (the original `out[:, start_idx, stop_idx]` was a bug)\n                    logprob_i = head_logprob[:, self.cutoffs[0] + i - 1, None] + tail_logprob_i\n                    out[:, start_idx:stop_idx] = logprob_i\n\n            return out\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors, Facebook AI Research authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport inspect\nimport logging\nimport os\nfrom typing import Callable, Dict, Iterable, List, Optional, Tuple\n\nimport torch\nfrom torch import Tensor, device, dtype, nn\nfrom torch.nn import CrossEntropyLoss\nfrom torch.nn import functional as F\n\nfrom .activations import get_activation\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import (\n    DUMMY_INPUTS,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\ntry:\n    from torch.nn import Identity\nexcept ImportError:\n    # Older PyTorch compatibility\n    class Identity(nn.Module):\n        r\"\"\"A placeholder identity operator that is argument-insensitive.\n        \"\"\"\n\n        def __init__(self, *args, **kwargs):\n            super().__init__()\n\n        def forward(self, input):\n            return input\n\n\nclass ModuleUtilsMixin:\n    \"\"\"\n    A few utilities for torch.nn.Modules, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the module.\n        \"\"\"\n        params = filter(lambda x: x.requires_grad, self.parameters()) if only_trainable else self.parameters()\n        return sum(p.numel() for p in params)\n\n    @staticmethod\n    def _hook_rss_memory_pre_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_pre_forward = mem.rss\n        return None\n\n    @staticmethod\n    def _hook_rss_memory_post_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_post_forward = mem.rss\n        mem_rss_diff = module.mem_rss_post_forward - module.mem_rss_pre_forward\n        module.mem_rss_diff = mem_rss_diff + (module.mem_rss_diff if hasattr(module, \"mem_rss_diff\") else 0)\n        return None\n\n    def add_memory_hooks(self):\n        \"\"\" Add a memory hook before and after each sub-module forward pass to record increase in memory consumption.\n            Increase in memory consumption is stored in a `mem_rss_diff` attribute for each module and can be reset to zero with `model.reset_memory_hooks_state()`\n        \"\"\"\n        for module in 
self.modules():\n            module.register_forward_pre_hook(self._hook_rss_memory_pre_forward)\n            module.register_forward_hook(self._hook_rss_memory_post_forward)\n        self.reset_memory_hooks_state()\n\n    def reset_memory_hooks_state(self):\n        for module in self.modules():\n            module.mem_rss_diff = 0\n            module.mem_rss_post_forward = 0\n            module.mem_rss_pre_forward = 0\n\n    @property\n    def device(self) -> device:\n        \"\"\"\n        Get torch.device from module, assuming that the whole module has one device.\n        \"\"\"\n        try:\n            return next(self.parameters()).device\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].device\n\n    @property\n    def dtype(self) -> dtype:\n        \"\"\"\n        Get torch.dtype from module, assuming that the whole module has one dtype.\n        \"\"\"\n        try:\n            return next(self.parameters()).dtype\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].dtype\n\n    def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:\n        \"\"\"type: torch.Tensor -> torch.Tensor\"\"\"\n        if encoder_attention_mask.dim() == 3:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n        if encoder_attention_mask.dim() == 2:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow\n        # /transformer/transformer_layers.py#L270\n        # encoder_extended_attention_mask = (encoder_extended_attention_mask ==\n        # encoder_extended_attention_mask.transpose(-1, -2))\n        encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n\n        if self.dtype == torch.float16:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e4\n        elif self.dtype == torch.float32:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            raise ValueError(\n                \"{} not recognized. 
`dtype` should be set to either `torch.float32` or `torch.float16`\".format(\n                    self.dtype\n                )\n            )\n\n        return encoder_extended_attention_mask\n\n    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple, device: device) -> Tensor:\n        \"\"\"Makes broadcastable attention mask and causal mask so that future and maked tokens are ignored.\n\n        Arguments:\n            attention_mask: torch.Tensor with 1 indicating tokens to ATTEND to\n            input_shape: tuple, shape of input_ids\n            device: torch.Device, usually self.device\n\n        Returns:\n            torch.Tensor with dtype of attention_mask.dtype\n        \"\"\"\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        if attention_mask.dim() == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif attention_mask.dim() == 2:\n            # Provided a padding mask of dimensions [batch_size, seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]\n            if self.config.is_decoder:\n                batch_size, seq_length = input_shape\n                seq_ids = torch.arange(seq_length, device=device)\n                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]\n                # causal and attention masks must have same type with pytorch version < 1.3\n                causal_mask = causal_mask.to(attention_mask.dtype)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n        else:\n            raise ValueError(\n                \"Wrong shape for input_ids (shape {}) or attention_mask (shape {})\".format(\n                    input_shape, attention_mask.shape\n                )\n            )\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask: Tensor, num_hidden_layers: int, is_attention_chunked: bool = False) -> Tensor:\n        \"\"\"\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        attention_probs has shape bsz x n_heads x N x N\n        Arguments:\n            head_mask: torch.Tensor or None: has shape [num_heads] or [num_hidden_layers x num_heads]\n            num_hidden_layers: int\n        Returns:\n             Tensor of shape shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n             or list with [None] for each layer\n        \"\"\"\n        if head_mask is not None:\n            head_mask = 
self._convert_head_mask_to_5d(head_mask, num_hidden_layers)\n            if is_attention_chunked is True:\n                head_mask = head_mask.unsqueeze(-1)\n        else:\n            head_mask = [None] * num_hidden_layers\n\n        return head_mask\n\n    def _convert_head_mask_to_5d(self, head_mask, num_hidden_layers):\n        \"\"\"-> [num_hidden_layers x batch x num_heads x seq_length x seq_length]\"\"\"\n        if head_mask.dim() == 1:\n            head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)\n            head_mask = head_mask.expand(num_hidden_layers, -1, -1, -1, -1)\n        elif head_mask.dim() == 2:\n            head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer\n        assert head_mask.dim() == 5, f\"head_mask.dim != 5, instead {head_mask.dim()}\"\n        head_mask = head_mask.to(dtype=self.dtype)  # switch to fload if need + fp16 compatibility\n        return head_mask\n\n\nclass PreTrainedModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\" Base class for all models.\n\n        :class:`~transformers1.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to do a forward pass in the network.\n\n        Returns:\n            torch.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": torch.tensor(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__()\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. 
\"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    @property\n    def base_model(self):\n        return getattr(self, self.base_model_prefix, self)\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def set_input_embeddings(self, value: nn.Module):\n        \"\"\"\n        Set model's input embeddings\n\n        Args:\n            value (:obj:`nn.Module`):\n                A module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            base_model.set_input_embeddings(value)\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def tie_weights(self):\n        \"\"\"\n        Tie the weights between the input embeddings and the output embeddings.\n        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning\n        the weights instead.\n        \"\"\"\n        output_embeddings = self.get_output_embeddings()\n        if output_embeddings is not None:\n            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())\n\n    def _tie_or_clone_weights(self, output_embeddings, input_embeddings):\n        \"\"\" Tie or clone module weights depending of whether we are using TorchScript or not\n        \"\"\"\n        if self.config.torchscript:\n            output_embeddings.weight = nn.Parameter(input_embeddings.weight.clone())\n        else:\n            output_embeddings.weight = input_embeddings.weight\n\n        if getattr(output_embeddings, \"bias\", None) is not None:\n            output_embeddings.bias.data = torch.nn.functional.pad(\n                output_embeddings.bias.data,\n                (0, output_embeddings.weight.shape[0] - output_embeddings.bias.shape[0],),\n                \"constant\",\n                0,\n            )\n        if hasattr(output_embeddings, \"out_features\") and hasattr(input_embeddings, \"num_embeddings\"):\n            output_embeddings.out_features = input_embeddings.num_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. 
Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model.\n\n        Return: ``torch.nn.Embeddings``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed\n        model_embeds = base_model._resize_token_embeddings(new_num_tokens)\n        if new_num_tokens is None:\n            return model_embeds\n\n        # Update base model and current model config\n        self.config.vocab_size = new_num_tokens\n        base_model.vocab_size = new_num_tokens\n\n        # Tie weights again if needed\n        self.tie_weights()\n\n        return model_embeds\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.get_input_embeddings()\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.set_input_embeddings(new_embeddings)\n        return self.get_input_embeddings()\n\n    def _get_resized_embeddings(\n        self, old_embeddings: torch.nn.Embedding, new_num_tokens: Optional[int] = None\n    ) -> torch.nn.Embedding:\n        \"\"\" Build a resized Embedding Module from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            old_embeddings: ``torch.nn.Embedding``\n                Old embeddings to be resized.\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``torch.nn.Embedding``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        if new_num_tokens is None:\n            return old_embeddings\n\n        old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        if old_num_tokens == new_num_tokens:\n            return old_embeddings\n\n        # Build new embeddings\n        new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        new_embeddings.to(old_embeddings.weight.device)\n\n        # initialize all new embeddings (in particular added tokens)\n        self._init_weights(new_embeddings)\n\n        # Copy token embeddings from the previous weights\n        num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        return new_embeddings\n\n    def init_weights(self):\n        \"\"\" Initialize and prunes weights if needed. 
\"\"\"\n        # Initialize weights\n        self.apply(self._init_weights)\n\n        # Prune heads if needed\n        if self.config.pruned_heads:\n            self.prune_heads(self.config.pruned_heads)\n\n        # Tie weights if needed\n        self.tie_weights()\n\n    def prune_heads(self, heads_to_prune: Dict):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n                E.g. {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.\n        \"\"\"\n        # save new sets of pruned heads as union of previously stored pruned heads and newly pruned heads\n        for layer, heads in heads_to_prune.items():\n            union_heads = set(self.config.pruned_heads.get(layer, [])) | set(heads)\n            self.config.pruned_heads[layer] = list(union_heads)  # Unfortunately we have to store it as list for JSON\n\n        self.base_model._prune_heads(heads_to_prune)\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the `:func:`~transformers1.PreTrainedModel.from_pretrained`` class method.\n\n            Arguments:\n                save_directory: directory to which to save.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Only save the model itself if we are using distributed training\n        model_to_save = self.module if hasattr(self, \"module\") else self\n\n        # Attach architecture to the config\n        model_to_save.config.architectures = [model_to_save.__class__.__name__]\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, WEIGHTS_NAME)\n\n        if getattr(self.config, \"xla_device\", False):\n            import torch_xla.core.xla_model as xm\n\n            if xm.is_master_ordinal():\n                # Save configuration file\n                model_to_save.config.save_pretrained(save_directory)\n            # xm.save takes care of saving only from master\n            xm.save(model_to_save.state_dict(), output_model_file)\n        else:\n            model_to_save.config.save_pretrained(save_directory)\n            torch.save(model_to_save.state_dict(), output_model_file)\n\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained pytorch model from a pre-trained model configuration.\n\n        The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with ``model.train()``\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            
pretrained_model_name_or_path: either:\n              - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n              - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n              - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n              - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n              - None if you are both providing the configuration and state dictionary (resp. with keyword arguments ``config`` and ``state_dict``)\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n                    - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                    - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                    - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it has been loaded) and to initialize the model (e.g. ``output_attention=True``). Behaves differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will first be passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        state_dict = kwargs.pop(\"state_dict\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_tf = kwargs.pop(\"from_tf\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                proxies=proxies,\n                local_files_only=local_files_only,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")):\n                    # Load from a TF 1.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")\n                elif from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_tf` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + \".index\"],\n                            pretrained_model_name_or_path,\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                assert (\n                    from_tf\n                ), \"We found a TensorFlow checkpoint at {}, please set from_tf to True to load from 
this checkpoint\".format(\n                    pretrained_model_name_or_path + \".index\"\n                )\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(TF2_WEIGHTS_NAME if from_tf else WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n                    cache_dir=cache_dir,\n                    force_download=force_download,\n                    proxies=proxies,\n                    resume_download=resume_download,\n                    local_files_only=local_files_only,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {WEIGHTS_NAME}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if state_dict is None and not from_tf:\n            try:\n                state_dict = torch.load(resolved_archive_file, map_location=\"cpu\")\n            except Exception:\n                raise OSError(\n                    \"Unable to load weights from pytorch checkpoint file. \"\n                    \"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. \"\n                )\n\n        missing_keys = []\n        unexpected_keys = []\n        error_msgs = []\n\n        if from_tf:\n            if resolved_archive_file.endswith(\".index\"):\n                # Load from a TensorFlow 1.X checkpoint - provided by original authors\n                model = cls.load_tf_weights(model, config, resolved_archive_file[:-6])  # Remove the '.index'\n            else:\n                # Load from our TensorFlow 2.0 checkpoints\n                try:\n                    from transformers import load_tf2_checkpoint_in_pytorch_model\n\n                    model = load_tf2_checkpoint_in_pytorch_model(model, resolved_archive_file, allow_missing_keys=True)\n                except ImportError:\n                    logger.error(\n                        \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n                        \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n                    )\n                    raise\n        else:\n            # Convert old format to new format if needed from a PyTorch state_dict\n            old_keys = []\n            new_keys = []\n            for key in state_dict.keys():\n                new_key = None\n                if \"gamma\" in key:\n                    new_key = key.replace(\"gamma\", \"weight\")\n                if \"beta\" in key:\n                    new_key = key.replace(\"beta\", \"bias\")\n                if new_key:\n                    old_keys.append(key)\n                    new_keys.append(new_key)\n            for old_key, new_key in zip(old_keys, new_keys):\n                state_dict[new_key] = state_dict.pop(old_key)\n\n            # copy state_dict so _load_from_state_dict can modify it\n            metadata = getattr(state_dict, \"_metadata\", None)\n            state_dict = state_dict.copy()\n            if metadata is not None:\n                state_dict._metadata = metadata\n\n            # PyTorch's `_load_from_state_dict` does not copy parameters in a module's descendants\n            # so we need to apply the function recursively.\n            def load(module: nn.Module, prefix=\"\"):\n                local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})\n                module._load_from_state_dict(\n                    state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs,\n                )\n                for name, child in module._modules.items():\n                    if child is not None:\n                        load(child, prefix + name + \".\")\n\n            # Make sure we are able to load base models as well as derived models (with heads)\n            start_prefix = \"\"\n            model_to_load = model\n            has_prefix_module = any(s.startswith(cls.base_model_prefix) for s in state_dict.keys())\n            if not hasattr(model, cls.base_model_prefix) and has_prefix_module:\n                start_prefix = cls.base_model_prefix + \".\"\n            if hasattr(model, cls.base_model_prefix) and not has_prefix_module:\n                model_to_load = getattr(model, cls.base_model_prefix)\n\n            load(model_to_load, prefix=start_prefix)\n\n            if model.__class__.__name__ != model_to_load.__class__.__name__:\n                base_model_state_dict = model_to_load.state_dict().keys()\n                head_model_state_dict_without_base_prefix = [\n                    key.split(cls.base_model_prefix + \".\")[-1] for key in model.state_dict().keys()\n                ]\n\n                missing_keys.extend(head_model_state_dict_without_base_prefix - base_model_state_dict)\n\n            if len(missing_keys) > 0:\n                logger.info(\n                    \"Weights of {} not initialized from pretrained model: {}\".format(\n                        model.__class__.__name__, missing_keys\n                    )\n                )\n            if len(unexpected_keys) > 0:\n                logger.info(\n                    \"Weights from pretrained model not used in {}: {}\".format(\n                        model.__class__.__name__, unexpected_keys\n                    )\n                )\n            if len(error_msgs) > 0:\n                raise RuntimeError(\n                    \"Error(s) in loading state_dict for {}:\\n\\t{}\".format(\n                        
model.__class__.__name__, \"\\n\\t\".join(error_msgs)\n                    )\n                )\n        model.tie_weights()  # make sure token embedding weights are still tied if needed\n\n        # Set model in evaluation mode to deactivate DropOut modules by default\n        model.eval()\n\n        if output_loading_info:\n            loading_info = {\n                \"missing_keys\": missing_keys,\n                \"unexpected_keys\": unexpected_keys,\n                \"error_msgs\": error_msgs,\n            }\n            return model, loading_info\n\n        if hasattr(config, \"xla_device\") and config.xla_device:\n            import torch_xla.core.xla_model as xm\n\n            model = xm.send_cpu_data_to_device(model, xm.xla_device())\n            model.to(xm.xla_device())\n\n        return model\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        return {\"input_ids\": input_ids}\n\n    def prepare_logits_for_generation(self, logits, **kwargs):\n        return logits\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):\n        \"\"\"repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). \"\"\"\n        for i in range(batch_size * num_beams):\n            for previous_token in set(prev_output_tokens[i].tolist()):\n                # if score < 0 then repetition penalty has to multiplied to reduce the previous token probability\n                if lprobs[i, previous_token] < 0:\n                    lprobs[i, previous_token] *= repetition_penalty\n                else:\n                    lprobs[i, previous_token] /= repetition_penalty\n\n    @torch.no_grad()\n    def generate(\n        self,\n        input_ids: Optional[torch.LongTensor] = None,\n        max_length: Optional[int] = None,\n        min_length: Optional[int] = None,\n        do_sample: Optional[bool] = None,\n        early_stopping: Optional[bool] = None,\n        num_beams: Optional[int] = None,\n        temperature: Optional[float] = None,\n        top_k: Optional[int] = None,\n        top_p: Optional[float] = None,\n        repetition_penalty: Optional[float] = None,\n        bad_words_ids: Optional[Iterable[int]] = None,\n        bos_token_id: Optional[int] = None,\n        pad_token_id: Optional[int] = None,\n        eos_token_id: Optional[int] = None,\n        length_penalty: Optional[float] = None,\n        no_repeat_ngram_size: Optional[int] = None,\n        num_return_sequences: Optional[int] = None,\n        attention_mask: Optional[torch.LongTensor] = None,\n        decoder_start_token_id: Optional[int] = None,\n        use_cache: Optional[bool] = None,\n        **model_specific_kwargs\n    ) -> torch.LongTensor:\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. 
_`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `torch.LongTensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated. Between `min_length` and infinity. Defaults to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated. Between 0 and infinity. Defaults to 0.\n\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                If set to `True` beam search is stopped when at least `num_beams` sentences are finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Defaults to 1.\n\n            temperature: (`optional`) float\n                The value used to modulate the next token probabilities. Must be strictly positive. Defaults to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity. Defaults to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of the highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Defaults to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Defaults to 1.0.\n\n            pad_token_id: (`optional`) int\n                Padding token. Defaults to the model-specific pad_token_id, or None if it does not exist.\n\n            bos_token_id: (`optional`) int\n                BOS token. Defaults to `bos_token_id` as defined in the model's config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to `eos_token_id` as defined in the model's config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Defaults to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. 
Default to 1.\n\n            attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.\n\n            model_specific_kwargs: (`optional`) dict\n                Additional model specific kwargs will be forwarded to the `forward` function of the model.\n\n        Return:\n\n            output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode 
input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated\n        \"\"\"\n\n        # We cannot generate if the model does not have a LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have a LM Head.\"\n                \"Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = input_ids.shape[0]  # overriden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, 
\"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), \"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictly positive.\"\n        assert (\n            isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0\n        ), \"`no_repeat_ngram_size` should be a positive integer.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictly positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = torch.full(\n                (batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,\n            )\n        else:\n            assert input_ids.dim() == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):\n            attention_mask = input_ids.ne(pad_token_id).long()\n        elif attention_mask is None:\n            attention_mask = input_ids.new_ones(input_ids.shape)\n\n        # set pad_token_id to eos_token_id if not set. Important that this is done after\n        # attention_mask is created\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        if hasattr(self.config, \"vocab_size\"):\n            vocab_size = self.config.vocab_size\n        elif (\n            self.config.is_encoder_decoder\n            and hasattr(self.config, \"decoder\")\n            and hasattr(self.config.decoder, \"vocab_size\")\n        ):\n            vocab_size = self.config.decoder.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = input_ids.shape[-1]\n            input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)\n            attention_mask = attention_mask.unsqueeze(1).expand(\n                batch_size, effective_batch_mult * num_beams, input_ids_len\n            )\n\n            input_ids = input_ids.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = attention_mask.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n            # create empty decoder_input_ids\n            input_ids = torch.full(\n                (effective_batch_size * num_beams, 1),\n                decoder_start_token_id,\n                dtype=torch.long,\n                device=next(self.parameters()).device,\n            )\n            cur_len = 
1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = (\n                torch.arange(batch_size)\n                .view(-1, 1)\n                .repeat(1, num_beams * effective_batch_mult)\n                .view(-1)\n                .to(input_ids.device)\n            )\n            # expand encoder_outputs\n            encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = input_ids.shape[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        
model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequence are generated independantly.\n        \"\"\"\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = input_ids.new(batch_size).fill_(1)\n        sent_lengths = input_ids.new(batch_size).fill_(max_length)\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(next_token_logits, batch_size, 1, input_ids, repetition_penalty)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                next_token_logits[:, eos_token_id] = -float(\"inf\")\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)\n                # Sample\n                probs = F.softmax(next_token_logits, dim=-1)\n                next_token = torch.multinomial(probs, num_samples=1).squeeze(1)\n            else:\n                # Greedy decoding\n                next_token = torch.argmax(next_token_logits, dim=-1)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)\n            cur_len = cur_len + 
1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()\n                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents.mul_((~eos_in_sents).long())\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if unfinished_sents.max() == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            decoded = input_ids.new(batch_size, sent_lengths.max().item()).fill_(pad_token_id)\n        else:\n            decoded = input_ids\n\n        for hypo_idx, hypo in enumerate(input_ids):\n            decoded[hypo_idx, : sent_lengths[hypo_idx]] = hypo[: sent_lengths[hypo_idx]]\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # scores for each sentence in the beam\n        beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores[:, 1:] = -1e9\n        beam_scores = beam_scores.view(-1)  # shape (batch_size * num_beams,)\n\n        # cache compute states\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n            outputs = 
self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(\n                    next_token_logits, batch_size, num_beams, input_ids, repetition_penalty,\n                )\n\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            if self.config.is_encoder_decoder and do_sample is False:\n                # TODO (PVP) still a bit hacky here - there might be a better solution\n                next_token_logits = self.prepare_logits_for_generation(\n                    next_token_logits, cur_len=cur_len, max_length=max_length\n                )\n\n            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                scores[:, eos_token_id] = -float(\"inf\")\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                num_batch_hypotheses = batch_size * num_beams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_batch_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                for i, banned_tokens in enumerate(banned_batch_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for i, banned_tokens in enumerate(banned_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            assert scores.shape == (batch_size * num_beams, vocab_size), \"Shapes of scores: {} != {}\".format(\n                scores.shape, (batch_size * num_beams, vocab_size)\n            )\n\n            if do_sample:\n                _scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n                # Top-p/top-k filtering\n                _scores = top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # re-organize to group the beam together to sample from all beam_idxs\n                _scores = _scores.contiguous().view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                probs = F.softmax(_scores, dim=-1)\n                next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)  # (batch_size, num_beams * 2)\n                # Compute next scores\n                
next_scores = torch.gather(_scores, -1, next_tokens)  # (batch_size, num_beams * 2)\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)\n                next_tokens = torch.gather(next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)\n\n            else:\n                next_scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                next_scores = next_scores.view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)\n\n            assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.item() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            input_ids[effective_beam_id].clone(), beam_token_score.item(),\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n 
                   next_scores[batch_idx].max().item(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = beam_scores.new([x[0] for x in next_batch_beam])\n            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])\n            beam_idx = input_ids.new([x[2] for x in next_batch_beam])\n\n            # re-order batch and update current length\n            input_ids = input_ids[beam_idx, :]\n            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            if done[batch_idx]:\n                continue\n\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert torch.all(\n                    next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths = input_ids.new(output_batch_size)\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                effective_batch_idx = output_num_return_sequences_per_batch * i + j\n                best_hyp = sorted_hyps.pop()[1]\n                
sent_lengths[effective_batch_idx] = len(best_hyp)\n                best.append(best_hyp)\n\n        # shorter batches are filled with pad_token\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined\"\n            sent_max_len = min(sent_lengths.max().item() + 1, max_length)\n            decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                decoded[i, : sent_lengths[i]] = hypo\n                if sent_lengths[i] < max_length:\n                    decoded[i, sent_lengths[i]] = eos_token_id\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:\n        return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:\n    \"\"\"Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())\n        return generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            
banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef top_k_top_p_filtering(\n    logits: Tensor,\n    top_k: int = 0,\n    top_p: float = 1.0,\n    filter_value: float = -float(\"Inf\"),\n    min_tokens_to_keep: int = 1,\n) -> Tensor:\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]\n        logits[indices_to_remove] = filter_value\n\n    if top_p < 1.0:\n        sorted_logits, sorted_indices = torch.sort(logits, descending=True)\n        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()\n        sorted_indices_to_remove[..., 0] = 0\n\n        # scatter sorted tensors to original indexing\n        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)\n        logits[indices_to_remove] = filter_value\n    return logits\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, 
best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass Conv1D(nn.Module):\n    def __init__(self, nf, nx):\n        \"\"\" Conv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__()\n        self.nf = nf\n        w = torch.empty(nx, nf)\n        nn.init.normal_(w, std=0.02)\n        self.weight = nn.Parameter(w)\n        self.bias = nn.Parameter(torch.zeros(nf))\n\n    def forward(self, x):\n        size_out = x.size()[:-1] + (self.nf,)\n        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)\n        x = x.view(*size_out)\n        return x\n\n\nclass PoolerStartLogits(nn.Module):\n    \"\"\" Compute SQuAD start_logits from sequence hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, p_mask=None):\n        \"\"\" Args:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape `(batch_size, seq_len)`\n                invalid position mask such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        x = self.dense(hidden_states).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerEndLogits(nn.Module):\n    \"\"\" Compute SQuAD end_logits from sequence hidden states and start token hidden state.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense_1 = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, p_mask=None):\n        \"\"\" Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to hidden_states\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n                Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of 
start_states, start_positions should be not None\"\n        if start_positions is not None:\n            slen, hsz = hidden_states.shape[-2:]\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions)  # shape (bsz, 1, hsz)\n            start_states = start_states.expand(-1, slen, -1)  # shape (bsz, slen, hsz)\n\n        x = self.dense_0(torch.cat([hidden_states, start_states], dim=-1))\n        x = self.activation(x)\n        x = self.LayerNorm(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerAnswerClass(nn.Module):\n    \"\"\" Compute SQuAD 2.0 answer class from classification and start tokens hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.dense_1 = nn.Linear(config.hidden_size, 1, bias=False)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, cls_index=None):\n        \"\"\"\n        Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to ``hidden_states``.\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span.\n            **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n                position of the CLS token. 
If None, take the last token.\n\n            note(Original repo):\n                no dependency on end_feature so that we can obtain one single `cls_logits`\n                for each sample\n        \"\"\"\n        hsz = hidden_states.shape[-1]\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of start_states, start_positions should be not None\"\n        if start_positions is not None:\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions).squeeze(-2)  # shape (bsz, hsz)\n\n        if cls_index is not None:\n            cls_index = cls_index[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            cls_token_state = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, hsz)\n        else:\n            cls_token_state = hidden_states[:, -1, :]  # shape (bsz, hsz)\n\n        x = self.dense_0(torch.cat([start_states, cls_token_state], dim=-1))\n        x = self.activation(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        return x\n\n\nclass SQuADHead(nn.Module):\n    r\"\"\" A SQuAD head inspired by XLNet.\n\n    Parameters:\n        config (:class:`~transformers.XLNetConfig`): Model configuration class with all the parameters of the model.\n\n    Inputs:\n        **hidden_states**: ``torch.FloatTensor`` of shape ``(batch_size, seq_len, hidden_size)``\n            hidden states of sequence tokens\n        **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the first token for the labeled span.\n        **end_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the last token for the labeled span.\n        **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n            position of the CLS token. 
If None, take the last token.\n        **is_impossible**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            Whether the question has a possible answer in the paragraph or not.\n        **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n            Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n            1.0 means token should be masked.\n\n    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n        **loss**: (`optional`, returned if both ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size,)``\n            Log probabilities for the ``is_impossible`` label of the answers.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n    def forward(\n        self, hidden_states, start_positions=None, end_positions=None, cls_index=None, is_impossible=None, p_mask=None,\n    ):\n        outputs = ()\n\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = 
loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n                total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)\n            cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits,) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n\n\nclass SequenceSummary(nn.Module):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to 
hidden_size). Default: False.\n            summary_activation: 'tanh' or another string => add an activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config: PretrainedConfig):\n        super().__init__()\n\n        self.summary_type = getattr(config, \"summary_type\", \"last\")\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.summary = Identity()\n        if hasattr(config, \"summary_use_proj\") and config.summary_use_proj:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = nn.Linear(config.hidden_size, num_classes)\n\n        activation_string = getattr(config, \"summary_activation\", None)\n        self.activation: Callable = (get_activation(activation_string) if activation_string else Identity())\n\n        self.first_dropout = Identity()\n        if hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0:\n            self.first_dropout = nn.Dropout(config.summary_first_dropout)\n\n        self.last_dropout = Identity()\n        if hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0:\n            self.last_dropout = nn.Dropout(config.summary_last_dropout)\n\n    def forward(self, hidden_states, cls_index=None):\n        \"\"\" hidden_states: float Tensor in shape [bsz, ..., seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... 
are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = hidden_states.mean(dim=1)\n        elif self.summary_type == \"cls_index\":\n            if cls_index is None:\n                cls_index = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2] - 1, dtype=torch.long,)\n            else:\n                cls_index = cls_index.unsqueeze(-1).unsqueeze(-1)\n                cls_index = cls_index.expand((-1,) * (cls_index.dim() - 1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, XX, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        output = self.first_dropout(output)\n        output = self.summary(output)\n        output = self.activation(output)\n        output = self.last_dropout(output)\n\n        return output\n\n\ndef create_position_ids_from_input_ids(input_ids, padding_idx):\n    \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n    padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n    `utils.make_positions`.\n\n    :param torch.Tensor x:\n    :return torch.Tensor:\n    \"\"\"\n    # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.\n    mask = input_ids.ne(padding_idx).int()\n    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask\n    return incremental_indices.long() + padding_idx\n\n\ndef prune_linear_layer(layer, index, dim=0):\n    \"\"\" Prune a linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if layer.bias is not None:\n        if dim == 1:\n            b = layer.bias.clone().detach()\n        else:\n            b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    if layer.bias is not None:\n        new_layer.bias.requires_grad = False\n        new_layer.bias.copy_(b.contiguous())\n        new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_conv1d_layer(layer, index, dim=1):\n    \"\"\" Prune a Conv1D layer (a model parameters) to keep only entries in index.\n        A Conv1D work as a Linear layer (see e.g. 
BERT) but the weights are transposed.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if dim == 0:\n        b = layer.bias.clone().detach()\n    else:\n        b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = Conv1D(new_size[1], new_size[0]).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    new_layer.bias.requires_grad = False\n    new_layer.bias.copy_(b.contiguous())\n    new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_layer(layer, index, dim=None):\n    \"\"\" Prune a Conv1D or nn.Linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    if isinstance(layer, nn.Linear):\n        return prune_linear_layer(layer, index, dim=0 if dim is None else dim)\n    elif isinstance(layer, Conv1D):\n        return prune_conv1d_layer(layer, index, dim=1 if dim is None else dim)\n    else:\n        raise ValueError(\"Can't prune layer of class {}\".format(layer.__class__))\n\n\ndef apply_chunking_to_forward(\n    chunk_size: int, chunk_dim: int, forward_fn: Callable[..., torch.Tensor], *input_tensors\n) -> torch.Tensor:\n    \"\"\"\n    This function chunks the `input_tensors` into smaller input tensor parts of size `chunk_size` over the dimension `chunk_dim`.\n    It then applies a layer `forward_fn` to each chunk independently to save memory.\n    If the `forward_fn` is independent across the `chunk_dim` this function will yield the\n    same result as not applying it.\n\n    Args:\n        chunk_size: int - the chunk size of a chunked tensor. 
`num_chunks` = `len(input_tensors[0]) / chunk_size`\n        chunk_dim: int - the dimension over which the input_tensors should be chunked\n        forward_fn: fn - the forward fn of the model\n        input_tensors: tuple(torch.Tensor) - the input tensors of `forward_fn` which are chunked\n    Returns:\n        a Tensor with the same shape the foward_fn would have given if applied\n\n\n    Examples::\n\n        # rename the usual forward() fn to forward_chunk()\n        def forward_chunk(self, hidden_states):\n            hidden_states = self.decoder(hidden_states)\n            return hidden_states\n\n        # implement a chunked forward function\n        def forward(self, hidden_states):\n            return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n    \"\"\"\n\n    assert len(input_tensors) > 0, \"{} has to be a tuple/list of tensors\".format(input_tensors)\n    tensor_shape = input_tensors[0].shape\n    assert all(\n        input_tensor.shape == tensor_shape for input_tensor in input_tensors\n    ), \"All input tenors have to be of the same shape\"\n\n    # inspect.signature exist since python 3.5 and is a python method -> no problem with backward compability\n    num_args_in_forward_chunk_fn = len(inspect.signature(forward_fn).parameters)\n    assert num_args_in_forward_chunk_fn == len(\n        input_tensors\n    ), \"forward_chunk_fn expects {} arguments, but only {} input tensors are given\".format(\n        num_args_in_forward_chunk_fn, len(input_tensors)\n    )\n\n    if chunk_size > 0:\n        assert (\n            input_tensors[0].shape[chunk_dim] % chunk_size == 0\n        ), \"The dimension to be chunked {} has to be a multiple of the chunk size {}\".format(\n            input_tensors[0][chunk_dim], chunk_size\n        )\n\n        num_chunks = input_tensors[0].shape[chunk_dim] // chunk_size\n\n        # chunk input tensor into tuples\n        input_tensors_chunks = tuple(input_tensor.chunk(num_chunks, dim=chunk_dim) for input_tensor in input_tensors)\n        # apply forward fn to every tuple\n        output_chunks = tuple(forward_fn(*input_tensors_chunk) for input_tensors_chunk in zip(*input_tensors_chunks))\n        # concatenate output at same dimension\n        return torch.cat(output_chunks, dim=chunk_dim)\n\n    return forward_fn(*input_tensors)\n"
  },
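The file above is a vendored copy of the HuggingFace `transformers` generation and model utilities; the docstring of `top_k_top_p_filtering` describes top-k and nucleus (top-p) filtering of next-token logits. Below is a minimal usage sketch of that helper, under the assumption that the function defined above is importable (or pasted) into scope; the tensor sizes and the sampling step are made up purely for illustration and are not part of the original repository.

```python
# Minimal sketch (assumption: top_k_top_p_filtering from the vendored
# modeling utilities above is available in the current scope).
import torch
import torch.nn.functional as F

# Fake next-token logits for a batch of 2 sequences over a 10-token vocabulary.
logits = torch.randn(2, 10)

# Keep the 5 highest-scoring tokens, then keep the smallest prefix of the
# sorted distribution whose cumulative probability reaches 0.9 (nucleus
# filtering); all other logits are set to -inf. The helper modifies the
# tensor in place and returns it, so we pass a clone to keep the original.
filtered = top_k_top_p_filtering(logits.clone(), top_k=5, top_p=0.9)

# Sample the next token id from the filtered distribution; -inf logits
# get zero probability and can never be drawn.
probs = F.softmax(filtered, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)
print(next_tokens.shape)  # torch.Size([2, 1])
```

Note that because the filtering is done in place, callers that still need the raw logits (e.g. for scoring) should clone before filtering, as in the sketch above.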
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, SequenceSummary, SQuADHead, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    alen = torch.arange(slen, dtype=torch.long, device=lengths.device)\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        assert lengths.max().item() <= slen\n        mask = alen < lengths[:, None]\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    bs = lengths.size(0)\n    if causal:\n        attn_mask = alen[None, None, :].repeat(bs, slen, 1) <= alen[None, :, None]\n    else:\n        attn_mask = mask\n\n    # sanity check\n    assert mask.size() == (bs, slen)\n    assert causal is False or attn_mask.size() == (bs, slen, slen)\n\n    return mask, attn_mask\n\n\nclass MultiHeadAttention(nn.Module):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config):\n        super().__init__()\n        self.layer_id = next(MultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        self.dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(dim, dim)\n        self.k_lin = nn.Linear(dim, dim)\n        self.v_lin = nn.Linear(dim, dim)\n        self.out_lin = nn.Linear(dim, dim)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, 
attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input, mask, kv=None, cache=None, head_mask=None):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = input.size()\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = kv.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if mask.dim() == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)\n        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, qlen, klen)\n\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # 
(bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TransformerFFN(nn.Module):\n    def __init__(self, in_dim, dim_hidden, out_dim, config):\n        super().__init__()\n        self.dropout = config.dropout\n        self.lin1 = nn.Linear(in_dim, dim_hidden)\n        self.lin2 = nn.Linear(dim_hidden, out_dim)\n        self.act = gelu if config.gelu_activation else F.relu\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        return x\n\n\nclass XLMPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    load_tf_weights = None\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    @property\n    def dummy_inputs(self):\n        inputs_list = torch.tensor([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights. \"\"\"\n        if isinstance(module, nn.Embedding):\n            if self.config is not None and self.config.embed_init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.embed_init_std)\n        if isinstance(module, nn.Linear):\n            if self.config is not None and self.config.init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.init_std)\n                if hasattr(module, \"bias\") and module.bias is not None:\n                    nn.init.constant_(module.bias, 0.0)\n        if isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        langs (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass XLMModel(XLMPreTrainedModel):\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        self.dropout = config.dropout\n        self.attention_dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, self.dim)\n        if config.sinusoidal_embeddings:\n            create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = nn.Embedding(self.n_langs, self.dim)\n        self.embeddings = nn.Embedding(self.n_words, self.dim, padding_idx=self.pad_index)\n        self.layer_norm_emb = nn.LayerNorm(self.dim, eps=config.layer_norm_eps)\n\n        # transformer layers\n        self.attentions = nn.ModuleList()\n        self.layer_norm1 = nn.ModuleList()\n        self.ffns = nn.ModuleList()\n        self.layer_norm2 = nn.ModuleList()\n        # if self.is_decoder:\n        #     self.layer_norm15 = nn.ModuleList()\n        #     
self.encoder_attn = nn.ModuleList()\n\n        for _ in range(self.n_layers):\n            self.attentions.append(MultiHeadAttention(self.n_heads, self.dim, config=config))\n            self.layer_norm1.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(TransformerFFN(self.dim, self.hidden_dim, self.dim, config=config))\n            self.layer_norm2.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.attentions[layer].prune_heads(heads)\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = F.dropout(attn, p=self.dropout, training=self.training)\n            tensor = tensor + attn\n            
tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass XLMPredLayer(nn.Module):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        dim = config.emb_dim\n\n        if config.asm is False:\n            self.proj = nn.Linear(dim, config.n_words, bias=True)\n        else:\n            self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n                in_features=dim,\n                n_classes=config.n_words,\n                cutoffs=config.asm_cutoffs,\n                div_value=config.asm_div_value,\n                head_bias=True,  # default is False\n            )\n\n    def forward(self, x, y=None):\n        \"\"\" Compute the loss, and optionally the scores.\n        \"\"\"\n        outputs = ()\n        if self.asm is False:\n            scores = self.proj(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                loss = F.cross_entropy(scores.view(-1, self.n_words), y.view(-1), reduction=\"elementwise_mean\")\n                outputs = (loss,) + outputs\n        else:\n            scores = self.proj.log_prob(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                _, loss = self.proj(x, y)\n                outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMWithLMHeadModel(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = XLMModel(config)\n        self.pred_layer = XLMPredLayer(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.pred_layer.proj\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = input_ids.shape[0]\n        mask_token = torch.full((effective_batch_size, 1), mask_token_id, dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, mask_token], dim=1)\n        if lang_id is not None:\n            langs = torch.full_like(input_ids, lang_id)\n        else:\n            langs = None\n        return {\"input_ids\": input_ids, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMWithLMHeadModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0) 
 # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output, labels)\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForSequenceClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.sequence_summary = SequenceSummary(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import XLMTokenizer, XLMForSequenceClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        logits = self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnsweringSimple(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = 
torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (\n            start_logits,\n            end_logits,\n        )\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnswering(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = SQuADHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnswering\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            
position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n\n        outputs = self.qa_outputs(\n            output,\n            start_positions=start_positions,\n            end_positions=end_positions,\n            cls_index=cls_index,\n            is_impossible=is_impossible,\n            p_mask=p_mask,\n        )\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForTokenClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForTokenClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-100-1280')\n        model = XLMForTokenClassification.from_pretrained('xlm-mlm-100-1280')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n 
       outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-roberta-base\",\n    \"xlm-roberta-large\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\",\n    \"xlm-roberta-large-finetuned-conll03-english\",\n    \"xlm-roberta-large-finetuned-conll03-german\",\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/modeling_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu_new, swish\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PoolerAnswerClass, PoolerEndLogits, PoolerStartLogits, PreTrainedModel, SequenceSummary\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef build_tf_xlnet_to_pytorch_map(model, config, tf_weights=None):\n    \"\"\" A map of modules from TF to PyTorch.\n        I use a map to keep the PyTorch model as\n        identical to the original PyTorch model as possible.\n    \"\"\"\n\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        if hasattr(model, \"lm_loss\"):\n            # We will load also the output bias\n            tf_to_pt_map[\"model/lm_loss/bias\"] = model.lm_loss.bias\n        if hasattr(model, \"sequence_summary\") and \"model/sequnece_summary/summary/kernel\" in tf_weights:\n            # We will load also the sequence summary\n            tf_to_pt_map[\"model/sequnece_summary/summary/kernel\"] = model.sequence_summary.summary.weight\n            tf_to_pt_map[\"model/sequnece_summary/summary/bias\"] = model.sequence_summary.summary.bias\n        if (\n            hasattr(model, \"logits_proj\")\n            and config.finetuning_task is not None\n            and \"model/regression_{}/logit/kernel\".format(config.finetuning_task) in tf_weights\n        ):\n            tf_to_pt_map[\"model/regression_{}/logit/kernel\".format(config.finetuning_task)] = model.logits_proj.weight\n            tf_to_pt_map[\"model/regression_{}/logit/bias\".format(config.finetuning_task)] = model.logits_proj.bias\n\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings and output\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/word_embedding/lookup_table\": model.word_embedding.weight,\n            \"model/transformer/mask_emb/mask_emb\": model.mask_emb,\n        }\n    )\n\n    # Transformer blocks\n    for i, b in enumerate(model.layer):\n        layer_str = \"model/transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.rel_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": b.rel_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.rel_attn.o,\n                layer_str + 
\"rel_attn/q/kernel\": b.rel_attn.q,\n                layer_str + \"rel_attn/k/kernel\": b.rel_attn.k,\n                layer_str + \"rel_attn/r/kernel\": b.rel_attn.r,\n                layer_str + \"rel_attn/v/kernel\": b.rel_attn.v,\n                layer_str + \"ff/LayerNorm/gamma\": b.ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.ff.layer_1.weight,\n                layer_str + \"ff/layer_1/bias\": b.ff.layer_1.bias,\n                layer_str + \"ff/layer_2/kernel\": b.ff.layer_2.weight,\n                layer_str + \"ff/layer_2/bias\": b.ff.layer_2.bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        r_s_list = []\n        seg_embed_list = []\n        for b in model.layer:\n            r_r_list.append(b.rel_attn.r_r_bias)\n            r_w_list.append(b.rel_attn.r_w_bias)\n            r_s_list.append(b.rel_attn.r_s_bias)\n            seg_embed_list.append(b.rel_attn.seg_embed)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n        r_s_list = [model.r_s_bias]\n        seg_embed_list = [model.seg_embed]\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/r_r_bias\": r_r_list,\n            \"model/transformer/r_w_bias\": r_w_list,\n            \"model/transformer/r_s_bias\": r_s_list,\n            \"model/transformer/seg_embed\": seg_embed_list,\n        }\n    )\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_xlnet(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_xlnet_to_pytorch_map(model, config, tf_weights)\n\n    for name, pointer in tf_to_pt_map.items():\n        logger.info(\"Importing {}\".format(name))\n        if name not in tf_weights:\n            logger.info(\"{} not in tf pre-trained weights, skipping\".format(name))\n            continue\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name and (\"ff\" in name or \"summary\" in name or \"logit\" in name):\n            logger.info(\"Transposing\")\n            array = np.transpose(array)\n        if isinstance(pointer, list):\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nACT2FN = {\"gelu\": gelu_new, \"relu\": torch.nn.functional.relu, \"swish\": swish}\n\n\nXLNetLayerNorm = nn.LayerNorm\n\n\nclass XLNetRelativeAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n\n        self.q = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.k = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.v = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.o = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.r = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n\n        self.r_r_bias = 
nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_s_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.seg_embed = nn.Parameter(torch.FloatTensor(2, self.n_head, self.d_head))\n\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def rel_shift(x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = x.shape\n\n        x = x.reshape(x_size[1], x_size[0], x_size[2], x_size[3])\n        x = x[1:, ...]\n        x = x.reshape(x_size[0], x_size[1] - 1, x_size[2], x_size[3])\n        # x = x[:, 0:klen, :, :]\n        x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    @staticmethod\n    def rel_shift_bnij(x, klen=-1):\n        x_size = x.shape\n\n        x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])\n        x = x[:, :, 1:, :]\n        x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3] - 1)\n        # Note: the tensor-slice form was faster in my testing than torch.index_select\n        #       However, tracing doesn't like the nature of the slice, and if klen changes\n        #       during the run then it'll fail, whereas index_select will be fine.\n        x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))\n        # x = x[:, :, :, :klen]\n\n        return x\n\n    def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        # content based attention score\n        ac = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift_bnij(bd, klen=ac.shape[3])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = torch.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = torch.einsum(\"ijbs,ibns->bnij\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == torch.float16:\n                attn_score = attn_score - 65500 * torch.einsum(\"ijbn->bnij\", attn_mask)\n            else:\n                attn_score = attn_score - 1e30 * torch.einsum(\"ijbn->bnij\", attn_mask)\n\n        # attention probability\n        attn_prob = F.softmax(attn_score, dim=3)\n        attn_prob = self.dropout(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * torch.einsum(\"ijbn->bnij\", head_mask)\n\n        # attention output\n        attn_vec = torch.einsum(\"bnij,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, torch.einsum(\"bnij->ijbn\", attn_prob)\n\n        return attn_vec\n\n    def post_attention(self, h, attn_vec, residual=True):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to 
`d_model`)\n        attn_out = torch.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out)\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def forward(self, h, g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None):\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec_h)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = torch.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = torch.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = torch.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention(g, attn_vec_g)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n              
  attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + (attn_prob,)\n        return outputs\n\n\nclass XLNetFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.layer_1 = nn.Linear(config.d_model, config.d_inner)\n        self.layer_2 = nn.Linear(config.d_inner, config.d_model)\n        self.dropout = nn.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def forward(self, inp):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output)\n        output = self.layer_2(output)\n        output = self.dropout(output)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass XLNetLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.rel_attn = XLNetRelativeAttention(config)\n        self.ff = XLNetFeedForward(config)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(\n        self, output_h, output_g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None\n    ):\n        outputs = self.rel_attn(\n            output_h,\n            output_g,\n            attn_mask_h,\n            attn_mask_g,\n            r,\n            seg_mat,\n            mems=mems,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n        )\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g)\n        output_h = self.ff(output_h)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass XLNetPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    load_tf_weights = load_tf_weights_in_xlnet\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, nn.Linear) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, XLNetLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        elif isinstance(module, XLNetRelativeAttention):\n            for param in [\n                module.q,\n                module.k,\n                module.v,\n                module.o,\n                module.r,\n                module.r_r_bias,\n                module.r_s_bias,\n                module.r_w_bias,\n                module.seg_embed,\n            ]:\n                param.data.normal_(mean=0.0, 
std=self.config.initializer_range)\n        elif isinstance(module, XLNetModel):\n            module.mask_emb.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n            `use_cache` has to be set to `True` to make use of `mems`.\n        perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token. The classifier token should be represented by a ``2``.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n\n        self.word_embedding = nn.Embedding(config.vocab_size, config.d_model)\n        self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model))\n        self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)])\n        self.dropout = nn.Dropout(config.dropout)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_embedding = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: Sequence length\n            mlen: Mask length\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = torch.ones([qlen, qlen])\n        mask_up = torch.triu(attn_mask, diagonal=1)\n        attn_mask_pad = torch.zeros([qlen, mlen])\n        ret = torch.cat([attn_mask_pad, mask_up], dim=1)\n        if self.same_length:\n            mask_lo = torch.tril(attn_mask, diagonal=-1)\n            ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)\n\n        ret = ret.to(self.device)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        # cache hidden states into memory.\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len :]\n\n        return new_mem.detach()\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = torch.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = torch.cat([torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)], dim=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = pos_emb.expand(-1, bsz, -1)\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None):\n        # create relative positional encoding.\n        freq_seq = torch.arange(0, self.d_model, 2.0, dtype=torch.float)\n        inv_freq = 1 / torch.pow(10000, (freq_seq / 
self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` {}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = torch.arange(beg, end, -1.0, dtype=torch.float)\n            bwd_pos_seq = torch.arange(-beg, -end, 1.0, dtype=torch.float)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n                bwd_pos_seq = bwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)\n        else:\n            fwd_pos_seq = torch.arange(beg, end, -1.0)\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        pos_emb = pos_emb.to(self.device)\n        return pos_emb\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetModel.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=False)).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.shape[0], input_ids.shape[1]\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = token_type_ids.transpose(0, 1).contiguous() if token_type_ids is not None else None\n        input_mask = input_mask.transpose(0, 1).contiguous() if input_mask is not None else None\n        attention_mask = attention_mask.transpose(0, 1).contiguous() if attention_mask is not None else None\n        perm_mask = perm_mask.permute(1, 2, 0).contiguous() if perm_mask is not None else None\n        target_mapping = target_mapping.permute(1, 2, 0).contiguous() if target_mapping is not None else None\n\n        mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = self.dtype\n        device = self.device\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        
assert input_mask is None or attention_mask is None, \"You can only use one of input_mask (uses 1 for padding) \"\n        \"or attention_mask (uses 0 for padding, added for compatbility with BERT). Please choose one.\"\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - attention_mask\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = torch.zeros([data_mask.shape[0], mlen, bsz]).to(data_mask)\n                data_mask = torch.cat([mems_mask, data_mask], dim=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = (attn_mask > 0).to(dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -torch.eye(qlen).to(attn_mask)\n            if mlen > 0:\n                non_tgt_mask = torch.cat([torch.zeros([qlen, mlen]).to(attn_mask), non_tgt_mask], dim=-1)\n            non_tgt_mask = ((attn_mask + non_tgt_mask[:, :, None, None]) > 0).to(attn_mask)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k)\n        if target_mapping is not None:\n            word_emb_q = self.mask_emb.expand(target_mapping.shape[0], bsz, -1)\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = torch.zeros([mlen, bsz], dtype=torch.long, device=device)\n                cat_ids = torch.cat([mem_pad, token_type_ids], dim=0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = (token_type_ids[:, None] != cat_ids[None, :]).long()\n            seg_mat = F.one_hot(seg_mat, num_classes=2).to(dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz)\n        pos_emb = self.dropout(pos_emb)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = 
head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n        if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                # cache new mems\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                output_h,\n                output_g,\n                attn_mask_h=non_tgt_mask,\n                attn_mask_g=attn_mask,\n                r=pos_emb,\n                seg_mat=seg_mat,\n                mems=mems[i],\n                target_mapping=target_mapping,\n                head_mask=head_mask[i],\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (output.permute(1, 0, 2).contiguous(),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(hs.permute(1, 0, 2).contiguous() for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            if target_mapping is not None:\n                # when target_mapping is provided, there are 2-tuple of attentions\n                attentions = tuple(\n                    tuple(att_stream.permute(2, 3, 0, 1).contiguous() for att_stream in t) for t in attentions\n                )\n            else:\n                attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetLMHeadModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.attn_type = config.attn_type\n        self.same_length = config.same_length\n\n        self.transformer = XLNetModel(config)\n        self.lm_loss = nn.Linear(config.d_model, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_loss\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = input_ids.shape[0]\n        dummy_token = torch.zeros((effective_batch_size, 1), dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = input_ids.shape[1]\n        perm_mask = torch.zeros(\n            (effective_batch_size, sequence_length, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        perm_mask[:, :, -1] = 1.0\n\n        # We'll only predict the last token\n        target_mapping = torch.zeros(\n            (effective_batch_size, 1, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        target_mapping[0, 0, -1] = 1.0\n\n        inputs = {\n            \"input_ids\": input_ids,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`, defaults to :obj:`None`):\n            Labels for masked language modeling.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n            The labels should correspond to the masked input words that should be predicted and depends on `target_mapping`. 
Note in order to perform standard auto-regressive language modeling a `<mask>` token has to be added to the `input_ids` (see `prepare_inputs_for_generation` fn and examples below)\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored, the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetLMHeadModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        # The same way can the XLNetLMHeadModel be used to be 
trained by standard auto-regressive language modeling.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        labels = torch.tensor(tokenizer.encode(\"cute\", add_special_tokens=False)).unsqueeze(0)\n        assert labels.shape[0] == 1, 'only one word will be predicted'\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token as is done in standard auto-regressive lm training\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)\n        loss, next_token_logits = outputs[:2]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        logits = self.lm_loss(transformer_outputs[0])\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForSequenceClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForSequenceClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForTokenClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForTokenClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RACE/SWAG tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForMultipleChoice(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        input_mask=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForMultipleChoice\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForMultipleChoice.from_pretrained('xlnet-base-cased')\n\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_input_mask = input_mask.view(-1, input_mask.size(-1)) if input_mask is not None else None\n\n        transformer_outputs = self.transformer(\n            flat_input_ids,\n            token_type_ids=flat_token_type_ids,\n            input_mask=flat_input_mask,\n            attention_mask=flat_attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n        reshaped_logits = logits.view(-1, num_choices)\n        outputs = (reshaped_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to 
compute `span start logits` and `span end logits`). \"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnswering(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.transformer = XLNetModel(config)\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnswering\n        import torch\n\n        tokenizer =  XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnswering.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n    
            total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\n                \"blh,bl->bh\", hidden_states, start_log_probs\n            )  # get the representation of START as weighted sum of hidden states\n            cls_logits = self.answer_class(\n                hidden_states, start_states=start_states, cls_index=cls_index\n            )  # Shape (batch size,): one single `cls_logits` for each sample\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/optimization.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch optimization for BERT model.\"\"\"\n\nimport logging\nimport math\n\nimport torch\nfrom torch.optim import Optimizer\nfrom torch.optim.lr_scheduler import LambdaLR\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_constant_schedule(optimizer, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate.\n    \"\"\"\n    return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch)\n\n\ndef get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate preceded by a warmup\n    period during which the learning rate increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1.0, num_warmup_steps))\n        return 1.0\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)\n\n\ndef get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases linearly after\n    linearly increasing during a warmup period.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        return max(\n            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))\n        )\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function between 0 and `pi * cycles` after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_with_hard_restarts_schedule_with_warmup(\n    optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1\n):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function with several hard restarts, after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = 
float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        if progress >= 1.0:\n            return 0.0\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((float(num_cycles) * progress) % 1.0))))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\nclass AdamW(Optimizer):\n    \"\"\" Implements Adam algorithm with weight decay fix.\n\n    Parameters:\n        lr (float): learning rate. Default 1e-3.\n        betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999)\n        eps (float): Adams epsilon. Default: 1e-6\n        weight_decay (float): Weight decay. Default: 0.0\n        correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.\n    \"\"\"\n\n    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True):\n        if lr < 0.0:\n            raise ValueError(\"Invalid learning rate: {} - should be >= 0.0\".format(lr))\n        if not 0.0 <= betas[0] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[0]))\n        if not 0.0 <= betas[1] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[1]))\n        if not 0.0 <= eps:\n            raise ValueError(\"Invalid epsilon value: {} - should be >= 0.0\".format(eps))\n        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)\n        super().__init__(params, defaults)\n\n    def step(self, closure=None):\n        \"\"\"Performs a single optimization step.\n\n        Arguments:\n            closure (callable, optional): A closure that reevaluates the model\n                and returns the loss.\n        \"\"\"\n        loss = None\n        if closure is not None:\n            loss = closure()\n\n        for group in self.param_groups:\n            for p in group[\"params\"]:\n                if p.grad is None:\n                    continue\n                grad = p.grad.data\n                if grad.is_sparse:\n                    raise RuntimeError(\"Adam does not support sparse gradients, please consider SparseAdam instead\")\n\n                state = self.state[p]\n\n                # State initialization\n                if len(state) == 0:\n                    state[\"step\"] = 0\n                    # Exponential moving average of gradient values\n                    state[\"exp_avg\"] = torch.zeros_like(p.data)\n                    # Exponential moving average of squared gradient values\n                    state[\"exp_avg_sq\"] = torch.zeros_like(p.data)\n\n                exp_avg, exp_avg_sq = state[\"exp_avg\"], state[\"exp_avg_sq\"]\n                beta1, beta2 = group[\"betas\"]\n\n                state[\"step\"] += 1\n\n                # Decay the first and second moment running average coefficient\n                # In-place operations to update the averages at the same time\n                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)\n                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)\n                denom = exp_avg_sq.sqrt().add_(group[\"eps\"])\n\n                step_size = group[\"lr\"]\n                if group[\"correct_bias\"]:  # No bias correction for Bert\n                    bias_correction1 = 1.0 - beta1 ** state[\"step\"]\n                    bias_correction2 = 1.0 - beta2 ** state[\"step\"]\n                    step_size = 
step_size * math.sqrt(bias_correction2) / bias_correction1\n\n                p.data.addcdiv_(exp_avg, denom, value=-step_size)\n\n                # Just adding the square of the weights to the loss function is *not*\n                # the correct way of using L2 regularization/weight decay with Adam,\n                # since that will interact with the m and v parameters in strange ways.\n                #\n                # Instead we want to decay the weights in a manner that doesn't interact\n                # with the m/v parameters. This is equivalent to adding the square\n                # of the weights to the loss with plain (non-momentum) SGD.\n                # Add weight decay at the end (fixed version)\n                if group[\"weight_decay\"] > 0.0:\n                    p.data.add_(p.data, alpha=-group[\"lr\"] * group[\"weight_decay\"])\n\n        return loss\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/optimization_tf.py",
    "content": "# Copyright 2019 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"Functions and classes related to optimization (weight updates).\"\"\"\n\n\nimport re\n\nimport tensorflow as tf\n\n\nclass WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):\n    \"\"\"Applies a warmup schedule on a given learning rate decay schedule.\"\"\"\n\n    def __init__(\n        self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None,\n    ):\n        super().__init__()\n        self.initial_learning_rate = initial_learning_rate\n        self.warmup_steps = warmup_steps\n        self.power = power\n        self.decay_schedule_fn = decay_schedule_fn\n        self.name = name\n\n    def __call__(self, step):\n        with tf.name_scope(self.name or \"WarmUp\") as name:\n            # Implements polynomial warmup. i.e., if global_step < warmup_steps, the\n            # learning rate will be `global_step/num_warmup_steps * init_lr`.\n            global_step_float = tf.cast(step, tf.float32)\n            warmup_steps_float = tf.cast(self.warmup_steps, tf.float32)\n            warmup_percent_done = global_step_float / warmup_steps_float\n            warmup_learning_rate = self.initial_learning_rate * tf.math.pow(warmup_percent_done, self.power)\n            return tf.cond(\n                global_step_float < warmup_steps_float,\n                lambda: warmup_learning_rate,\n                lambda: self.decay_schedule_fn(step),\n                name=name,\n            )\n\n    def get_config(self):\n        return {\n            \"initial_learning_rate\": self.initial_learning_rate,\n            \"decay_schedule_fn\": self.decay_schedule_fn,\n            \"warmup_steps\": self.warmup_steps,\n            \"power\": self.power,\n            \"name\": self.name,\n        }\n\n\ndef create_optimizer(init_lr, num_train_steps, num_warmup_steps, end_lr=0.0, optimizer_type=\"adamw\"):\n    \"\"\"Creates an optimizer with learning rate schedule.\"\"\"\n    # Implements linear decay of the learning rate.\n    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n        initial_learning_rate=init_lr, decay_steps=num_train_steps, end_learning_rate=end_lr,\n    )\n    if num_warmup_steps:\n        lr_schedule = WarmUp(\n            initial_learning_rate=init_lr, decay_schedule_fn=lr_schedule, warmup_steps=num_warmup_steps,\n        )\n\n    optimizer = AdamWeightDecay(\n        learning_rate=lr_schedule,\n        weight_decay_rate=0.01,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-6,\n        exclude_from_weight_decay=[\"LayerNorm\", \"layer_norm\", \"bias\"],\n    )\n\n    return optimizer\n\n\nclass AdamWeightDecay(tf.keras.optimizers.Adam):\n    \"\"\"Adam enables L2 weight decay and clip_by_global_norm on gradients.\n  Just adding the square of the weights to the loss function is *not* the\n  correct way of using L2 
regularization/weight decay with Adam, since that will\n  interact with the m and v parameters in strange ways.\n  Instead we want ot decay the weights in a manner that doesn't interact with\n  the m/v parameters. This is equivalent to adding the square of the weights to\n  the loss with plain (non-momentum) SGD.\n  \"\"\"\n\n    def __init__(\n        self,\n        learning_rate=0.001,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-7,\n        amsgrad=False,\n        weight_decay_rate=0.0,\n        include_in_weight_decay=None,\n        exclude_from_weight_decay=None,\n        name=\"AdamWeightDecay\",\n        **kwargs\n    ):\n        super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)\n        self.weight_decay_rate = weight_decay_rate\n        self._include_in_weight_decay = include_in_weight_decay\n        self._exclude_from_weight_decay = exclude_from_weight_decay\n\n    @classmethod\n    def from_config(cls, config):\n        \"\"\"Creates an optimizer from its config with WarmUp custom object.\"\"\"\n        custom_objects = {\"WarmUp\": WarmUp}\n        return super(AdamWeightDecay, cls).from_config(config, custom_objects=custom_objects)\n\n    def _prepare_local(self, var_device, var_dtype, apply_state):\n        super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, apply_state)\n        apply_state[(var_device, var_dtype)][\"weight_decay_rate\"] = tf.constant(\n            self.weight_decay_rate, name=\"adam_weight_decay_rate\"\n        )\n\n    def _decay_weights_op(self, var, learning_rate, apply_state):\n        do_decay = self._do_use_weight_decay(var.name)\n        if do_decay:\n            return var.assign_sub(\n                learning_rate * var * apply_state[(var.device, var.dtype.base_dtype)][\"weight_decay_rate\"],\n                use_locking=self._use_locking,\n            )\n        return tf.no_op()\n\n    def apply_gradients(self, grads_and_vars, name=None):\n        grads, tvars = list(zip(*grads_and_vars))\n        return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars), name=name,)\n\n    def _get_lr(self, var_device, var_dtype, apply_state):\n        \"\"\"Retrieves the learning rate with the given state.\"\"\"\n        if apply_state is None:\n            return self._decayed_lr_t[var_dtype], {}\n\n        apply_state = apply_state or {}\n        coefficients = apply_state.get((var_device, var_dtype))\n        if coefficients is None:\n            coefficients = self._fallback_apply_state(var_device, var_dtype)\n            apply_state[(var_device, var_dtype)] = coefficients\n\n        return coefficients[\"lr_t\"], dict(apply_state=apply_state)\n\n    def _resource_apply_dense(self, grad, var, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_dense(grad, var, **kwargs)\n\n    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_sparse(grad, var, indices, **kwargs)\n\n    def get_config(self):\n        config = super().get_config()\n        config.update({\"weight_decay_rate\": 
self.weight_decay_rate})\n        return config\n\n    def _do_use_weight_decay(self, param_name):\n        \"\"\"Whether to use L2 weight decay for `param_name`.\"\"\"\n        if self.weight_decay_rate == 0:\n            return False\n\n        if self._include_in_weight_decay:\n            for r in self._include_in_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return True\n\n        if self._exclude_from_weight_decay:\n            for r in self._exclude_from_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return False\n        return True\n\n\n# Extracted from https://github.com/OpenNMT/OpenNMT-tf/blob/master/opennmt/optimizers/utils.py\nclass GradientAccumulator(object):\n    \"\"\"Gradient accumulation utility.\n  When used with a distribution strategy, the accumulator should be called in a\n  replica context. Gradients will be accumulated locally on each replica and\n  without synchronization. Users should then call ``.gradients``, scale the\n  gradients if required, and pass the result to ``apply_gradients``.\n  \"\"\"\n\n    # We use the ON_READ synchronization policy so that no synchronization is\n    # performed on assignment. To get the value, we call .value() which returns the\n    # value on the current replica without synchronization.\n\n    def __init__(self):\n        \"\"\"Initializes the accumulator.\"\"\"\n        self._gradients = []\n        self._accum_steps = None\n\n    @property\n    def step(self):\n        \"\"\"Number of accumulated steps.\"\"\"\n        if self._accum_steps is None:\n            self._accum_steps = tf.Variable(\n                tf.constant(0, dtype=tf.int64),\n                trainable=False,\n                synchronization=tf.VariableSynchronization.ON_READ,\n                aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n            )\n\n        return self._accum_steps.value()\n\n    @property\n    def gradients(self):\n        \"\"\"The accumulated gradients on the current replica.\"\"\"\n        if not self._gradients:\n            raise ValueError(\"The accumulator should be called first to initialize the gradients\")\n        return list(gradient.value() if gradient is not None else gradient for gradient in self._gradients)\n\n    def __call__(self, gradients):\n        \"\"\"Accumulates :obj:`gradients` on the current replica.\"\"\"\n        if not self._gradients:\n            _ = self.step  # Create the step variable.\n            self._gradients.extend(\n                [\n                    tf.Variable(\n                        tf.zeros_like(gradient),\n                        trainable=False,\n                        synchronization=tf.VariableSynchronization.ON_READ,\n                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n                    )\n                    if gradient is not None\n                    else gradient\n                    for gradient in gradients\n                ]\n            )\n        if len(gradients) != len(self._gradients):\n            raise ValueError(\"Expected %s gradients, but got %d\" % (len(self._gradients), len(gradients)))\n\n        for accum_gradient, gradient in zip(self._gradients, gradients):\n            if accum_gradient is not None and gradient is not None:\n                accum_gradient.assign_add(gradient)\n\n        self._accum_steps.assign_add(1)\n\n    def reset(self):\n        \"\"\"Resets the accumulated gradients on the current replica.\"\"\"\n        
if not self._gradients:\n            return\n        self._accum_steps.assign(0)\n        for gradient in self._gradients:\n            if gradient is not None:\n                gradient.assign(tf.zeros_like(gradient))\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/pipelines.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport csv\nimport json\nimport logging\nimport os\nimport pickle\nimport sys\nfrom abc import ABC, abstractmethod\nfrom contextlib import contextmanager\nfrom itertools import chain\nfrom os.path import abspath, exists\nfrom typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union\n\nimport numpy as np\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .data import SquadExample, squad_convert_examples_to_features\nfrom .file_utils import is_tf_available, is_torch_available\nfrom .modelcard import ModelCard\nfrom .tokenization_auto import AutoTokenizer\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nif is_tf_available():\n    import tensorflow as tf\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelForTokenClassification,\n        TFAutoModelWithLMHead,\n    )\n\nif is_torch_available():\n    import torch\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelForTokenClassification,\n        AutoModelWithLMHead,\n    )\n\nif TYPE_CHECKING:\n    from .modeling_utils import PreTrainedModel\n    from .modeling_tf_utils import TFPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_framework(model=None):\n    \"\"\" Select framework (TensorFlow/PyTorch) to use.\n        If both frameworks are installed and no specific model is provided, defaults to using PyTorch.\n    \"\"\"\n    if is_tf_available() and is_torch_available() and model is not None and not isinstance(model, str):\n        # Both framework are available but the user supplied a model class instance.\n        # Try to guess which framework to use from the model classname\n        framework = \"tf\" if model.__class__.__name__.startswith(\"TF\") else \"pt\"\n    elif not is_tf_available() and not is_torch_available():\n        raise RuntimeError(\n            \"At least one of TensorFlow 2.0 or PyTorch should be installed. 
\"\n            \"To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ \"\n            \"To install PyTorch, read the instructions at https://pytorch.org/.\"\n        )\n    else:\n        # framework = 'tf' if is_tf_available() else 'pt'\n        framework = \"pt\" if is_torch_available() else \"tf\"\n    return framework\n\n\nclass ArgumentHandler(ABC):\n    \"\"\"\n    Base interface for handling varargs for each Pipeline\n    \"\"\"\n\n    @abstractmethod\n    def __call__(self, *args, **kwargs):\n        raise NotImplementedError()\n\n\nclass DefaultArgumentHandler(ArgumentHandler):\n    \"\"\"\n    Default varargs argument parser handling parameters for each Pipeline\n    \"\"\"\n\n    @staticmethod\n    def handle_kwargs(kwargs: Dict) -> List:\n        if len(kwargs) == 1:\n            output = list(kwargs.values())\n        else:\n            output = list(chain(kwargs.values()))\n\n        return DefaultArgumentHandler.handle_args(output)\n\n    @staticmethod\n    def handle_args(args: Sequence[Any]) -> List[str]:\n\n        # Only one argument, let's do case by case\n        if len(args) == 1:\n            if isinstance(args[0], str):\n                return [args[0]]\n            elif not isinstance(args[0], list):\n                return list(args)\n            else:\n                return args[0]\n\n        # Multiple arguments (x1, x2, ...)\n        elif len(args) > 1:\n            if all([isinstance(arg, str) for arg in args]):\n                return list(args)\n\n            # If not instance of list, then it should instance of iterable\n            elif isinstance(args, Iterable):\n                return list(chain.from_iterable(chain(args)))\n            else:\n                raise ValueError(\n                    \"Invalid input type {}. 
Pipeline supports Union[str, Iterable[str]]\".format(type(args))\n                )\n        else:\n            return []\n\n    def __call__(self, *args, **kwargs):\n        if len(kwargs) > 0 and len(args) > 0:\n            raise ValueError(\"Pipeline cannot handle mixed args and kwargs\")\n\n        if len(kwargs) > 0:\n            return DefaultArgumentHandler.handle_kwargs(kwargs)\n        else:\n            return DefaultArgumentHandler.handle_args(args)\n\n\nclass PipelineDataFormat:\n    \"\"\"\n    Base class for all the pipeline supported data format both for reading and writing.\n    Supported data formats currently includes:\n     - JSON\n     - CSV\n     - stdin/stdout (pipe)\n\n    PipelineDataFormat also includes some utilities to work with multi-columns like mapping from datasets columns\n    to pipelines keyword arguments through the `dataset_kwarg_1=dataset_column_1` format.\n    \"\"\"\n\n    SUPPORTED_FORMATS = [\"json\", \"csv\", \"pipe\"]\n\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        self.output_path = output_path\n        self.input_path = input_path\n        self.column = column.split(\",\") if column is not None else [\"\"]\n        self.is_multi_columns = len(self.column) > 1\n\n        if self.is_multi_columns:\n            self.column = [tuple(c.split(\"=\")) if \"=\" in c else (c, c) for c in self.column]\n\n        if output_path is not None and not overwrite:\n            if exists(abspath(self.output_path)):\n                raise OSError(\"{} already exists on disk\".format(self.output_path))\n\n        if input_path is not None:\n            if not exists(abspath(self.input_path)):\n                raise OSError(\"{} doesnt exist on disk\".format(self.input_path))\n\n    @abstractmethod\n    def __iter__(self):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def save(self, data: dict):\n        \"\"\"\n        Save the provided data object with the representation for the current `DataFormat`.\n        :param data: data to store\n        :return:\n        \"\"\"\n        raise NotImplementedError()\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        \"\"\"\n        Save the provided data object as a pickle-formatted binary data on the disk.\n        :param data: data to store\n        :return: (str) Path where the data has been saved\n        \"\"\"\n        path, _ = os.path.splitext(self.output_path)\n        binary_path = os.path.extsep.join((path, \"pickle\"))\n\n        with open(binary_path, \"wb+\") as f_output:\n            pickle.dump(data, f_output)\n\n        return binary_path\n\n    @staticmethod\n    def from_str(\n        format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        if format == \"json\":\n            return JsonPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"csv\":\n            return CsvPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"pipe\":\n            return PipedPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        else:\n            raise KeyError(\"Unknown reader {} (Available reader are json/csv/pipe)\".format(format))\n\n\nclass CsvPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], 
overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n    def __iter__(self):\n        with open(self.input_path, \"r\") as f:\n            reader = csv.DictReader(f)\n            for row in reader:\n                if self.is_multi_columns:\n                    yield {k: row[c] for k, c in self.column}\n                else:\n                    yield row[self.column[0]]\n\n    def save(self, data: List[dict]):\n        with open(self.output_path, \"w\") as f:\n            if len(data) > 0:\n                writer = csv.DictWriter(f, list(data[0].keys()))\n                writer.writeheader()\n                writer.writerows(data)\n\n\nclass JsonPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n        with open(input_path, \"r\") as f:\n            self._entries = json.load(f)\n\n    def __iter__(self):\n        for entry in self._entries:\n            if self.is_multi_columns:\n                yield {k: entry[c] for k, c in self.column}\n            else:\n                yield entry[self.column[0]]\n\n    def save(self, data: dict):\n        with open(self.output_path, \"w\") as f:\n            json.dump(data, f)\n\n\nclass PipedPipelineDataFormat(PipelineDataFormat):\n    \"\"\"\n    Read data from piped input to the python process.\n    For multi columns data, columns should separated by \\t\n\n    If columns are provided, then the output will be a dictionary with {column_x: value_x}\n    \"\"\"\n\n    def __iter__(self):\n        for line in sys.stdin:\n            # Split for multi-columns\n            if \"\\t\" in line:\n\n                line = line.split(\"\\t\")\n                if self.column:\n                    # Dictionary to map arguments\n                    yield {kwargs: l for (kwargs, _), l in zip(self.column, line)}\n                else:\n                    yield tuple(line)\n\n            # No dictionary to map arguments\n            else:\n                yield line\n\n    def save(self, data: dict):\n        print(data)\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        if self.output_path is None:\n            raise KeyError(\n                \"When using piped input on pipeline outputting large object requires an output file path. \"\n                \"Please provide such output path through --output argument.\"\n            )\n\n        return super().save_binary(data)\n\n\nclass _ScikitCompat(ABC):\n    \"\"\"\n    Interface layer for the Scikit and Keras compatibility.\n    \"\"\"\n\n    @abstractmethod\n    def transform(self, X):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def predict(self, X):\n        raise NotImplementedError()\n\n\nclass Pipeline(_ScikitCompat):\n    \"\"\"\n    The Pipeline class is the class from which all pipelines inherit. Refer to this class for methods shared across\n    different pipelines.\n\n    Base class implementing pipelined operations.\n    Pipeline workflow is defined as a sequence of the following operations:\n        Input -> Tokenization -> Model Inference -> Post-Processing (Task dependent) -> Output\n\n    Pipeline supports running on CPU or GPU through the device argument. 
Users can specify\n    device argument as an integer, -1 meaning \"CPU\", >= 0 referring the CUDA device ordinal.\n\n    Some pipeline, like for instance FeatureExtractionPipeline ('feature-extraction') outputs large\n    tensor object as nested-lists. In order to avoid dumping such large structure as textual data we\n    provide the binary_output constructor argument. If set to True, the output will be stored in the\n    pickle format.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n        binary_output (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Flag indicating if the output the pipeline should happen in a binary format (i.e. 
pickle) or as raw text.\n\n    Return:\n        :obj:`List` or :obj:`Dict`:\n        Pipeline returns list or dictionary depending on:\n\n         - Whether the user supplied multiple samples\n         - Whether the pipeline exposes multiple fields in the output object\n    \"\"\"\n\n    default_input_names = None\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        task: str = \"\",\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n    ):\n\n        if framework is None:\n            framework = get_framework()\n\n        self.model = model\n        self.tokenizer = tokenizer\n        self.modelcard = modelcard\n        self.framework = framework\n        self.device = device if framework == \"tf\" else torch.device(\"cpu\" if device < 0 else \"cuda:{}\".format(device))\n        self.binary_output = binary_output\n        self._args_parser = args_parser or DefaultArgumentHandler()\n\n        # Special handling\n        if self.framework == \"pt\" and self.device.type == \"cuda\":\n            self.model = self.model.to(self.device)\n\n        # Update config with task specific parameters\n        task_specific_params = self.model.config.task_specific_params\n        if task_specific_params is not None and task in task_specific_params:\n            self.model.config.update(task_specific_params.get(task))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save the pipeline's model and tokenizer to the specified save_directory\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Provided path ({}) should be a directory\".format(save_directory))\n            return\n\n        self.model.save_pretrained(save_directory)\n        self.tokenizer.save_pretrained(save_directory)\n        if self.modelcard is not None:\n            self.modelcard.save_pretrained(save_directory)\n\n    def transform(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    def predict(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. 
This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    @contextmanager\n    def device_placement(self):\n        \"\"\"\n        Context Manager allowing tensor allocation on the user-specified device in framework agnostic way.\n        example:\n            # Explicitly ask for tensor allocation on CUDA device :0\n            nlp = pipeline(..., device=0)\n            with nlp.device_placement():\n                # Every framework specific tensor allocation will be done on the request device\n                output = nlp(...)\n        Returns:\n            Context manager\n        \"\"\"\n        if self.framework == \"tf\":\n            with tf.device(\"/CPU:0\" if self.device == -1 else \"/device:GPU:{}\".format(self.device)):\n                yield\n        else:\n            if self.device.type == \"cuda\":\n                torch.cuda.set_device(self.device)\n\n            yield\n\n    def ensure_tensor_on_device(self, **inputs):\n        \"\"\"\n        Ensure PyTorch tensors are on the specified device.\n        :param inputs:\n        :return:\n        \"\"\"\n        return {name: tensor.to(self.device) for name, tensor in inputs.items()}\n\n    def _parse_and_tokenize(self, *args, pad_to_max_length=True, add_special_tokens=True, **kwargs):\n        \"\"\"\n        Parse arguments and tokenize\n        \"\"\"\n        # Parse arguments\n        inputs = self._args_parser(*args, **kwargs)\n        inputs = self.tokenizer.batch_encode_plus(\n            inputs,\n            add_special_tokens=add_special_tokens,\n            return_tensors=self.framework,\n            pad_to_max_length=pad_to_max_length,\n        )\n\n        return inputs\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        return self._forward(inputs)\n\n    def _forward(self, inputs, return_tensors=False):\n        \"\"\"\n        Internal framework specific forward dispatching.\n        Args:\n            inputs: dict holding all the keyworded arguments for required by the model forward method.\n            return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array.\n        Returns:\n            Numpy array\n        \"\"\"\n        # Encode for forward\n        with self.device_placement():\n            if self.framework == \"tf\":\n                # TODO trace model\n                predictions = self.model(inputs.data, training=False)[0]\n            else:\n                with torch.no_grad():\n                    inputs = self.ensure_tensor_on_device(**inputs)\n                    predictions = self.model(**inputs)[0].cpu()\n\n        if return_tensors:\n            return predictions\n        else:\n            return predictions.numpy()\n\n\nclass FeatureExtractionPipeline(Pipeline):\n    \"\"\"\n    Feature extraction pipeline using Model head. This pipeline extracts the hidden states from the base transformer,\n    which can be used as features in downstream tasks.\n\n    This feature extraction pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"feature-extraction\", for extracting features of a sequence.\n\n    All models may be used for this pipeline. 
See a list of all models, including community-contributed models on\n    `huggingface.co/models <https://huggingface.co/models>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n    def __call__(self, *args, **kwargs):\n        return super().__call__(*args, **kwargs).tolist()\n\n\nclass TextGenerationPipeline(Pipeline):\n    \"\"\"\n    Language generation pipeline using any ModelWithLMHead head. This pipeline predicts the words that will follow a specified text prompt.\n\n    This language generation pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"text-generation\", for generating text from a specified prompt.\n\n    The models that this pipeline can use are models that have been trained with an autoregressive language modeling objective,\n    which includes the uni-directional models in the library (e.g. 
gpt2).\n    See the list of available community models on\n    `huggingface.co/models <https://huggingface.co/models?search=&filter=lm-head>`__.\n    \"\"\"\n\n    # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia\n    # in https://github.com/rusiaaman/XLNet-gen#methodology\n    # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e\n\n    PADDING_TEXT = \"\"\"In 1991, the remains of Russian Tsar Nicholas II and his family\n    (except for Alexei and Maria) are discovered.\n    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the\n    remainder of the story. 1883 Western Siberia,\n    a young Grigori Rasputin is asked by his father and a group of men to perform magic.\n    Rasputin has a vision and denounces one of the men as a horse thief. Although his\n    father initially slaps him for making such an accusation, Rasputin watches as the\n    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of\n    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,\n    with people, even a bishop, begging for his blessing. <eod> </s> <eos>\"\"\"\n\n    ALLOWED_MODELS = [\n        \"XLNetLMHeadModel\",\n        \"TransfoXLLMHeadModel\",\n        \"ReformerModelWithLMHead\",\n        \"GPT2LMHeadModel\",\n        \"OpenAIGPTLMHeadModel\",\n        \"CTRLLMHeadModel\",\n        \"TFXLNetLMHeadModel\",\n        \"TFTransfoXLLMHeadModel\",\n        \"TFGPT2LMHeadModel\",\n        \"TFOpenAIGPTLMHeadModel\",\n        \"TFCTRLLMHeadModel\",\n    ]\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        if self.model.__class__.__name__ not in self.ALLOWED_MODELS:\n            raise NotImplementedError(\n                \"Generation is currently not supported for {}. Please select a model from {} for generation.\".format(\n                    self.model.__class__.__name__, self.ALLOWED_MODELS\n                )\n            )\n\n        text_inputs = self._args_parser(*args)\n\n        results = []\n        for prompt_text in text_inputs:\n            # Manage correct placement of the tensors\n            with self.device_placement():\n                if self.model.__class__.__name__ in [\"XLNetLMHeadModel\", \"TransfoXLLMHeadModel\"]:\n                    inputs = self._parse_and_tokenize(\n                        self.PADDING_TEXT + prompt_text, pad_to_max_length=False, add_special_tokens=False\n                    )\n                else:\n                    inputs = self._parse_and_tokenize(prompt_text, pad_to_max_length=False, add_special_tokens=False)\n\n                # set input_ids to None to allow empty prompt\n                if inputs[\"input_ids\"].shape[-1] == 0:\n                    inputs[\"input_ids\"] = None\n                    inputs[\"attention_mask\"] = None\n\n                if self.framework == \"pt\" and inputs[\"input_ids\"] is not None:\n                    inputs = self.ensure_tensor_on_device(**inputs)\n\n                input_ids = inputs[\"input_ids\"]\n\n                # Ensure that batch size = 1 (batch generation not allowed for now)\n                assert (\n                    input_ids is None or input_ids.shape[0] == 1\n                ), \"Batch generation is currently not supported. 
See https://github.com/huggingface/transformers/issues/3021 for more information.\"\n\n                output_sequences = self.model.generate(input_ids=input_ids, **generate_kwargs)  # BS x SL\n\n            result = []\n            for generated_sequence in output_sequences:\n                generated_sequence = generated_sequence.numpy().tolist()\n                record = {}\n                if return_tensors:\n                    record[\"generated_token_ids\"] = generated_sequence\n                if return_text:\n                    # Decode text\n                    text = self.tokenizer.decode(\n                        generated_sequence,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n\n                    # Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used\n                    if input_ids is None:\n                        prompt_length = 0\n                    else:\n                        prompt_length = len(\n                            self.tokenizer.decode(\n                                input_ids[0],\n                                skip_special_tokens=True,\n                                clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                            )\n                        )\n\n                    record[\"generated_text\"] = prompt_text + text[prompt_length:]\n\n                result.append(record)\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n\n        return results\n\n\nclass TextClassificationPipeline(Pipeline):\n    \"\"\"\n    Text classification pipeline using ModelForSequenceClassification head. See the\n    `sequence classification usage <../usage.html#sequence-classification>`__ examples for more information.\n\n    This text classification pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"sentiment-analysis\", for classifying sequences according to positive or negative sentiments.\n\n    The models that this pipeline can use are models that have been fine-tuned on a sequence classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=text-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. 
If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        outputs = super().__call__(*args, **kwargs)\n        scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)\n        return [{\"label\": self.model.config.id2label[item.argmax()], \"score\": item.max().item()} for item in scores]\n\n\nclass FillMaskPipeline(Pipeline):\n    \"\"\"\n    Masked language modeling prediction pipeline using ModelWithLMHead head. See the\n    `masked language modeling usage <../usage.html#masked-language-modeling>`__ examples for more information.\n\n    This mask filling pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"fill-mask\", for predicting masked tokens in a sequence.\n\n    The models that this pipeline can use are models that have been trained with a masked language modeling objective,\n    which includes the bi-directional models in the library.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=lm-head>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        topk=5,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n        self.topk = topk\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        outputs = self._forward(inputs, return_tensors=True)\n\n        results = []\n        batch_size = outputs.shape[0] if self.framework == \"tf\" else outputs.size(0)\n\n        for i in range(batch_size):\n            input_ids = inputs[\"input_ids\"][i]\n            result = []\n\n            if self.framework == \"tf\":\n                masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy().item()\n                logits = outputs[i, masked_index, :]\n                probs = tf.nn.softmax(logits)\n                topk = tf.math.top_k(probs, k=self.topk)\n                values, predictions = topk.values.numpy(), topk.indices.numpy()\n            else:\n                masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()\n                logits = outputs[i, masked_index, :]\n                probs = logits.softmax(dim=0)\n                values, predictions = probs.topk(self.topk)\n\n            for v, p in zip(values.tolist(), predictions.tolist()):\n                tokens = input_ids.numpy()\n                tokens[masked_index] = p\n                # Filter padding out:\n                tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]\n                result.append({\"sequence\": self.tokenizer.decode(tokens), \"score\": v, \"token\": p})\n\n            # Append\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n        return results\n\n\nclass NerPipeline(Pipeline):\n    \"\"\"\n    Named Entity Recognition pipeline using ModelForTokenClassification head. See the\n    `named entity recognition usage <../usage.html#named-entity-recognition>`__ examples for more information.\n\n    This token recognition pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"ner\", for predicting the classes of tokens in a sequence: person, organisation, location or miscellaneous.\n\n    The models that this pipeline can use are models that have been fine-tuned on a token classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=token-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"sequences\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n        ignore_labels=[\"O\"],\n        task: str = \"\",\n        grouped_entities: bool = False,\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=binary_output,\n            task=task,\n        )\n\n        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)\n        self.ignore_labels = ignore_labels\n        self.grouped_entities = grouped_entities\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._args_parser(*args, **kwargs)\n        answers = []\n        for sentence in inputs:\n\n            # Manage correct placement of the tensors\n            with self.device_placement():\n\n                tokens = self.tokenizer.encode_plus(\n                    sentence,\n                    return_attention_mask=False,\n                    return_tensors=self.framework,\n                    max_length=self.tokenizer.max_len,\n                )\n\n                # Forward\n                if self.framework == \"tf\":\n                    entities = self.model(tokens.data)[0][0].numpy()\n                    input_ids = tokens[\"input_ids\"].numpy()[0]\n                else:\n                    with torch.no_grad():\n                        tokens = self.ensure_tensor_on_device(**tokens)\n                        entities = self.model(**tokens)[0][0].cpu().numpy()\n                        input_ids = tokens[\"input_ids\"].cpu().numpy()[0]\n\n            score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)\n            labels_idx = score.argmax(axis=-1)\n\n           
 entities = []\n            entity_groups = []\n            entity_group_disagg = []\n            # Filter to labels not in `self.ignore_labels`\n            filtered_labels_idx = [\n                (idx, label_idx)\n                for idx, label_idx in enumerate(labels_idx)\n                if self.model.config.id2label[label_idx] not in self.ignore_labels\n            ]\n\n            for idx, label_idx in filtered_labels_idx:\n\n                entity = {\n                    \"word\": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),\n                    \"score\": score[idx][label_idx].item(),\n                    \"entity\": self.model.config.id2label[label_idx],\n                    \"index\": idx,\n                }\n                last_idx, _ = filtered_labels_idx[-1]\n                if self.grouped_entities:\n                    if not entity_group_disagg:\n                        entity_group_disagg += [entity]\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                        continue\n\n                    # If the current entity is similar and adjacent to the previous entity, append it to the disaggregated entity group\n                    if (\n                        entity[\"entity\"] == entity_group_disagg[-1][\"entity\"]\n                        and entity[\"index\"] == entity_group_disagg[-1][\"index\"] + 1\n                    ):\n                        entity_group_disagg += [entity]\n                        # Group the entities at the last entity\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                    # If the current entity is different from the previous entity, aggregate the disaggregated entity group\n                    else:\n                        entity_groups += [self.group_entities(entity_group_disagg)]\n                        entity_group_disagg = [entity]\n\n                entities += [entity]\n\n            # Append\n            if self.grouped_entities:\n                answers += [entity_groups]\n            else:\n                answers += [entities]\n\n        if len(answers) == 1:\n            return answers[0]\n        return answers\n\n    def group_entities(self, entities):\n        \"\"\"\n        Returns grouped entities\n        \"\"\"\n        # Get the last entity in the entity group\n        entity = entities[-1][\"entity\"]\n        scores = np.mean([entity[\"score\"] for entity in entities])\n        tokens = [entity[\"word\"] for entity in entities]\n\n        entity_group = {\n            \"entity_group\": entity,\n            \"score\": np.mean(scores),\n            \"word\": self.tokenizer.convert_tokens_to_string(tokens),\n        }\n        return entity_group\n\n\nTokenClassificationPipeline = NerPipeline\n\n\nclass QuestionAnsweringArgumentHandler(ArgumentHandler):\n    \"\"\"\n    QuestionAnsweringPipeline requires the user to provide multiple arguments (i.e. 
question & context) to be mapped\n    to internal SquadExample / SquadFeature structures.\n\n    QuestionAnsweringArgumentHandler manages all the possible to create SquadExample from the command-line supplied\n    arguments.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        # Position args, handling is sensibly the same as X and data, so forwarding to avoid duplicating\n        if args is not None and len(args) > 0:\n            if len(args) == 1:\n                kwargs[\"X\"] = args[0]\n            else:\n                kwargs[\"X\"] = list(args)\n\n        # Generic compatibility with sklearn and Keras\n        # Batched data\n        if \"X\" in kwargs or \"data\" in kwargs:\n            inputs = kwargs[\"X\"] if \"X\" in kwargs else kwargs[\"data\"]\n\n            if isinstance(inputs, dict):\n                inputs = [inputs]\n            else:\n                # Copy to avoid overriding arguments\n                inputs = [i for i in inputs]\n\n            for i, item in enumerate(inputs):\n                if isinstance(item, dict):\n                    if any(k not in item for k in [\"question\", \"context\"]):\n                        raise KeyError(\"You need to provide a dictionary with keys {question:..., context:...}\")\n\n                    inputs[i] = QuestionAnsweringPipeline.create_sample(**item)\n\n                elif not isinstance(item, SquadExample):\n                    raise ValueError(\n                        \"{} argument needs to be of type (list[SquadExample | dict], SquadExample, dict)\".format(\n                            \"X\" if \"X\" in kwargs else \"data\"\n                        )\n                    )\n\n            # Tabular input\n        elif \"question\" in kwargs and \"context\" in kwargs:\n            if isinstance(kwargs[\"question\"], str):\n                kwargs[\"question\"] = [kwargs[\"question\"]]\n\n            if isinstance(kwargs[\"context\"], str):\n                kwargs[\"context\"] = [kwargs[\"context\"]]\n\n            inputs = [\n                QuestionAnsweringPipeline.create_sample(q, c) for q, c in zip(kwargs[\"question\"], kwargs[\"context\"])\n            ]\n        else:\n            raise ValueError(\"Unknown arguments {}\".format(kwargs))\n\n        if not isinstance(inputs, list):\n            inputs = [inputs]\n\n        return inputs\n\n\nclass QuestionAnsweringPipeline(Pipeline):\n    \"\"\"\n    Question Answering pipeline using ModelForQuestionAnswering head. See the\n    `question answering usage <../usage.html#question-answering>`__ examples for more information.\n\n    This question answering can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"question-answering\", for answering questions given a context.\n\n    The models that this pipeline can use are models that have been fine-tuned on a question answering task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=question-answering>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"question,context\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        device: int = -1,\n        task: str = \"\",\n        **kwargs\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=QuestionAnsweringArgumentHandler(),\n            device=device,\n            task=task,\n            **kwargs,\n        )\n\n    @staticmethod\n    def create_sample(\n        question: Union[str, List[str]], context: Union[str, List[str]]\n    ) -> Union[SquadExample, List[SquadExample]]:\n        \"\"\"\n        QuestionAnsweringPipeline leverages the SquadExample/SquadFeatures internally.\n        This helper method encapsulate all the logic for converting question(s) and context(s) to SquadExample(s).\n        We currently support extractive question answering.\n        Arguments:\n             question: (str, List[str]) The question to be ask for the associated context\n             context: (str, List[str]) The context in which we will look for the answer.\n\n        Returns:\n            SquadExample initialized with the corresponding question and context.\n        \"\"\"\n        if isinstance(question, list):\n            return [SquadExample(None, q, c, None, None, None) for q, c in zip(question, context)]\n        else:\n            return SquadExample(None, question, context, None, None, None)\n\n    def __call__(self, *args, **kwargs):\n        \"\"\"\n        Args:\n            We support multiple use-cases, the following are exclusive:\n            X: sequence of SquadExample\n            data: sequence of SquadExample\n            question: (str, List[str]), batch of question(s) to map along with context\n            context: (str, List[str]), batch of context(s) associated with the provided question keyword argument\n        
Returns:\n            dict: {'answer': str, 'score\": float, 'start\": int, \"end\": int}\n            answer: the textual answer in the intial context\n            score: the score the current answer scored for the model\n            start: the character index in the original string corresponding to the beginning of the answer' span\n            end: the character index in the original string corresponding to the ending of the answer' span\n        \"\"\"\n        # Set defaults values\n        kwargs.setdefault(\"topk\", 1)\n        kwargs.setdefault(\"doc_stride\", 128)\n        kwargs.setdefault(\"max_answer_len\", 15)\n        kwargs.setdefault(\"max_seq_len\", 384)\n        kwargs.setdefault(\"max_question_len\", 64)\n        kwargs.setdefault(\"handle_impossible_answer\", False)\n\n        if kwargs[\"topk\"] < 1:\n            raise ValueError(\"topk parameter should be >= 1 (got {})\".format(kwargs[\"topk\"]))\n\n        if kwargs[\"max_answer_len\"] < 1:\n            raise ValueError(\"max_answer_len parameter should be >= 1 (got {})\".format(kwargs[\"max_answer_len\"]))\n\n        # Convert inputs to features\n        examples = self._args_parser(*args, **kwargs)\n        features_list = [\n            squad_convert_examples_to_features(\n                [example],\n                self.tokenizer,\n                kwargs[\"max_seq_len\"],\n                kwargs[\"doc_stride\"],\n                kwargs[\"max_question_len\"],\n                False,\n                tqdm_enabled=False,\n            )\n            for example in examples\n        ]\n        all_answers = []\n        for features, example in zip(features_list, examples):\n            model_input_names = self.tokenizer.model_input_names + [\"input_ids\"]\n            fw_args = {k: [feature.__dict__[k] for feature in features] for k in model_input_names}\n\n            # Manage tensor allocation on correct device\n            with self.device_placement():\n                if self.framework == \"tf\":\n                    fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}\n                    start, end = self.model(fw_args)\n                    start, end = start.numpy(), end.numpy()\n                else:\n                    with torch.no_grad():\n                        # Retrieve the score for the context tokens only (removing question tokens)\n                        fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}\n                        start, end = self.model(**fw_args)\n                        start, end = start.cpu().numpy(), end.cpu().numpy()\n\n            min_null_score = 1000000  # large and positive\n            answers = []\n            for (feature, start_, end_) in zip(features, start, end):\n                # Normalize logits and spans to retrieve the answer\n                start_ = np.exp(start_) / np.sum(np.exp(start_))\n                end_ = np.exp(end_) / np.sum(np.exp(end_))\n\n                # Mask padding and question\n                start_, end_ = (\n                    start_ * np.abs(np.array(feature.p_mask) - 1),\n                    end_ * np.abs(np.array(feature.p_mask) - 1),\n                )\n\n                if kwargs[\"handle_impossible_answer\"]:\n                    min_null_score = min(min_null_score, (start_[0] * end_[0]).item())\n\n                start_[0] = end_[0] = 0\n\n                starts, ends, scores = self.decode(start_, end_, kwargs[\"topk\"], kwargs[\"max_answer_len\"])\n                char_to_word = 
np.array(example.char_to_word_offset)\n\n                # Convert the answer (tokens) back to the original text\n                answers += [\n                    {\n                        \"score\": score.item(),\n                        \"start\": np.where(char_to_word == feature.token_to_orig_map[s])[0][0].item(),\n                        \"end\": np.where(char_to_word == feature.token_to_orig_map[e])[0][-1].item(),\n                        \"answer\": \" \".join(\n                            example.doc_tokens[feature.token_to_orig_map[s] : feature.token_to_orig_map[e] + 1]\n                        ),\n                    }\n                    for s, e, score in zip(starts, ends, scores)\n                ]\n\n            if kwargs[\"handle_impossible_answer\"]:\n                answers.append({\"score\": min_null_score, \"start\": 0, \"end\": 0, \"answer\": \"\"})\n\n            answers = sorted(answers, key=lambda x: x[\"score\"], reverse=True)[: kwargs[\"topk\"]]\n            all_answers += answers\n\n        if len(all_answers) == 1:\n            return all_answers[0]\n        return all_answers\n\n    def decode(self, start: np.ndarray, end: np.ndarray, topk: int, max_answer_len: int) -> Tuple:\n        \"\"\"\n        Take the output of any QuestionAnswering head and will generate probalities for each span to be\n        the actual answer.\n        In addition, it filters out some unwanted/impossible cases like answer len being greater than\n        max_answer_len or answer end position being before the starting position.\n        The method supports output the k-best answer through the topk argument.\n\n        Args:\n            start: numpy array, holding individual start probabilities for each token\n            end: numpy array, holding individual end probabilities for each token\n            topk: int, indicates how many possible answer span(s) to extract from the model's output\n            max_answer_len: int, maximum size of the answer to extract from the model's output\n        \"\"\"\n        # Ensure we have batch axis\n        if start.ndim == 1:\n            start = start[None]\n\n        if end.ndim == 1:\n            end = end[None]\n\n        # Compute the score of each tuple(start, end) to be the real answer\n        outer = np.matmul(np.expand_dims(start, -1), np.expand_dims(end, 1))\n\n        # Remove candidate with end < start and end - start > max_answer_len\n        candidates = np.tril(np.triu(outer), max_answer_len - 1)\n\n        #  Inspired by Chen & al. 
(https://github.com/facebookresearch/DrQA)\n        scores_flat = candidates.flatten()\n        if topk == 1:\n            idx_sort = [np.argmax(scores_flat)]\n        elif len(scores_flat) < topk:\n            idx_sort = np.argsort(-scores_flat)\n        else:\n            idx = np.argpartition(-scores_flat, topk)[0:topk]\n            idx_sort = idx[np.argsort(-scores_flat[idx])]\n\n        start, end = np.unravel_index(idx_sort, candidates.shape)[1:]\n        return start, end, candidates[0, start, end]\n\n    def span_to_answer(self, text: str, start: int, end: int):\n        \"\"\"\n        When decoding from token probalities, this method maps token indexes to actual word in\n        the initial context.\n\n        Args:\n            text: str, the actual context to extract the answer from\n            start: int, starting answer token index\n            end: int, ending answer token index\n\n        Returns:\n            dict: {'answer': str, 'start': int, 'end': int}\n        \"\"\"\n        words = []\n        token_idx = char_start_idx = char_end_idx = chars_idx = 0\n\n        for i, word in enumerate(text.split(\" \")):\n            token = self.tokenizer.tokenize(word)\n\n            # Append words if they are in the span\n            if start <= token_idx <= end:\n                if token_idx == start:\n                    char_start_idx = chars_idx\n\n                if token_idx == end:\n                    char_end_idx = chars_idx + len(word)\n\n                words += [word]\n\n            # Stop if we went over the end of the answer\n            if token_idx > end:\n                break\n\n            # Append the subtokenization length to the running index\n            token_idx += len(token)\n            chars_idx += len(word) + 1\n\n        # Join text with spaces\n        return {\n            \"answer\": \" \".join(words),\n            \"start\": max(0, char_start_idx),\n            \"end\": min(len(text), char_end_idx),\n        }\n\n\nclass SummarizationPipeline(Pipeline):\n    \"\"\"\n    Summarize news articles and other documents\n\n    Usage::\n\n        # use bart in pytorch\n        summarizer = pipeline(\"summarization\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n        # use t5 in tf\n        summarizer = pipeline(\"summarization\", model=\"t5-base\", tokenizer=\"t5-base\", framework=\"tf\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n    The models that this pipeline can use are models that have been fine-tuned on a summarization task,\n    which is currently, '`bart-large-cnn`', '`t5-small`', '`t5-base`', '`t5-large`', '`t5-3b`', '`t5-11b`'.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=summarization>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *documents, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *documents: (list of strings) articles to be summarized\n            return_text: (bool, default=True) whether to add a decoded \"summary_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"summary_token_ids\" to each result\n\n            clean_up_tokenization_spaces: (`optional`) bool whether to include extra spaces in the output\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'summary_text' and/or 'summary_token_ids' for each document_to_summarize\n\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n        assert len(documents) > 0, \"Please provide a document to summarize\"\n\n        if self.framework == \"tf\" and \"BartForConditionalGeneration\" in self.model.__class__.__name__:\n            raise NotImplementedError(\n                \"Tensorflow is not yet supported for Bart. Please consider using T5, e.g. 
`t5-base`\"\n            )\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(documents[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n\n            documents = ([prefix + document for document in documents[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(documents[0], str):\n            documents = (prefix + documents[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `documents[0]`: {} have the wrong format. The should be either of type `str` or type `list`\".format(\n                    documents[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*documents, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            min_length = generate_kwargs.get(\"min_length\", self.model.config.min_length)\n            if input_length < min_length // 2:\n                logger.warning(\n                    \"Your min_length is set to {}, but you input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)\".format(\n                        min_length, input_length\n                    )\n                )\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length < max_length:\n                logger.warning(\n                    \"Your max_length is set to {}, but you input_length is only {}. You might consider decreasing max_length manually, e.g. 
summarizer('...', max_length=50)\".format(\n                        max_length, input_length\n                    )\n                )\n\n            summaries = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n\n            results = []\n            for summary in summaries:\n                record = {}\n                if return_tensors:\n                    record[\"summary_token_ids\"] = summary\n                if return_text:\n                    record[\"summary_text\"] = self.tokenizer.decode(\n                        summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\nclass TranslationPipeline(Pipeline):\n    \"\"\"\n    Translates from one language to another.\n\n    Usage::\n        en_fr_translator = pipeline(\"translation_en_to_fr\")\n        en_fr_translator(\"How old are you?\")\n\n    The models that this pipeline can use are models that have been fine-tuned on a translation task,\n    currently: \"t5-small\", \"t5-base\", \"t5-large\", \"t5-3b\", \"t5-11b\"\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=translation>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *args: (list of strings) texts to be translated\n            return_text: (bool, default=True) whether to add a decoded \"translation_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"translation_token_ids\" to each result\n\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'translation_text' and/or 'translation_token_ids' for each text_to_translate\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(args[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n            args = ([prefix + text for text in args[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(args[0], str):\n            args = (prefix + args[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `documents[0]`: {} have the wrong format. The should be either of type `str` or type `list`\".format(\n                    args[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*args, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length > 0.9 * max_length:\n                logger.warning(\n                    \"Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. 
translator('...', max_length=400)\".format(\n                        input_length, max_length\n                    )\n                )\n\n            translations = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n            results = []\n            for translation in translations:\n                record = {}\n                if return_tensors:\n                    record[\"translation_token_ids\"] = translation\n                if return_text:\n                    record[\"translation_text\"] = self.tokenizer.decode(\n                        translation,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\n# Register all the supported tasks here\nSUPPORTED_TASKS = {\n    \"feature-extraction\": {\n        \"impl\": FeatureExtractionPipeline,\n        \"tf\": TFAutoModel if is_tf_available() else None,\n        \"pt\": AutoModel if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased\", \"tf\": \"distilbert-base-cased\"},\n            \"config\": None,\n            \"tokenizer\": \"distilbert-base-cased\",\n        },\n    },\n    \"sentiment-analysis\": {\n        \"impl\": TextClassificationPipeline,\n        \"tf\": TFAutoModelForSequenceClassification if is_tf_available() else None,\n        \"pt\": AutoModelForSequenceClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n                \"tf\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            },\n            \"config\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            \"tokenizer\": \"distilbert-base-uncased\",\n        },\n    },\n    \"ner\": {\n        \"impl\": NerPipeline,\n        \"tf\": TFAutoModelForTokenClassification if is_tf_available() else None,\n        \"pt\": AutoModelForTokenClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n                \"tf\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            },\n            \"config\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            \"tokenizer\": \"bert-large-cased\",\n        },\n    },\n    \"question-answering\": {\n        \"impl\": QuestionAnsweringPipeline,\n        \"tf\": TFAutoModelForQuestionAnswering if is_tf_available() else None,\n        \"pt\": AutoModelForQuestionAnswering if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased-distilled-squad\", \"tf\": \"distilbert-base-cased-distilled-squad\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilbert-base-cased\", {\"use_fast\": False}),\n        },\n    },\n    \"fill-mask\": {\n        \"impl\": FillMaskPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilroberta-base\", \"tf\": \"distilroberta-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilroberta-base\", {\"use_fast\": False}),\n        },\n    
},\n    \"summarization\": {\n        \"impl\": SummarizationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"facebook/bart-large-cnn\", \"tf\": \"t5-small\"}, \"config\": None, \"tokenizer\": None},\n    },\n    \"translation_en_to_fr\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_de\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_ro\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"text-generation\": {\n        \"impl\": TextGenerationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"gpt2\", \"tf\": \"gpt2\"}, \"config\": None, \"tokenizer\": \"gpt2\"},\n    },\n}\n\n\ndef pipeline(\n    task: str,\n    model: Optional = None,\n    config: Optional[Union[str, PretrainedConfig]] = None,\n    tokenizer: Optional[Union[str, PreTrainedTokenizer]] = None,\n    framework: Optional[str] = None,\n    **kwargs\n) -> Pipeline:\n    \"\"\"\n    Utility factory method to build a pipeline.\n\n    Pipeline are made of:\n\n        - A Tokenizer instance in charge of mapping raw textual input to token\n        - A Model instance\n        - Some (optional) post processing for enhancing model's output\n\n\n    Args:\n        task (:obj:`str`):\n            The task defining which pipeline will be returned. Currently accepted tasks are:\n\n            - \"feature-extraction\": will return a :class:`~transformers1.FeatureExtractionPipeline`\n            - \"sentiment-analysis\": will return a :class:`~transformers1.TextClassificationPipeline`\n            - \"ner\": will return a :class:`~transformers1.NerPipeline`\n            - \"question-answering\": will return a :class:`~transformers1.QuestionAnsweringPipeline`\n            - \"fill-mask\": will return a :class:`~transformers1.FillMaskPipeline`\n            - \"summarization\": will return a :class:`~transformers1.SummarizationPipeline`\n            - \"translation_xx_to_yy\": will return a :class:`~transformers1.TranslationPipeline`\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`,\n            a model identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        config (:obj:`str` or :obj:`~transformers1.PretrainedConfig`, `optional`, defaults to :obj:`None`):\n            The configuration that will be used by the pipeline to instantiate the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained model configuration inheriting from\n            :class:`~transformers1.PretrainedConfig`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n\n    Returns:\n        :class:`~transformers.Pipeline`: Class inheriting from :class:`~transformers1.Pipeline`, according to\n        the task.\n\n    Examples::\n\n        from transformers1 import pipeline, AutoModelForTokenClassification, AutoTokenizer\n\n        # Sentiment analysis pipeline\n        pipeline('sentiment-analysis')\n\n        # Question answering pipeline, specifying the checkpoint identifier\n        pipeline('question-answering', model='distilbert-base-cased-distilled-squad', tokenizer='bert-base-cased')\n\n        # Named entity recognition pipeline, passing in a specific model and tokenizer\n        model = AutoModelForTokenClassification.from_pretrained(\"dbmdz/bert-large-cased-finetuned-conll03-english\")\n        tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n        pipeline('ner', model=model, tokenizer=tokenizer)\n    \"\"\"\n    # Retrieve the task\n    if task not in SUPPORTED_TASKS:\n        raise KeyError(\"Unknown task {}, available tasks are {}\".format(task, list(SUPPORTED_TASKS.keys())))\n\n    framework = framework or get_framework(model)\n\n    targeted_task = SUPPORTED_TASKS[task]\n    task_class, model_class = targeted_task[\"impl\"], targeted_task[framework]\n\n    # Use default model/config/tokenizer for the task if no model is provided\n    if model is None:\n        models, config, tokenizer = [targeted_task[\"default\"][k] for k in [\"model\", \"config\", \"tokenizer\"]]\n        model = models[framework]\n\n    # Try to infer tokenizer from model or config name (if provided as str)\n    if tokenizer is None:\n        if isinstance(model, str) and model in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = model\n        elif isinstance(config, str) and config in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = config\n        else:\n            # Impossible to guest what is the right tokenizer here\n            
raise Exception(\n                \"Impossible to guess which tokenizer to use. \"\n                \"Please provide a PretrainedTokenizer class or a path/identifier to a pretrained tokenizer.\"\n            )\n\n    modelcard = None\n    # Try to infer modelcard from model or config name (if provided as str)\n    if isinstance(model, str):\n        modelcard = model\n    elif isinstance(config, str):\n        modelcard = config\n\n    # Instantiate tokenizer if needed\n    if isinstance(tokenizer, (str, tuple)):\n        if isinstance(tokenizer, tuple):\n            # For tuple we have (tokenizer name, {kwargs})\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], **tokenizer[1])\n        else:\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer)\n\n    # Instantiate config if needed\n    if isinstance(config, str):\n        config = AutoConfig.from_pretrained(config)\n\n    # Instantiate modelcard if needed\n    if isinstance(modelcard, str):\n        modelcard = ModelCard.from_pretrained(modelcard)\n\n    # Instantiate model if needed\n    if isinstance(model, str):\n        # Handle transparent TF/PT model conversion\n        model_kwargs = {}\n        if framework == \"pt\" and model.endswith(\".h5\"):\n            model_kwargs[\"from_tf\"] = True\n            logger.warning(\n                \"Model might be a TensorFlow model (ending with `.h5`) but TensorFlow is not available. \"\n                \"Trying to load the model with PyTorch.\"\n            )\n        elif framework == \"tf\" and model.endswith(\".bin\"):\n            model_kwargs[\"from_pt\"] = True\n            logger.warning(\n                \"Model might be a PyTorch model (ending with `.bin`) but PyTorch is not available. \"\n                \"Trying to load the model with Tensorflow.\"\n            )\n        model = model_class.from_pretrained(model, config=config, **model_kwargs)\n\n    return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for ALBERT model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-spiece.model\",\n        \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-spiece.model\",\n        \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-spiece.model\",\n        \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-spiece.model\",\n        \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model\",\n        \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model\",\n        \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model\",\n        \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"albert-base-v1\": 512,\n    \"albert-large-v1\": 512,\n    \"albert-xlarge-v1\": 512,\n    \"albert-xxlarge-v1\": 512,\n    \"albert-base-v2\": 512,\n    \"albert-large-v2\": 512,\n    \"albert-xlarge-v2\": 512,\n    \"albert-xxlarge-v2\": 512,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n\nclass AlbertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an ALBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The beginning of sequence token that was used during pre-training. 
Can be used as a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"[CLS]\",\n        eos_token=\"[SEP]\",\n        unk_token=\"<unk>\",\n        sep_token=\"[SEP]\",\n        pad_token=\"<pad>\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. 
\"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An ALBERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return cls + token_ids_0 + sep\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An ALBERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Tokenizer class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_bart import BartTokenizer\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer\nfrom .tokenization_marian import MarianTokenizer\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import XLNetTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nTOKENIZER_MAPPING = OrderedDict(\n    [\n        (T5Config, (T5Tokenizer, None)),\n        (DistilBertConfig, (DistilBertTokenizer, DistilBertTokenizerFast)),\n        (AlbertConfig, (AlbertTokenizer, None)),\n        (CamembertConfig, (CamembertTokenizer, None)),\n        (XLMRobertaConfig, (XLMRobertaTokenizer, None)),\n        (MarianConfig, (MarianTokenizer, None)),\n        (BartConfig, (BartTokenizer, None)),\n        (LongformerConfig, (LongformerTokenizer, None)),\n        (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),\n        (ReformerConfig, (ReformerTokenizer, None)),\n        (ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),\n        (BertConfig, (BertTokenizer, BertTokenizerFast)),\n        (OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),\n        (GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),\n        (TransfoXLConfig, (TransfoXLTokenizer, TransfoXLTokenizerFast)),\n        (XLNetConfig, (XLNetTokenizer, None)),\n        (FlaubertConfig, (FlaubertTokenizer, None)),\n        (XLMConfig, (XLMTokenizer, 
None)),\n        (CTRLConfig, (CTRLTokenizer, None)),\n    ]\n)\n\n\nclass AutoTokenizer:\n    r\"\"\":class:`~transformers1.AutoTokenizer` is a generic tokenizer class\n        that will be instantiated as one of the tokenizer classes of the library\n        when created with the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct tokenizer class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        This class cannot be instantiated using `__init__()` (throw an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoTokenizer is designed to be instantiated \"\n            \"using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):\n        r\"\"\" Instantiate one of the tokenizer classes of the library\n        from a pre-trained model vocabulary.\n\n        The tokenizer class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert-base-japanese`: BertJapaneseTokenizer (Bert model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n      
          - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            use_fast: (`optional`) boolean, default False:\n                Indicate if transformers1 should try to load the fast version of the tokenizer (True) or use the Python one (False).\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. 
tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        if \"bert-base-japanese\" in pretrained_model_name_or_path:\n            return BertJapaneseTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        use_fast = kwargs.pop(\"use_fast\", False)\n        for config_class, (tokenizer_class_py, tokenizer_class_fast) in TOKENIZER_MAPPING.items():\n            if isinstance(config, config_class):\n                if tokenizer_class_fast and use_fast:\n                    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n                else:\n                    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} to build an AutoTokenizer.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, \", \".join(c.__name__ for c in TOKENIZER_MAPPING.keys())\n            )\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_bart_models = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n]\n\n\nclass BartTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = {m: 1024 for m in _all_bart_models}\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_bart_models},\n        \"merges_file\": {m: merges_url for m in _all_bart_models},\n    }\n\n\n_all_mbart_models = [\"facebook/mbart-large-en-ro\"]\nSPM_URL = \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model\"\n\n\nclass MBartTokenizer(XLMRobertaTokenizer):\n    vocab_files_names = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n    max_model_input_sizes = {m: 1024 for m in _all_mbart_models}\n    pretrained_vocab_files_map = {\"vocab_file\": {m: SPM_URL for m in _all_mbart_models}}\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import List, Optional\n\nfrom tokenizers import BertWordPieceTokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt\",\n        \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n        \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt\",\n        \"bert-base-german-cased\": \"https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt\",\n        \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt\",\n        \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt\",\n        \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/vocab.txt\",\n        
\"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"bert-base-uncased\": 512,\n    \"bert-large-uncased\": 512,\n    \"bert-base-cased\": 512,\n    \"bert-large-cased\": 512,\n    \"bert-base-multilingual-uncased\": 512,\n    \"bert-base-multilingual-cased\": 512,\n    \"bert-base-chinese\": 512,\n    \"bert-base-german-cased\": 512,\n    \"bert-large-uncased-whole-word-masking\": 512,\n    \"bert-large-cased-whole-word-masking\": 512,\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-base-cased-finetuned-mrpc\": 512,\n    \"bert-base-german-dbmdz-cased\": 512,\n    \"bert-base-german-dbmdz-uncased\": 512,\n    \"TurkuNLP/bert-base-finnish-cased-v1\": 512,\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": 512,\n    \"wietsedv/bert-base-dutch-cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"bert-base-uncased\": {\"do_lower_case\": True},\n    \"bert-large-uncased\": {\"do_lower_case\": True},\n    \"bert-base-cased\": {\"do_lower_case\": False},\n    \"bert-large-cased\": {\"do_lower_case\": False},\n    \"bert-base-multilingual-uncased\": {\"do_lower_case\": True},\n    \"bert-base-multilingual-cased\": {\"do_lower_case\": False},\n    \"bert-base-chinese\": {\"do_lower_case\": False},\n    \"bert-base-german-cased\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": False},\n    \"bert-base-cased-finetuned-mrpc\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-cased\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-uncased\": {\"do_lower_case\": True},\n    \"TurkuNLP/bert-base-finnish-cased-v1\": {\"do_lower_case\": False},\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": {\"do_lower_case\": True},\n    \"wietsedv/bert-base-dutch-cased\": {\"do_lower_case\": False},\n}\n\n\ndef load_vocab(vocab_file):\n    \"\"\"Loads a vocabulary file into a dictionary.\"\"\"\n    vocab = collections.OrderedDict()\n    with open(vocab_file, \"r\", encoding=\"utf-8\") as reader:\n        tokens = reader.readlines()\n    for index, token in enumerate(tokens):\n        token = token.rstrip(\"\\n\")\n        vocab[token] = index\n    return vocab\n\n\ndef whitespace_tokenize(text):\n    \"\"\"Runs basic whitespace cleaning and splitting on a piece of text.\"\"\"\n    text = text.strip()\n    if not text:\n        return []\n    tokens = text.split()\n    return tokens\n\n\nclass BertTokenizer(PreTrainedTokenizer):\n    r\"\"\"\n    Constructs a BERT tokenizer. Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to do basic tokenization before WordPiece.\n        never_split (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            List of tokens which will never be split during tokenization. Only has an effect when\n            :obj:`do_basic_tokenize=True`\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        do_basic_tokenize=True,\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        tokenize_chinese_chars=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n        self.do_basic_tokenize = do_basic_tokenize\n        if do_basic_tokenize:\n            self.basic_tokenizer = BasicTokenizer(\n                do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=tokenize_chinese_chars\n            )\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n\n    @property\n    def vocab_size(self):\n        return len(self.vocab)\n\n    def get_vocab(self):\n        return dict(self.vocab, **self.added_tokens_encoder)\n\n    def _tokenize(self, text):\n        split_tokens = []\n        if self.do_basic_tokenize:\n            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):\n                for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                    split_tokens.append(sub_token)\n        else:\n            split_tokens = self.wordpiece_tokenizer.tokenize(text)\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.vocab.get(token, self.vocab.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.ids_to_tokens.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\" ##\", \"\").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A BERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        index = 0\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as writer:\n            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: vocabulary indices are not consecutive.\"\n                        \" Please check that the vocabulary is not corrupted!\".format(vocab_file)\n                    )\n         
           index = token_index\n                writer.write(token + \"\\n\")\n                index += 1\n        return (vocab_file,)\n\n\nclass BasicTokenizer(object):\n    \"\"\"Runs basic tokenization (punctuation splitting, lower casing, etc.).\"\"\"\n\n    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True):\n        \"\"\" Constructs a BasicTokenizer.\n\n        Args:\n            **do_lower_case**: Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **tokenize_chinese_chars**: (`optional`) boolean (default True)\n                Whether to tokenize Chinese characters.\n                This should likely be deactivated for Japanese:\n                see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328\n        \"\"\"\n        if never_split is None:\n            never_split = []\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split\n        self.tokenize_chinese_chars = tokenize_chinese_chars\n\n    def tokenize(self, text, never_split=None):\n        \"\"\" Basic Tokenization of a piece of text.\n            Split on \"white spaces\" only, for sub-word tokenization, see WordPieceTokenizer.\n\n        Args:\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n        \"\"\"\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        text = self._clean_text(text)\n        # This was added on November 1st, 2018 for the multilingual and Chinese\n        # models. 
This is also applied to the English models now, but it doesn't\n        # matter since the English models were not trained on any Chinese data\n        # and generally don't have any Chinese data in them (there are Chinese\n        # characters in the vocabulary because Wikipedia does have some Chinese\n        # words in the English Wikipedia.).\n        if self.tokenize_chinese_chars:\n            text = self._tokenize_chinese_chars(text)\n        orig_tokens = whitespace_tokenize(text)\n        split_tokens = []\n        for token in orig_tokens:\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n                token = self._run_strip_accents(token)\n            split_tokens.extend(self._run_split_on_punc(token, never_split))\n\n        output_tokens = whitespace_tokenize(\" \".join(split_tokens))\n        return output_tokens\n\n    def _run_strip_accents(self, text):\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        text = unicodedata.normalize(\"NFD\", text)\n        output = []\n        for char in text:\n            cat = unicodedata.category(char)\n            if cat == \"Mn\":\n                continue\n            output.append(char)\n        return \"\".join(output)\n\n    def _run_split_on_punc(self, text, never_split=None):\n        \"\"\"Splits punctuation on a piece of text.\"\"\"\n        if never_split is not None and text in never_split:\n            return [text]\n        chars = list(text)\n        i = 0\n        start_new_word = True\n        output = []\n        while i < len(chars):\n            char = chars[i]\n            if _is_punctuation(char):\n                output.append([char])\n                start_new_word = True\n            else:\n                if start_new_word:\n                    output.append([])\n                start_new_word = False\n                output[-1].append(char)\n            i += 1\n\n        return [\"\".join(x) for x in output]\n\n    def _tokenize_chinese_chars(self, text):\n        \"\"\"Adds whitespace around any CJK character.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if self._is_chinese_char(cp):\n                output.append(\" \")\n                output.append(char)\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n    def _is_chinese_char(self, cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. 
Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if (\n            (cp >= 0x4E00 and cp <= 0x9FFF)\n            or (cp >= 0x3400 and cp <= 0x4DBF)  #\n            or (cp >= 0x20000 and cp <= 0x2A6DF)  #\n            or (cp >= 0x2A700 and cp <= 0x2B73F)  #\n            or (cp >= 0x2B740 and cp <= 0x2B81F)  #\n            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #\n            or (cp >= 0xF900 and cp <= 0xFAFF)\n            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #\n        ):  #\n            return True\n\n        return False\n\n    def _clean_text(self, text):\n        \"\"\"Performs invalid character removal and whitespace cleanup on text.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if cp == 0 or cp == 0xFFFD or _is_control(char):\n                continue\n            if _is_whitespace(char):\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n\nclass WordpieceTokenizer(object):\n    \"\"\"Runs WordPiece tokenization.\"\"\"\n\n    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.max_input_chars_per_word = max_input_chars_per_word\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into its word pieces.\n\n        This uses a greedy longest-match-first algorithm to perform tokenization\n        using the given vocabulary.\n\n        For example:\n          input = \"unaffable\"\n          output = [\"un\", \"##aff\", \"##able\"]\n\n        Args:\n          text: A single token or whitespace separated tokens. 
This should have\n            already been passed through `BasicTokenizer`.\n\n        Returns:\n          A list of wordpiece tokens.\n        \"\"\"\n\n        output_tokens = []\n        for token in whitespace_tokenize(text):\n            chars = list(token)\n            if len(chars) > self.max_input_chars_per_word:\n                output_tokens.append(self.unk_token)\n                continue\n\n            is_bad = False\n            start = 0\n            sub_tokens = []\n            while start < len(chars):\n                end = len(chars)\n                cur_substr = None\n                while start < end:\n                    substr = \"\".join(chars[start:end])\n                    if start > 0:\n                        substr = \"##\" + substr\n                    if substr in self.vocab:\n                        cur_substr = substr\n                        break\n                    end -= 1\n                if cur_substr is None:\n                    is_bad = True\n                    break\n                sub_tokens.append(cur_substr)\n                start = end\n\n            if is_bad:\n                output_tokens.append(self.unk_token)\n            else:\n                output_tokens.extend(sub_tokens)\n        return output_tokens\n\n\ndef _is_whitespace(char):\n    \"\"\"Checks whether `chars` is a whitespace character.\"\"\"\n    # \\t, \\n, and \\r are technically contorl characters but we treat them\n    # as whitespace since they are generally considered as such.\n    if char == \" \" or char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return True\n    cat = unicodedata.category(char)\n    if cat == \"Zs\":\n        return True\n    return False\n\n\ndef _is_control(char):\n    \"\"\"Checks whether `chars` is a control character.\"\"\"\n    # These are technically control characters but we count them as whitespace\n    # characters.\n    if char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return False\n    cat = unicodedata.category(char)\n    if cat.startswith(\"C\"):\n        return True\n    return False\n\n\ndef _is_punctuation(char):\n    \"\"\"Checks whether `chars` is a punctuation character.\"\"\"\n    cp = ord(char)\n    # We treat all non-letter/number ASCII as punctuation.\n    # Characters such as \"^\", \"$\", and \"`\" are not in the Unicode\n    # Punctuation class but we treat them as punctuation anyways, for\n    # consistency.\n    if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):\n        return True\n    cat = unicodedata.category(char)\n    if cat.startswith(\"P\"):\n        return True\n    return False\n\n\nclass BertTokenizerFast(PreTrainedTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" BERT tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Bert tokenization is Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n        clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to clean the text before tokenization by removing any control characters and\n            replacing all whitespaces by the classic one.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        clean_text=True,\n        tokenize_chinese_chars=True,\n        strip_accents=True,\n        wordpieces_prefix=\"##\",\n        **kwargs\n    ):\n        super().__init__(\n            BertWordPieceTokenizer(\n                vocab_file=vocab_file,\n                unk_token=unk_token,\n                sep_token=sep_token,\n                cls_token=cls_token,\n                clean_text=clean_text,\n                handle_chinese_chars=tokenize_chinese_chars,\n                strip_accents=strip_accents,\n                lowercase=do_lower_case,\n                wordpieces_prefix=wordpieces_prefix,\n            ),\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        self.do_lower_case = do_lower_case\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.cls_token_id] + token_ids_0 + 
[self.sep_token_id]\n\n        if token_ids_1:\n            output += token_ids_1 + [self.sep_token_id]\n\n        return output\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_bert_japanese.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import Optional\n\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer, load_vocab\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"cl-tohoku/bert-base-japanese\": 512,\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": 512,\n    \"cl-tohoku/bert-base-japanese-char\": 512,\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"cl-tohoku/bert-base-japanese\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-char\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n}\n\n\nclass BertJapaneseTokenizer(BertTokenizer):\n    \"\"\"BERT tokenizer for Japanese text\"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        do_word_tokenize=True,\n        do_subword_tokenize=True,\n        word_tokenizer_type=\"basic\",\n        subword_tokenizer_type=\"wordpiece\",\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        
mask_token=\"[MASK]\",\n        mecab_kwargs=None,\n        **kwargs\n    ):\n        \"\"\"Constructs a MecabBertTokenizer.\n\n        Args:\n            **vocab_file**: Path to a one-wordpiece-per-line vocabulary file.\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n                Only has an effect when do_basic_tokenize=True.\n            **do_word_tokenize**: (`optional`) boolean (default True)\n                Whether to do word tokenization.\n            **do_subword_tokenize**: (`optional`) boolean (default True)\n                Whether to do subword tokenization.\n            **word_tokenizer_type**: (`optional`) string (default \"basic\")\n                Type of word tokenizer.\n            **subword_tokenizer_type**: (`optional`) string (default \"wordpiece\")\n                Type of subword tokenizer.\n            **mecab_kwargs**: (`optional`) dict passed to `MecabTokenizer` constructor (default None)\n        \"\"\"\n        super(BertTokenizer, self).__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n        # ^^ We call the grandparent's init, not the parent's.\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n\n        self.do_word_tokenize = do_word_tokenize\n        if do_word_tokenize:\n            if word_tokenizer_type == \"basic\":\n                self.word_tokenizer = BasicTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=False\n                )\n            elif word_tokenizer_type == \"mecab\":\n                self.word_tokenizer = MecabTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})\n                )\n            else:\n                raise ValueError(\"Invalid word_tokenizer_type '{}' is specified.\".format(word_tokenizer_type))\n\n        self.do_subword_tokenize = do_subword_tokenize\n        if do_subword_tokenize:\n            if subword_tokenizer_type == \"wordpiece\":\n                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            elif subword_tokenizer_type == \"character\":\n                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            else:\n                raise ValueError(\"Invalid subword_tokenizer_type '{}' is specified.\".format(subword_tokenizer_type))\n\n    def _tokenize(self, text):\n        if self.do_word_tokenize:\n            tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)\n        else:\n            tokens = [text]\n\n        if self.do_subword_tokenize:\n            split_tokens = [sub_token for token in tokens for sub_token in self.subword_tokenizer.tokenize(token)]\n        else:\n            split_tokens = tokens\n\n        return split_tokens\n\n\nclass MecabTokenizer:\n    \"\"\"Runs basic tokenization with MeCab 
morphological parser.\"\"\"\n\n    def __init__(self, do_lower_case=False, never_split=None, normalize_text=True, mecab_option: Optional[str] = None):\n        \"\"\"Constructs a MecabTokenizer.\n\n        Args:\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n            **mecab_option**: (`optional`) string passed to `MeCab.Tagger` constructor (default \"\")\n        \"\"\"\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split if never_split is not None else []\n        self.normalize_text = normalize_text\n\n        import MeCab\n\n        self.mecab = MeCab.Tagger(mecab_option) if mecab_option is not None else MeCab.Tagger()\n\n    def tokenize(self, text, never_split=None, **kwargs):\n        \"\"\"Tokenizes a piece of text.\"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        tokens = []\n\n        mecab_output = self.mecab.parse(text)\n\n        cursor = 0\n        for line in mecab_output.split(\"\\n\"):\n            if line == \"EOS\":\n                break\n\n            token, _ = line.split(\"\\t\")\n            token_start = text.index(token, cursor)\n            token_end = token_start + len(token)\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n\n            tokens.append(token)\n            cursor = token_end\n\n        return tokens\n\n\nclass CharacterTokenizer(object):\n    \"\"\"Runs Character tokenziation.\"\"\"\n\n    def __init__(self, vocab, unk_token, normalize_text=True):\n        \"\"\"Constructs a CharacterTokenizer.\n\n        Args:\n            **vocab**:\n                Vocabulary object.\n            **unk_token**: str\n                A special symbol for out-of-vocabulary token.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n        \"\"\"\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.normalize_text = normalize_text\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into characters.\n\n        For example:\n            input = \"apple\"\n            output = [\"a\", \"p\", \"p\", \"l\", \"e\"]\n        Args:\n            text: A single token or whitespace separated tokens.\n                This should have already been passed through `BasicTokenizer`.\n        Returns:\n            A list of characters.\n        \"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        output_tokens = []\n        for i, char in enumerate(text):\n            if char not in self.vocab:\n                output_tokens.append(self.unk_token)\n                continue\n\n            output_tokens.append(char)\n\n        return output_tokens\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for Camembert model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nimport sentencepiece as spm\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"camembert-base\": None,\n}\n\nSHARED_MODEL_IDENTIFIERS = [\n    # Load with\n    # `tokenizer = AutoTokenizer.from_pretrained(\"username/pretrained_model\")`\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n]\n\n\nclass CamembertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). 
It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<s>NOTUSED\", \"</s>NOTUSED\"],\n        **kwargs\n    ):\n        super().__init__(\n            max_len=512,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n        # HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual\n        # sentencepiece vocabulary (this is the case for <s> and </s>\n        self.fairseq_tokens_to_ids = {\"<s>NOTUSED\": 0, \"<pad>\": 1, \"</s>NOTUSED\": 2, \"<unk>\": 3}\n        self.fairseq_offset = len(self.fairseq_tokens_to_ids)\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + len(self.fairseq_tokens_to_ids)\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A CamemBERT sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        CamemBERT, like RoBERTa, does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.fairseq_tokens_to_ids) + len(self.sp_model)\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        elif self.sp_model.PieceToId(token) == 0:\n            # Convert sentence piece unk token to fairseq unk token index\n            return self.unk_token_id\n        return self.fairseq_offset + self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Salesforce CTRL.\"\"\"\n\n\nimport json\nimport logging\nimport os\n\nimport regex as re\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json\"},\n    \"merges_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"ctrl\": 256,\n}\n\nCONTROL_CODES = {\n    \"Pregnancy\": 168629,\n    \"Christianity\": 7675,\n    \"Explain\": 106423,\n    \"Fitness\": 63440,\n    \"Saving\": 63163,\n    \"Ask\": 27171,\n    \"Ass\": 95985,\n    \"Joke\": 163509,\n    \"Questions\": 45622,\n    \"Thoughts\": 49605,\n    \"Retail\": 52342,\n    \"Feminism\": 164338,\n    \"Writing\": 11992,\n    \"Atheism\": 192263,\n    \"Netflix\": 48616,\n    \"Computing\": 39639,\n    \"Opinion\": 43213,\n    \"Alone\": 44967,\n    \"Funny\": 58917,\n    \"Gaming\": 40358,\n    \"Human\": 4088,\n    \"India\": 1331,\n    \"Joker\": 77138,\n    \"Diet\": 36206,\n    \"Legal\": 11859,\n    \"Norman\": 4939,\n    \"Tip\": 72689,\n    \"Weight\": 52343,\n    \"Movies\": 46273,\n    \"Running\": 23425,\n    \"Science\": 2090,\n    \"Horror\": 37793,\n    \"Confession\": 60572,\n    \"Finance\": 12250,\n    \"Politics\": 16360,\n    \"Scary\": 191985,\n    \"Support\": 12654,\n    \"Technologies\": 32516,\n    \"Teenage\": 66160,\n    \"Event\": 32769,\n    \"Learned\": 67460,\n    \"Notion\": 182770,\n    \"Wikipedia\": 37583,\n    \"Books\": 6665,\n    \"Extract\": 76050,\n    \"Confessions\": 102701,\n    \"Conspiracy\": 75932,\n    \"Links\": 63674,\n    \"Narcissus\": 150425,\n    \"Relationship\": 54766,\n    \"Relationships\": 134796,\n    \"Reviews\": 41671,\n    \"News\": 4256,\n    \"Translation\": 26820,\n    \"multilingual\": 128406,\n}\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n\n    pairs = set(pairs)\n    return pairs\n\n\nclass CTRLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs a CTRL tokenizer. Peculiarities:\n\n    - Byte-Pair-Encoding\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    control_codes = CONTROL_CODES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        word = tuple(list(word[:-1]) + [word[-1] + \"</w>\"])\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \"@@ \".join(word)\n        word = word[:-4]\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string.\n        \"\"\"\n        split_tokens = []\n\n        words = re.findall(r\"\\S+\\n?\", text)\n\n        for token in words:\n            split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\"@@ \", \"\").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    # def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):\n    #     filtered_tokens = ' '.join(self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens))\n    #     tokens_generated_so_far = re.sub('(@@ )', '', string=filtered_tokens)\n    #     tokens_generated_so_far = re.sub('(@@ ?$)', '', string=tokens_generated_so_far)\n    #     return ''.join(tokens_generated_so_far)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for DistilBERT.\"\"\"\n\n\nimport logging\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-vocab.txt\",\n        \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"distilbert-base-uncased\": 512,\n    \"distilbert-base-uncased-distilled-squad\": 512,\n    \"distilbert-base-cased\": 512,\n    \"distilbert-base-cased-distilled-squad\": 512,\n    \"distilbert-base-german-cased\": 512,\n    \"distilbert-base-multilingual-cased\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"distilbert-base-uncased\": {\"do_lower_case\": True},\n    \"distilbert-base-uncased-distilled-squad\": {\"do_lower_case\": True},\n    \"distilbert-base-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-cased-distilled-squad\": {\"do_lower_case\": False},\n    \"distilbert-base-german-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-multilingual-cased\": {\"do_lower_case\": False},\n}\n\n\nclass DistilBertTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs a  DistilBertTokenizer.\n\n    :class:`~transformers1.DistilBertTokenizer is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n\n\nclass DistilBertTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a  \"Fast\" DistilBertTokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.DistilBertTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: 
punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_electra.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Google AI Team, Stanford University and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/vocab.txt\",\n        \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/vocab.txt\",\n        \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/vocab.txt\",\n        \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/vocab.txt\",\n        \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/vocab.txt\",\n        \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/electra-small-generator\": 512,\n    \"google/electra-base-generator\": 512,\n    \"google/electra-large-generator\": 512,\n    \"google/electra-small-discriminator\": 512,\n    \"google/electra-base-discriminator\": 512,\n    \"google/electra-large-discriminator\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"google/electra-small-generator\": {\"do_lower_case\": True},\n    \"google/electra-base-generator\": {\"do_lower_case\": True},\n    \"google/electra-large-generator\": {\"do_lower_case\": True},\n    \"google/electra-small-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-base-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-large-discriminator\": {\"do_lower_case\": True},\n}\n\n\nclass ElectraTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs an Electra tokenizer.\n    :class:`~transformers1.ElectraTokenizer` is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n\n\nclass ElectraTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" Electra Fast tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.ElectraTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    
Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Flaubert, based on XLM.\"\"\"\n\n\nimport logging\nimport unicodedata\n\nimport six\n\nfrom .tokenization_xlm import XLMTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/vocab.json\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/vocab.json\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/vocab.json\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/vocab.json\",\n    },\n    \"merges_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/merges.txt\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/merges.txt\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/merges.txt\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"flaubert/flaubert_small_cased\": 512,\n    \"flaubert/flaubert_base_uncased\": 512,\n    \"flaubert/flaubert_base_cased\": 512,\n    \"flaubert/flaubert_large_cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"flaubert/flaubert_small_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_base_uncased\": {\"do_lowercase\": True},\n    \"flaubert/flaubert_base_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_large_cased\": {\"do_lowercase\": False},\n}\n\n\ndef convert_to_unicode(text):\n    \"\"\"\n    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.\n    \"\"\"\n    # six_ensure_text is copied from https://github.com/benjaminp/six\n    def six_ensure_text(s, encoding=\"utf-8\", errors=\"strict\"):\n        if isinstance(s, six.binary_type):\n            return s.decode(encoding, errors)\n        elif isinstance(s, six.text_type):\n            return s\n        else:\n            raise TypeError(\"not expecting type '%s'\" % type(s))\n\n    return six_ensure_text(text, encoding=\"utf-8\", errors=\"ignore\")\n\n\nclass FlaubertTokenizer(XLMTokenizer):\n    \"\"\"\n    BPE tokenizer for Flaubert\n\n    - Moses preprocessing & tokenization\n    - Normalize all inputs text\n    - argument ``special_tokens`` and function 
``set_special_tokens``, can be used to add additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `do_lowercase` controle lower casing (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.XLMTokenizer`. Please check the superclass for usage examples\n    and documentation regarding arguments.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, do_lowercase=False, **kwargs):\n        super().__init__(**kwargs)\n        self.do_lowercase = do_lowercase\n        self.do_lowercase_and_remove_accent = False\n\n    def preprocess_text(self, text):\n        text = text.replace(\"``\", '\"').replace(\"''\", '\"')\n        text = convert_to_unicode(text)\n        text = unicodedata.normalize(\"NFC\", text)\n\n        if self.do_lowercase:\n            text = text.lower()\n\n        return text\n\n    def _tokenize(self, text, bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code using Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n\n        Args:\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        lang = \"fr\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n\n        if bypass_tokenizer:\n            text = text.split()\n        else:\n            text = self.preprocess_text(text)\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.moses_tokenize(text, lang=lang)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nfrom functools import lru_cache\n\nimport regex as re\nfrom tokenizers import ByteLevelBPETokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json\",\n    },\n    \"merges_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"gpt2\": 1024,\n    \"gpt2-medium\": 1024,\n    \"gpt2-large\": 1024,\n    \"gpt2-xl\": 1024,\n    \"distilgpt2\": 1024,\n}\n\n\n@lru_cache()\ndef bytes_to_unicode():\n    \"\"\"\n    Returns list of utf-8 byte and a mapping to unicode strings.\n    We specifically avoids mapping to whitespace/control characters the bpe code barfs on.\n\n    The reversible bpe codes work on unicode strings.\n    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.\n    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.\n    This is a signficant percentage of your normal, say, 32K bpe vocab.\n    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.\n    \"\"\"\n    bs = (\n        list(range(ord(\"!\"), ord(\"~\") + 1)) + list(range(ord(\"¡\"), ord(\"¬\") + 1)) + list(range(ord(\"®\"), ord(\"ÿ\") + 1))\n    )\n    cs = bs[:]\n    n = 0\n    for b in range(2 ** 8):\n        if b not in bs:\n            bs.append(b)\n            cs.append(2 ** 8 + n)\n            n += 1\n    cs = [chr(n) for n in cs]\n    return dict(zip(bs, cs))\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    
\"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\nclass GPT2Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n    GPT-2 BPE tokenizer. Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        **kwargs\n    ):\n        super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        self.errors = errors  # how to handle errors in decoding\n        self.byte_encoder = bytes_to_unicode()\n        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            bpe_merges = merges_handle.read().split(\"\\n\")[1:-1]\n        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]\n        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))\n        self.cache = {}\n\n        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions\n        self.pat = re.compile(r\"\"\"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+\"\"\")\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        pairs = get_pairs(word)\n\n        if not 
pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. \"\"\"\n        bpe_tokens = []\n        for token in re.findall(self.pat, text):\n            token = \"\".join(\n                self.byte_encoder[b] for b in token.encode(\"utf-8\")\n            )  # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)\n            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(\" \"))\n        return bpe_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        text = \"\".join(tokens)\n        text = bytearray([self.byte_decoder[c] for c in text]).decode(\"utf-8\", errors=self.errors)\n        return text\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        if \"add_prefix_space\" in kwargs and kwargs[\"add_prefix_space\"]:\n            return \" \" + text\n        return text\n\n\nclass GPT2TokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        add_prefix_space=False,\n        trim_offsets=True,\n        **kwargs\n    ):\n        super().__init__(\n            ByteLevelBPETokenizer(\n                vocab_file=vocab_file,\n                merges_file=merges_file,\n                add_prefix_space=add_prefix_space,\n                trim_offsets=trim_offsets,\n            ),\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_longformer_models = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n]\n\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"allenai/longformer-base-4096\": 4096,\n    \"allenai/longformer-large-4096\": 4096,\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": 4096,\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": 4096,\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": 4096,\n}\n\n\nclass LongformerTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n\n\nclass LongformerTokenizerFast(RobertaTokenizerFast):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_marian.py",
    "content": "import json\nimport re\nimport warnings\nfrom pathlib import Path\nfrom shutil import copyfile\nfrom typing import Dict, List, Optional, Tuple, Union\n\nimport sentencepiece\n\nfrom .file_utils import S3_BUCKET_PREFIX\nfrom .tokenization_utils import BatchEncoding, PreTrainedTokenizer\n\n\nvocab_files_names = {\n    \"source_spm\": \"source.spm\",\n    \"target_spm\": \"target.spm\",\n    \"vocab\": \"vocab.json\",\n    \"tokenizer_config_file\": \"tokenizer_config.json\",\n}\nMODEL_NAMES = (\"opus-mt-en-de\",)  # TODO(SS): delete this, the only required constant is vocab_files_names\nPRETRAINED_VOCAB_FILES_MAP = {\n    k: {m: f\"{S3_BUCKET_PREFIX}/Helsinki-NLP/{m}/{fname}\" for m in MODEL_NAMES}\n    for k, fname in vocab_files_names.items()\n}\n# Example URL https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/vocab.json\n\n\nclass MarianTokenizer(PreTrainedTokenizer):\n    \"\"\"Sentencepiece tokenizer for marian. Source and target languages have different SPM models.\n    The logic is use the relevant source_spm or target_spm to encode txt as pieces, then look up each piece in a vocab dictionary.\n\n    Examples::\n\n        from transformers1 import MarianTokenizer\n        tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')\n        src_texts = [ \"I am a small frog.\", \"Tom asked his teacher for advice.\"]\n        tgt_texts = [\"Ich bin ein kleiner Frosch.\", \"Tom bat seinen Lehrer um Rat.\"]  # optional\n        batch_enc: BatchEncoding = tok.prepare_translation_batch(src_texts, tgt_texts=tgt_texts)\n        # keys  [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask].\n        # model(**batch) should work\n    \"\"\"\n\n    vocab_files_names = vocab_files_names\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = {m: 512 for m in MODEL_NAMES}\n    model_input_names = [\"attention_mask\"]  # actually attention_mask, decoder_attention_mask\n    language_code_re = re.compile(\">>.+<<\")  # type: re.Pattern\n\n    def __init__(\n        self,\n        vocab=None,\n        source_spm=None,\n        target_spm=None,\n        source_lang=None,\n        target_lang=None,\n        unk_token=\"<unk>\",\n        eos_token=\"</s>\",\n        pad_token=\"<pad>\",\n        max_len=512,\n        **kwargs,\n    ):\n\n        super().__init__(\n            # bos_token=bos_token,  unused. 
Start decoding with config.decoder_start_token_id\n            max_len=max_len,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            **kwargs,\n        )\n        self.encoder = load_json(vocab)\n        if self.unk_token not in self.encoder:\n            raise KeyError(\"<unk> token must be in vocab\")\n        assert self.pad_token in self.encoder\n        self.decoder = {v: k for k, v in self.encoder.items()}\n\n        self.source_lang = source_lang\n        self.target_lang = target_lang\n        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(\">>\") and k.endswith(\"<<\")]\n        self.spm_files = [source_spm, target_spm]\n\n        # load SentencePiece model for pre-processing\n        self.spm_source = load_spm(source_spm)\n        self.spm_target = load_spm(target_spm)\n        self.current_spm = self.spm_source\n\n        # Multilingual target side: default to using first supported language code.\n\n        self._setup_normalizer()\n\n    def _setup_normalizer(self):\n        try:\n            from mosestokenizer import MosesPunctuationNormalizer\n\n            self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)\n        except ImportError:\n            warnings.warn(\"Recommended: pip install mosestokenizer\")\n            self.punc_normalizer = lambda x: x\n\n    def normalize(self, x: str) -> str:\n        \"\"\"Cover moses empty string edge case. They return empty list for '' input!\"\"\"\n        return self.punc_normalizer(x) if x else \"\"\n\n    def _convert_token_to_id(self, token):\n        return self.encoder.get(token, self.encoder[self.unk_token])\n\n    def remove_language_code(self, text: str):\n        \"\"\"Remove language codes like <<fr>> before sentencepiece\"\"\"\n        match = self.language_code_re.match(text)\n        code: list = [match.group(0)] if match else []\n        return code, self.language_code_re.sub(\"\", text)\n\n    def _tokenize(self, text: str) -> List[str]:\n        code, text = self.remove_language_code(text)\n        pieces = self.current_spm.EncodeAsPieces(text)\n        return code + pieces\n\n    def _convert_id_to_token(self, index: int) -> str:\n        \"\"\"Converts an index (integer) in a token (str) using the encoder.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\"Uses target language sentencepiece model\"\"\"\n        return self.spm_target.DecodePieces(tokens)\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:\n        \"\"\"Build model inputs from a sequence by appending eos_token_id.\"\"\"\n        if token_ids_1 is None:\n            return token_ids_0 + [self.eos_token_id]\n        # We don't expect to process pairs, but leave the pair logic for API consistency\n        return token_ids_0 + token_ids_1 + [self.eos_token_id]\n\n    def prepare_translation_batch(\n        self,\n        src_texts: List[str],\n        tgt_texts: Optional[List[str]] = None,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = True,\n        return_tensors: str = \"pt\",\n    ) -> BatchEncoding:\n        \"\"\"Prepare model inputs for translation. 
For best performance, translate one sentence at a time.\n        Arguments:\n            src_texts: list of src language texts\n            tgt_texts: list of tgt language texts\n            max_length: (None) defer to config (1024 for mbart-large-en-ro)\n            pad_to_max_length: (bool)\n            return_tensors: (str) default \"pt\" returns pytorch tensors, pass None to return lists.\n\n        Returns:\n            BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask]\n            all shaped bs, seq_len. (BatchEncoding is a dict of string -> tensor or lists).\n            If no tgt_text is specified, the only keys will be input_ids and attention_mask.\n        \"\"\"\n        if \"\" in src_texts:\n            raise ValueError(f\"found empty string in src_texts: {src_texts}\")\n        self.current_spm = self.spm_source\n        src_texts = [self.normalize(t) for t in src_texts]  # this does not appear to do much\n        model_inputs: BatchEncoding = self.batch_encode_plus(\n            src_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        if tgt_texts is None:\n            return model_inputs\n\n        self.current_spm = self.spm_target\n        decoder_inputs: BatchEncoding = self.batch_encode_plus(\n            tgt_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        for k, v in decoder_inputs.items():\n            model_inputs[f\"decoder_{k}\"] = v\n        self.current_spm = self.spm_source\n        return model_inputs\n\n    @property\n    def vocab_size(self) -> int:\n        return len(self.encoder)\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        \"\"\"save vocab file to json and copy spm files from their original path.\"\"\"\n        save_dir = Path(save_directory)\n        assert save_dir.is_dir(), f\"{save_directory} should be a directory\"\n        save_json(self.encoder, save_dir / self.vocab_files_names[\"vocab\"])\n\n        for f in self.spm_files:\n            dest_path = save_dir / Path(f).name\n            if not dest_path.exists():\n                copyfile(f, save_dir / Path(f).name)\n        return tuple(save_dir / f for f in self.vocab_files_names)\n\n    def get_vocab(self) -> Dict:\n        vocab = self.encoder.copy()\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self) -> Dict:\n        state = self.__dict__.copy()\n        state.update({k: None for k in [\"spm_source\", \"spm_target\", \"current_spm\", \"punc_normalizer\"]})\n        return state\n\n    def __setstate__(self, d: Dict) -> None:\n        self.__dict__ = d\n        self.spm_source, self.spm_target = (load_spm(f) for f in self.spm_files)\n        self.current_spm = self.spm_source\n        self._setup_normalizer()\n\n    def num_special_tokens_to_add(self, **unused):\n        \"\"\"Just EOS\"\"\"\n        return 1\n\n    def _special_token_mask(self, seq):\n        all_special_ids = set(self.all_special_ids)  # call it once instead of inside list comp\n        all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special\n        return [1 if x in all_special_ids else 0 for x in seq]\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: 
Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"Get list where entries are [1] if a token is [eos] or [pad] else 0.\"\"\"\n        if already_has_special_tokens:\n            return self._special_token_mask(token_ids_0)\n        elif token_ids_1 is None:\n            return self._special_token_mask(token_ids_0) + [1]\n        else:\n            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]\n\n\ndef load_spm(path: str) -> sentencepiece.SentencePieceProcessor:\n    spm = sentencepiece.SentencePieceProcessor()\n    spm.Load(path)\n    return spm\n\n\ndef save_json(data, path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(data, f, indent=2)\n\n\ndef load_json(path: str) -> Union[Dict, List]:\n    with open(path, \"r\") as f:\n        return json.load(f)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\n\nfrom tokenizers import CharBPETokenizer\n\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json\"},\n    \"merges_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"openai-gpt\": 512,\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef text_standardize(text):\n    \"\"\"\n    fixes some issues the spacy tokenizer had on books corpus\n    also does some whitespace standardization\n    \"\"\"\n    text = text.replace(\"—\", \"-\")\n    text = text.replace(\"–\", \"-\")\n    text = text.replace(\"―\", \"-\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"´\", \"'\")\n    text = re.sub(r\"\"\"(-+|~+|!+|\"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)\"\"\", r\" \\1 \", text)\n    text = re.sub(r\"\\s*\\n\\s*\", \" \\n \", text)\n    text = re.sub(r\"[^\\S\\n]+\", \" \", text)\n    return text.strip()\n\n\nclass OpenAIGPTTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer. Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        try:\n            import ftfy\n            from spacy.lang.en import English\n\n            _nlp = English()\n            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)\n            self.fix_text = ftfy.fix_text\n        except ImportError:\n            logger.warning(\"ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.\")\n            self.nlp = BasicTokenizer(do_lower_case=True)\n            self.fix_text = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. 
\"\"\"\n        split_tokens = []\n        if self.fix_text is None:\n            # Using BERT's BasicTokenizer\n            text = self.nlp.tokenize(text)\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        else:\n            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)\n            text = self.nlp(text_standardize(self.fix_text(text)))\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token.text.lower()).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n\nclass OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" BPE tokenizer for OpenAI GPT (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        kwargs.setdefault(\"unk_token\", unk_token)\n        super().__init__(\n            CharBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token, lowercase=True),\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model Reformer.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/spiece.model\"\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/reformer-crime-and-punishment\": 524288,\n}\n\n\nclass ReformerTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an Reformer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        additional_special_tokens=[],\n        **kwargs\n    ):\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size()\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for RoBERTa.\"\"\"\n\n\nimport logging\nfrom typing import List, Optional\n\nfrom tokenizers import AddedToken\nfrom tokenizers.processors import RobertaProcessing\n\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-vocab.json\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n    },\n    \"merges_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-merges.txt\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"roberta-base\": 512,\n    \"roberta-large\": 512,\n    \"roberta-large-mnli\": 512,\n    \"distilroberta-base\": 512,\n    \"roberta-base-openai-detector\": 512,\n    \"roberta-large-openai-detector\": 512,\n}\n\n\nclass RobertaTokenizer(GPT2Tokenizer):\n    \"\"\"\n    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. 
Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            errors=errors,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A RoBERTa sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    def prepare_for_tokenization(self, text, add_special_tokens=False, **kwargs):\n        if \"add_prefix_space\" in kwargs:\n            add_prefix_space = kwargs[\"add_prefix_space\"]\n        else:\n            add_prefix_space = add_special_tokens\n        if add_prefix_space and not text[0].isspace():\n            text = \" \" + text\n        return text\n\n\nclass RobertaTokenizerFast(GPT2TokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        add_prefix_space=True,\n        trim_offsets=True,\n        **kwargs\n    ):\n        kwargs.setdefault(\"pad_token\", pad_token)\n        kwargs.setdefault(\"sep_token\", sep_token)\n        kwargs.setdefault(\"cls_token\", cls_token)\n        kwargs.setdefault(\"mask_token\", mask_token)\n\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            unk_token=unk_token,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n            **kwargs,\n        )\n\n        self.backend_tokenizer._tokenizer.post_processor = RobertaProcessing(\n            sep=(sep_token, self.sep_token_id),\n            cls=(cls_token, self.cls_token_id),\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n        )\n\n        self.backend_tokenizer.add_special_tokens([kwargs[\"mask_token\"]])\n\n    @PreTrainedTokenizer.mask_token.setter\n    def mask_token(self, value):\n        if not isinstance(value, AddedToken):\n            value = AddedToken(value, lstrip=True)\n\n        self._mask_token = str(value)\n        self._maybe_update_backend([value])\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]\n        if token_ids_1 is None:\n            return output\n\n        return output + [self.eos_token_id] + token_ids_1 + [self.eos_token_id]\n\n    def 
create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model T5.\"\"\"\n\n\nimport logging\nimport os\nimport re\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"t5-small\": 512,\n    \"t5-base\": 512,\n    \"t5-large\": 512,\n    \"t5-3b\": 512,\n    \"t5-11b\": 512,\n}\n\n\nclass T5Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            extra_ids (:obj:`List[str]`, `optional`, defaults to :obj:`100`):\n                Add a number of extra ids added to the end of the vocabulary for use as sentinels.\n                These tokens are accessible as \"<extra_id_{%d}>\" where \"{%d}\" is a number between 0 and extra_ids-1.\n                Extra tokens are indexed from the end of the vocabulary up to beginnning (\"<extra_id_0>\" is the last token in the vocabulary like in T5 preprocessing\n                see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        extra_ids=100,\n        additional_special_tokens=None,\n        **kwargs\n    ):\n        # Add extra_ids to the special token list\n        if extra_ids > 0:\n            if additional_special_tokens is None:\n                additional_special_tokens = []\n            additional_special_tokens.extend([\"<extra_id_{}>\".format(i) for i in range(extra_ids)])\n\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self._extra_ids = extra_ids\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size() + self._extra_ids\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) 
for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if token.startswith(\"<extra_id_\"):\n            match = re.match(r\"<extra_id_(\\d+)>\", token)\n            num = int(match.group(1))\n            return self.vocab_size - num - 1\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        else:\n            token = \"<extra_id_{}>\".format(self.vocab_size - 1 - index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport glob\nimport logging\nimport os\nimport pickle\nimport re\nfrom collections import Counter, OrderedDict\nfrom typing import Optional\n\nimport numpy as np\nfrom tokenizers import Tokenizer\nfrom tokenizers.implementations import BaseTokenizer\nfrom tokenizers.models import WordLevel\nfrom tokenizers.normalizers import Lowercase, Sequence, Strip, unicode_normalizer_from_str\nfrom tokenizers.pre_tokenizers import CharDelimiterSplit, WhitespaceSplit\nfrom tokenizers.processors import BertProcessing\n\nfrom .file_utils import cached_path, is_torch_available\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nif is_torch_available():\n    import torch\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"pretrained_vocab_file\": \"vocab.bin\", \"vocab_file\": \"vocab.txt\"}\nVOCAB_FILES_NAMES_FAST = {\"pretrained_vocab_file\": \"vocab.json\", \"vocab_file\": \"vocab.json\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin\",\n    }\n}\n\nPRETRAINED_VOCAB_FILES_MAP_FAST = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.json\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"transfo-xl-wt103\": None,\n}\n\nPRETRAINED_CORPUS_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin\",\n}\nCORPUS_NAME = \"corpus.bin\"\n\n\nclass TransfoXLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token, eos_token=eos_token, additional_special_tokens=additional_special_tokens, **kwargs\n        )\n\n        if never_split is None:\n            never_split = self.all_special_tokens\n        if special is None:\n            special = []\n        self.counter = Counter()\n        self.special = special\n        self.min_freq = min_freq\n        self.max_size = max_size\n        self.lower_case = lower_case\n        self.delimiter = delimiter\n        self.vocab_file = vocab_file\n        self.never_split = never_split\n        self.punctuation_symbols = '!\"#$%&()*+,-./\\:;<=>?@[\\\\]^_`{|}~'  # noqa: W605\n        self.punction_without_space_before_pattern = re.compile(r\"[^\\s][{}]\".format(self.punctuation_symbols))\n        self.punctuation_with_space_around_pattern = self._compile_space_around_punctuation_pattern()\n\n        try:\n            if pretrained_vocab_file is not None:\n                # Hack because, honestly this tokenizer was not made to be used\n                # in a library like ours, at all.\n                vocab_dict = torch.load(pretrained_vocab_file)\n                for key, value in vocab_dict.items():\n                    if key not in self.__dict__:\n                        self.__dict__[key] = value\n\n            if vocab_file is not None:\n                self.build_vocab()\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizerFast,\"\n                \"please note they are not compatible.\".format(pretrained_vocab_file)\n            )\n\n        if vocab_file is not None:\n            self.build_vocab()\n\n    def _compile_space_around_punctuation_pattern(self):\n        look_ahead_for_special_token = \"(?=[{}])\".format(self.punctuation_symbols)\n        look_ahead_to_match_all_except_space = \"(?=[^\\s])\"  # noqa: W605\n        return re.compile(r\"\" + look_ahead_for_special_token + look_ahead_to_match_all_except_space)\n\n    def count_file(self, path, verbose=False, add_eos=False):\n        if verbose:\n            logger.info(\"counting file {} ...\".format(path))\n        assert os.path.exists(path)\n\n        sents = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos)\n                self.counter.update(symbols)\n                sents.append(symbols)\n\n        return sents\n\n    def count_sents(self, sents, verbose=False):\n        \"\"\"\n            sents : a list of sentences, each a list of tokenized symbols\n        \"\"\"\n        if verbose:\n            logger.info(\"counting {} sents ...\".format(len(sents)))\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            self.counter.update(symbols)\n\n    def _build_from_file(self, vocab_file):\n        self.idx2sym = []\n        self.sym2idx = OrderedDict()\n\n        with open(vocab_file, \"r\", encoding=\"utf-8\") as f:\n            for line in f:\n                symb = line.strip().split()[0]\n                self.add_symbol(symb)\n        if \"<UNK>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<UNK>\"]\n        elif \"<unk>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<unk>\"]\n        else:\n            raise ValueError(\"No <unkown> token in vocabulary\")\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n\n        logger.warning(\n            \"Please note you will not be able to load the save vocabulary in\"\n            \" Rust-based TransfoXLTokenizerFast as they don't share the same structure.\"\n        )\n\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"pretrained_vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        torch.save(self.__dict__, vocab_file)\n        return (vocab_file,)\n\n    def build_vocab(self):\n        if self.vocab_file:\n            logger.info(\"building vocab from {}\".format(self.vocab_file))\n            self._build_from_file(self.vocab_file)\n            logger.info(\"final vocab size {}\".format(len(self)))\n        else:\n            logger.info(\"building vocab with min_freq={}, max_size={}\".format(self.min_freq, self.max_size))\n            self.idx2sym = []\n            self.sym2idx = OrderedDict()\n\n            for sym in self.special:\n                
self.add_special(sym)\n\n            for sym, cnt in self.counter.most_common(self.max_size):\n                if cnt < self.min_freq:\n                    break\n                self.add_symbol(sym)\n\n            logger.info(\"final vocab size {} from {} unique tokens\".format(len(self), len(self.counter)))\n\n    def encode_file(self, path, ordered=False, verbose=False, add_eos=True, add_double_eos=False):\n        if verbose:\n            logger.info(\"encoding file {} ...\".format(path))\n        assert os.path.exists(path)\n        encoded = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos, add_double_eos=add_double_eos)\n                encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def encode_sents(self, sents, ordered=False, verbose=False):\n        if verbose:\n            logger.info(\"encoding {} sents ...\".format(len(sents)))\n        encoded = []\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def add_special(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n            setattr(self, \"{}_idx\".format(sym.strip(\"<>\")), self.sym2idx[sym])\n\n    def add_symbol(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n\n    def _convert_id_to_token(self, idx):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        assert 0 <= idx < len(self), \"Index {} out of vocabulary range\".format(idx)\n        return self.idx2sym[idx]\n\n    def _convert_token_to_id(self, sym):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if sym in self.sym2idx:\n            return self.sym2idx[sym]\n        else:\n            # logger.info('encounter unk {}'.format(sym))\n            # assert '<eos>' not in sym\n            if hasattr(self, \"unk_idx\"):\n                return self.sym2idx.get(sym, self.unk_idx)\n            # Backward compatibility with pre-trained models\n            elif \"<unk>\" in self.sym2idx:\n                return self.sym2idx[\"<unk>\"]\n            elif \"<UNK>\" in self.sym2idx:\n                return self.sym2idx[\"<UNK>\"]\n            else:\n                raise ValueError(\"Token not in vocabulary and no <unk> token in vocabulary for replacement\")\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = \" \".join(tokens).strip()\n        return out_string\n\n    def convert_to_tensor(self, symbols):\n        return torch.LongTensor(self.convert_tokens_to_ids(symbols))\n\n    @property\n    def vocab_size(self):\n        return len(self.idx2sym)\n\n    def get_vocab(self):\n        return dict(self.sym2idx, **self.added_tokens_encoder)\n\n    def _tokenize(self, line, add_eos=False, add_double_eos=False):\n        line = line.strip()\n        # convert to lower case\n        if self.lower_case:\n            line = line.lower()\n\n        # empty delimiter '' will evaluate False\n        if self.delimiter == \"\":\n            symbols = line\n        else:\n            symbols = line.split(self.delimiter)\n\n        if add_double_eos:  # lm1b\n            return [\"<S>\"] + symbols + [\"<S>\"]\n        elif add_eos:\n            return symbols + [\"<eos>\"]\n        else:\n            return symbols\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        # add spaces before punctuation symbols as should be done in transfo-xl\n\n        if \"add_space_before_punct_symbol\" in kwargs and kwargs[\"add_space_before_punct_symbol\"]:\n            text = self.punctuation_with_space_around_pattern.sub(r\" \", text)\n        elif self.punction_without_space_before_pattern.search(text):\n            # searches until the first occurence of a punctuation symbol without surrounding spaces\n            logger.warning(\n                \"You might want to consider setting `add_space_before_punct_symbol=True` as an argument to the `tokenizer.encode()` to avoid tokenizing words with punctuation symbols to the `<unk>` token\"\n            )\n\n        return text\n\n\nclass _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):\n    def __init__(\n        self,\n        vocab_file,\n        delimiter,\n        lowercase,\n        unk_token,\n        eos_token,\n        add_eos=False,\n        add_double_eos=False,\n        normalization: Optional[str] = None,\n    ):\n\n        try:\n            tokenizer = WordLevel(vocab_file, unk_token=unk_token)\n            tokenizer = Tokenizer(tokenizer)\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizer,\"\n                \"please note they are not compatible.\".format(vocab_file)\n            )\n\n        # Create the correct normalization path\n        normalizer = []\n\n        # Include unicode normalization\n        if normalization:\n            normalizer += [unicode_normalizer_from_str(normalization)]\n\n        # Include case normalization\n        if lowercase:\n            normalizer += [Lowercase()]\n\n        # Strip normalizer at the end\n        normalizer += [Strip(left=True, right=True)]\n\n        if len(normalizer) > 0:\n            tokenizer.normalizer = Sequence(normalizer) if len(normalizer) > 1 else normalizer[0]\n\n        # Setup the splitter\n        tokenizer.pre_tokenizer = CharDelimiterSplit(delimiter) if delimiter else WhitespaceSplit()\n\n        if add_double_eos:\n            tokenizer.post_processor = BertProcessing(\n                (eos_token, tokenizer.token_to_id(eos_token)), (eos_token, tokenizer.token_to_id(eos_token))\n            )\n\n        parameters = {\n            \"model\": \"TransfoXLModel\",\n            \"add_eos\": add_eos,\n            \"add_double_eos\": add_double_eos,\n            \"unk_token\": unk_token,\n            \"eos_token\": eos_token,\n            \"delimiter\": delimiter,\n            \"lowercase\": lowercase,\n        }\n\n        super().__init__(tokenizer, parameters)\n\n\nclass TransfoXLTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" Transformer-XL tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    The Transformer-XL tokenizer is a word-level tokenizer (no sub-word tokenization).\n\n    Adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES_FAST\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP_FAST\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        add_eos=False,\n        add_double_eos=False,\n        normalization=None,\n        **kwargs\n    ):\n\n        super().__init__(\n            _TransfoXLDelimiterLookupTokenizer(\n                vocab_file=vocab_file or pretrained_vocab_file,\n                delimiter=delimiter,\n                lowercase=lower_case,\n                unk_token=unk_token,\n                eos_token=eos_token,\n                add_eos=add_eos,\n                add_double_eos=add_double_eos,\n                normalization=normalization,\n            ),\n            unk_token=unk_token,\n            eos_token=eos_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n    def save_pretrained(self, save_directory):\n        logger.warning(\n            \"Please note you will not be able to load the vocabulary in\"\n            \" Python-based TransfoXLTokenizer as they don't share the same structure.\"\n        )\n\n        return super().save_pretrained(save_directory)\n\n\nclass LMOrderedIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None):\n        \"\"\"\n            data -- LongTensor -- the LongTensor is strictly ordered\n        \"\"\"\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n\n        # Work out how cleanly we can divide the dataset into bsz parts.\n        self.n_step = data.size(0) // bsz\n\n        # Trim off any extra elements that wouldn't cleanly fit (remainders).\n        data = data.narrow(0, 0, self.n_step * bsz)\n\n        # Evenly divide the data across the bsz batches.\n        self.data = data.view(bsz, -1).t().contiguous().to(device)\n\n        # Number of mini-batches\n        self.n_batch = (self.n_step + self.bptt - 1) // self.bptt\n\n    def get_batch(self, i, bptt=None):\n        if bptt is None:\n            bptt = self.bptt\n        seq_len = min(bptt, self.data.size(0) - 1 - i)\n\n        end_idx = i + seq_len\n        beg_idx = max(0, i - self.ext_len)\n\n        data = self.data[beg_idx:end_idx]\n        target = self.data[i + 1 : i + 1 + seq_len]\n\n        data_out = data.transpose(0, 1).contiguous().to(self.device)\n        target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n        return data_out, target_out, seq_len\n\n    def get_fixlen_iter(self, start=0):\n        for i in range(start, self.data.size(0) - 1, self.bptt):\n            yield self.get_batch(i)\n\n    def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):\n        max_len = self.bptt + max_deviation * std\n        i = start\n        while True:\n            bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.0\n            bptt = min(max_len, max(min_len, int(np.random.normal(bptt, std))))\n            data, 
target, seq_len = self.get_batch(i, bptt)\n            i += seq_len\n            yield data, target, seq_len\n            if i >= self.data.size(0) - 2:\n                break\n\n    def __iter__(self):\n        return self.get_fixlen_iter()\n\n\nclass LMShuffledIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n        \"\"\"\n            data -- list[LongTensor] -- there is no order among the LongTensors\n        \"\"\"\n        self.data = data\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self):\n        # index iterator\n        epoch_indices = np.random.permutation(len(self.data)) if self.shuffle else np.array(range(len(self.data)))\n\n        # sentence iterator\n        for idx in epoch_indices:\n            yield self.data[idx]\n\n    def stream_iterator(self, sent_stream):\n        # streams for each data in the batch\n        streams = [None] * self.bsz\n\n        data = torch.LongTensor(self.bptt, self.bsz)\n        target = torch.LongTensor(self.bptt, self.bsz)\n\n        n_retain = 0\n\n        while True:\n            # data   : [n_retain+bptt x bsz]\n            # target : [bptt x bsz]\n            data[n_retain:].fill_(-1)\n            target.fill_(-1)\n\n            valid_batch = True\n\n            for i in range(self.bsz):\n                n_filled = 0\n                try:\n                    while n_filled < self.bptt:\n                        if streams[i] is None or len(streams[i]) <= 1:\n                            streams[i] = next(sent_stream)\n                        # number of new tokens to fill in\n                        n_new = min(len(streams[i]) - 1, self.bptt - n_filled)\n                        # first n_retain tokens are retained from last batch\n                        data[n_retain + n_filled : n_retain + n_filled + n_new, i] = streams[i][:n_new]\n                        target[n_filled : n_filled + n_new, i] = streams[i][1 : n_new + 1]\n                        streams[i] = streams[i][n_new:]\n                        n_filled += n_new\n                except StopIteration:\n                    valid_batch = False\n                    break\n\n            if not valid_batch:\n                return\n\n            data_out = data.transpose(0, 1).contiguous().to(self.device)\n            target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n            yield data_out, target_out, self.bptt\n\n            n_retain = min(data.size(0), self.ext_len)\n            if n_retain > 0:\n                data[:n_retain] = data[-n_retain:]\n            data.resize_(n_retain + self.bptt, data.size(1))\n\n    def __iter__(self):\n        # sent_stream is an iterator\n        sent_stream = self.get_sent_stream()\n\n        for batch in self.stream_iterator(sent_stream):\n            yield batch\n\n\nclass LMMultiFileIterator(LMShuffledIterator):\n    def __init__(self, paths, vocab, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n\n        self.paths = paths\n        self.vocab = vocab\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self, path):\n        sents = self.vocab.encode_file(path, add_double_eos=True)\n        if self.shuffle:\n            np.random.shuffle(sents)\n      
  sent_stream = iter(sents)\n\n        return sent_stream\n\n    def __iter__(self):\n        if self.shuffle:\n            np.random.shuffle(self.paths)\n\n        for path in self.paths:\n            # sent_stream is an iterator\n            sent_stream = self.get_sent_stream(path)\n            for batch in self.stream_iterator(sent_stream):\n                yield batch\n\n\nclass TransfoXLCorpus(object):\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):\n        \"\"\"\n        Instantiate a pre-processed corpus.\n        \"\"\"\n        vocab = TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n        if pretrained_model_name_or_path in PRETRAINED_CORPUS_ARCHIVE_MAP:\n            corpus_file = PRETRAINED_CORPUS_ARCHIVE_MAP[pretrained_model_name_or_path]\n        else:\n            corpus_file = os.path.join(pretrained_model_name_or_path, CORPUS_NAME)\n        # redirect to the cache, if necessary\n        try:\n            resolved_corpus_file = cached_path(corpus_file, cache_dir=cache_dir)\n        except EnvironmentError:\n            logger.error(\n                \"Corpus '{}' was not found in corpus list ({}). \"\n                \"We assumed '{}' was a path or url but couldn't find files {} \"\n                \"at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(PRETRAINED_CORPUS_ARCHIVE_MAP.keys()),\n                    pretrained_model_name_or_path,\n                    corpus_file,\n                )\n            )\n            return None\n        if resolved_corpus_file == corpus_file:\n            logger.info(\"loading corpus file {}\".format(corpus_file))\n        else:\n            logger.info(\"loading corpus file {} from cache at {}\".format(corpus_file, resolved_corpus_file))\n\n        # Instantiate tokenizer.\n        corpus = cls(*inputs, **kwargs)\n        corpus_dict = torch.load(resolved_corpus_file)\n        for key, value in corpus_dict.items():\n            corpus.__dict__[key] = value\n        corpus.vocab = vocab\n        if corpus.train is not None:\n            corpus.train = torch.tensor(corpus.train, dtype=torch.long)\n        if corpus.valid is not None:\n            corpus.valid = torch.tensor(corpus.valid, dtype=torch.long)\n        if corpus.test is not None:\n            corpus.test = torch.tensor(corpus.test, dtype=torch.long)\n        return corpus\n\n    def __init__(self, *args, **kwargs):\n        self.vocab = TransfoXLTokenizer(*args, **kwargs)\n        self.dataset = None\n        self.train = None\n        self.valid = None\n        self.test = None\n\n    def build_corpus(self, path, dataset):\n        self.dataset = dataset\n\n        if self.dataset in [\"ptb\", \"wt2\", \"enwik8\", \"text8\"]:\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n            self.vocab.count_file(os.path.join(path, \"valid.txt\"))\n            self.vocab.count_file(os.path.join(path, \"test.txt\"))\n        elif self.dataset == \"wt103\":\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n        elif self.dataset == \"lm1b\":\n            train_path_pattern = os.path.join(\n                path,\n                \"1-billion-word-language-modeling-benchmark-r13output\",\n                \"training-monolingual.tokenized.shuffled\",\n                \"news.en-*\",\n            )\n            train_paths = glob.glob(train_path_pattern)\n            # the 
vocab will load from file when build_vocab() is called\n\n        self.vocab.build_vocab()\n\n        if self.dataset in [\"ptb\", \"wt2\", \"wt103\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True)\n        elif self.dataset in [\"enwik8\", \"text8\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True, add_eos=False)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True, add_eos=False)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True, add_eos=False)\n        elif self.dataset == \"lm1b\":\n            self.train = train_paths\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=False, add_double_eos=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=False, add_double_eos=True)\n\n    def get_iterator(self, split, *args, **kwargs):\n        if split == \"train\":\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(self.train, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                kwargs[\"shuffle\"] = True\n                data_iter = LMMultiFileIterator(self.train, self.vocab, *args, **kwargs)\n        elif split in [\"valid\", \"test\"]:\n            data = self.valid if split == \"valid\" else self.test\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(data, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                data_iter = LMShuffledIterator(data, *args, **kwargs)\n\n        return data_iter\n\n\ndef get_lm_corpus(datadir, dataset):\n    fn = os.path.join(datadir, \"cache.pt\")\n    fn_pickle = os.path.join(datadir, \"cache.pkl\")\n    if os.path.exists(fn):\n        # cache.pt is the torch-serialized corpus saved at the end of this function\n        logger.info(\"Loading cached dataset...\")\n        corpus = torch.load(fn)\n    elif os.path.exists(fn_pickle):\n        # fall back to the plain-pickle cache if only cache.pkl is present\n        logger.info(\"Loading cached dataset from pickle...\")\n        with open(fn_pickle, \"rb\") as fp:\n            corpus = pickle.load(fp)\n    else:\n        logger.info(\"Producing dataset {}...\".format(dataset))\n        kwargs = {}\n        if dataset in [\"wt103\", \"wt2\"]:\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = False\n        elif dataset == \"ptb\":\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = True\n        elif dataset == \"lm1b\":\n            kwargs[\"special\"] = []\n            kwargs[\"lower_case\"] = False\n            kwargs[\"vocab_file\"] = os.path.join(datadir, \"1b_word_vocab.txt\")\n        elif dataset in [\"enwik8\", \"text8\"]:\n            pass\n\n        corpus = TransfoXLCorpus(datadir, dataset, **kwargs)\n        torch.save(corpus, fn)\n\n    return corpus\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_utils.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for python and fast tokenizers. Fast tokenizers are provided by HuggingFace's tokenizers library.\"\"\"\n\nimport copy\nimport functools\nimport itertools\nimport json\nimport logging\nimport operator\nimport os\nimport re\nimport warnings\nfrom collections import UserDict, defaultdict\nfrom contextlib import contextmanager\nfrom typing import Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union\n\nfrom tokenizers import AddedToken as AddedTokenFast\nfrom tokenizers import Encoding as EncodingFast\nfrom tokenizers.decoders import Decoder as DecoderFast\nfrom tokenizers.implementations import BaseTokenizer as BaseTokenizerFast\n\nfrom .file_utils import cached_path, hf_bucket_url, is_remote_url, is_tf_available, is_torch_available, torch_required\n\n\nif is_tf_available():\n    import tensorflow as tf\nif is_torch_available():\n    import torch\n\nlogger = logging.getLogger(__name__)\n\nSPECIAL_TOKENS_MAP_FILE = \"special_tokens_map.json\"\nADDED_TOKENS_FILE = \"added_tokens.json\"\nTOKENIZER_CONFIG_FILE = \"tokenizer_config.json\"\n\nVERY_LARGE_INTEGER = int(1e30)  # This is used to set the max input length for a model with infinite size input\nLARGE_INTEGER = int(1e20)  # This is used when we need something big but slightly smaller than VERY_LARGE_INTEGER\n\n# Define type aliases and NamedTuples\nTextInput = str\nPreTokenizedInput = List[str]\nEncodedInput = List[int]\nTextInputPair = Tuple[str, str]\nPreTokenizedInputPair = Tuple[List[str], List[str]]\nEncodedInputPair = Tuple[List[int], List[int]]\n\n\nclass CharSpan(NamedTuple):\n    \"\"\" Character span in the original string\n\n        Args:\n            start: index of the first character in the original string\n            end: index of the character following the last character in the original string\n    \"\"\"\n\n    start: int\n    end: int\n\n\nclass TokenSpan(NamedTuple):\n    \"\"\" Token span in an encoded string (list of tokens)\n\n        Args:\n            start: index of the first token in the span\n            end: index of the token following the last token in the span\n    \"\"\"\n\n    start: int\n    end: int\n\n\ndef flatten(x: Sequence):\n    \"\"\"\n    Flatten the provided (potentially nested) sequence\n\n    Args:\n        x (Sequence): Potentially nested sequence to flatten\n\n    Returns:\n        list: Flattened sequence\n    \"\"\"\n\n    return functools.reduce(operator.iconcat, x, [])\n\n\n@contextmanager\ndef truncate_and_pad(\n    tokenizer: BaseTokenizerFast,\n    max_length: int,\n    stride: int,\n    strategy: str,\n    pad_to_max_length: bool,\n    padding_side: str,\n    pad_token_id: int,\n    pad_token_type_id: int,\n    pad_token: str,\n):\n    \"\"\" This contextmanager is in charge of defining the truncation and the padding strategies for fast tokenizers\n        (provided by HuggingFace tokenizers library) and restore the 
tokenizer settings afterwards.\n\n        This contextmanager assumes the provider tokenizer has no padding / truncation strategy\n        before the managed section. If your tokenizer set a padding / truncation strategy before,\n        then it will be reset to no padding/truncation when exiting the managed section.\n\n        Args:\n            tokenizer (BaseTokenizerFast): The tokenizer which will be used\n            max_length (int): The maximum size of the sequence\n            stride (int): The stride to use when handling overflow\n            strategy (str): Overflowing logic to use\n            pad_to_max_length (bool): Boolean indicating if the output needs to be padded up to max_length\n            padding_side (str): \"left\" or \"right\" indicating the direction the output sequence will be padded\n            pad_token_id (int): The integer representation of the padding token to use\n            pad_token_type_id (int): The integer representation of the padding token type to use\n            pad_token (str): The string representation of the padding token to use\n\n    \"\"\"\n\n    # Handle all the truncation and padding stuff\n    if max_length is not None:\n        tokenizer.enable_truncation(max_length, stride=stride, strategy=strategy)\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.enable_padding(\n            max_length=max_length,\n            direction=padding_side,\n            pad_id=pad_token_id,\n            pad_type_id=pad_token_type_id,\n            pad_token=pad_token,\n        )\n    elif pad_to_max_length:\n        logger.warning(\n            \"Disabled padding because no padding token set (pad_token: {}, pad_token_id: {}).\\n\"\n            \"To remove this error, you can add a new pad token and then resize model embedding:\\n\"\n            \"\\ttokenizer.pad_token = '<PAD>'\\n\\tmodel.resize_token_embeddings(len(tokenizer))\".format(\n                pad_token, pad_token_id\n            )\n        )\n\n    yield\n\n    # TODO(morgan, anthony): once we have a simple way to serialize tokenizers maybe store and restore the state afterward\n    # to avoid destructing the padding / truncation strategy as we do now.\n\n    if max_length is not None:\n        tokenizer.no_truncation()\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.no_padding()\n\n\nclass BatchEncoding(UserDict):\n    \"\"\" BatchEncoding hold the output of the encode and batch_encode methods (tokens, attention_masks, etc).\n        This class is derived from a python Dictionary and can be used as a dictionnary.\n        In addition, this class expose utility methods to map from word/char space to token space.\n\n        Args:\n            data (:obj:`dict`): Dictionary of lists/arrays returned by the encode/batch_encode methods ('input_ids', 'attention_mask'...)\n            encoding (:obj:`EncodingFast`, :obj:`list(EncodingFast)`, `optional`, defaults to :obj:`None`):\n                If the tokenizer is a fast tokenizer which outputs additional informations like mapping from word/char space to token space\n                the `EncodingFast` instance or list of instance (for batches) hold these informations.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        data: Optional[Dict[str, Any]] = None,\n        encoding: Optional[Union[EncodingFast, Sequence[EncodingFast]]] = None,\n    ):\n        super().__init__(data)\n\n        if isinstance(encoding, EncodingFast):\n            encoding = [encoding]\n\n        
self._encodings = encoding\n\n    def __getitem__(self, item: Union[int, str]) -> EncodingFast:\n        \"\"\" If the key is a string, get the value of the dict associated to `key` ('input_ids', 'attention_mask'...)\n            If the key is an integer, get the EncodingFast for batch item with index `key`\n        \"\"\"\n        if isinstance(item, str):\n            return self.data[item]\n        elif self._encodings is not None:\n            return self._encodings[item]\n        else:\n            raise KeyError(\n                \"Indexing with integers (to access backend Encoding for a given batch index) \"\n                \"is not available when using Python based tokenizers\"\n            )\n\n    def __getattr__(self, item: str):\n        return self.data[item]\n\n    def keys(self):\n        return self.data.keys()\n\n    def values(self):\n        return self.data.values()\n\n    def items(self):\n        return self.data.items()\n\n    # After this point:\n    # Extended properties and methods only available for fast (Rust-based) tokenizers\n    # provided by HuggingFace tokenizers library.\n\n    @property\n    def encodings(self) -> Optional[List[EncodingFast]]:\n        \"\"\"\n        Return the list all encoding from the tokenization process\n\n        Returns: List[EncodingFast] or None if input was tokenized through Python (i.e. not fast) tokenizer\n        \"\"\"\n        return self._encodings\n\n    def tokens(self, batch_index: int = 0) -> List[int]:\n        if not self._encodings:\n            raise ValueError(\"tokens() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].tokens\n\n    def words(self, batch_index: int = 0) -> List[Optional[int]]:\n        if not self._encodings:\n            raise ValueError(\"words() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].words\n\n    def token_to_word(self, batch_or_token_index: int, token_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the word corresponding (i.e. comprising) to an encoded token\n            in a sequence of the batch.\n\n            Can be called as:\n                - self.token_to_word(token_index) if batch size is 1\n                - self.token_to_word(batch_index, token_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token in the sequence.\n\n        Returns:\n            word_index (:obj:`int`):\n                index of the word in the input sequence.\n\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_word() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if token_index < 0:\n            token_index = self._seq_len + token_index\n        return self._encodings[batch_index].token_to_word(token_index)\n\n    def word_to_tokens(self, batch_or_word_index: int, word_index: Optional[int] = None) -> TokenSpan:\n        \"\"\" Get the encoded token span corresponding to a word in the sequence of the batch.\n\n            Token spans are returned as a TokenSpan NamedTuple with:\n                start: index of the first token\n                end: index of the token following the last token\n\n            Can be called as:\n                - self.word_to_tokens(word_index) if batch size is 1\n                - self.word_to_tokens(batch_index, word_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprises one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            token_span (:obj:`TokenSpan`):\n                Span of tokens in the encoded sequence.\n\n                TokenSpan are NamedTuple with:\n                    start: index of the first token\n                    end: index of the token following the last token\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_tokens() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if word_index < 0:\n            word_index = self._seq_len + word_index\n        return TokenSpan(*(self._encodings[batch_index].word_to_tokens(word_index)))\n\n    def token_to_chars(self, batch_or_token_index: int, token_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span corresponding to an encoded token in a sequence of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string associated to the token\n                end: index of the character following the last character in the original string associated to the token\n\n            Can be called as:\n                - self.token_to_chars(token_index) if batch size is 1\n                - self.token_to_chars(batch_index, token_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token or tokens in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan`):\n                Span of characters in the original string.\n\n                CharSpan are NamedTuple with:\n                    start: index of the first character in the original string\n                    end: index of the character following the last character in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_chars() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        return CharSpan(*(self._encodings[batch_index].token_to_chars(token_index)))\n\n    def char_to_token(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the token in the encoded output comprising a character\n            in the original string for a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_token(char_index) if batch size is 1\n                - self.char_to_token(batch_index, char_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n\n        Returns:\n            token_index (:obj:`int`):\n                Index of the token.\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_token() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_token(char_index)\n\n    def word_to_chars(self, batch_or_word_index: int, word_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span in the original string corresponding to given word in a sequence\n            of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string\n                end: index of the character following the last character in the original string\n\n            Can be called as:\n                - self.word_to_chars(word_index) if batch size is 1\n                - self.word_to_chars(batch_index, word_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan` or :obj:`List[CharSpan]`):\n                Span(s) of the associated character or characters in the string.\n                CharSpan are NamedTuple with:\n                    start: index of the first character associated to the token in the original string\n                    end: index of the character following the last character associated to the token in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_chars() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index)))\n\n    def char_to_word(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the word in the original string corresponding to a character in the original string of\n            a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_word(char_index) if batch size is 1\n                - self.char_to_word(batch_index, char_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the character in the orginal string.\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the character in the orginal string.\n\n\n        Returns:\n            token_index (:obj:`int` or :obj:`List[int]`):\n                Index or indices of the associated encoded token(s).\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_word() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_word(char_index)\n\n    @torch_required\n    def to(self, device: str):\n        \"\"\"Send all values to device by calling v.to(device)\"\"\"\n        self.data = {k: v.to(device) for k, v in self.data.items()}\n        return self\n\n\nclass SpecialTokensMixin:\n    \"\"\" SpecialTokensMixin is derived by ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` and\n        handles specific behaviors related to special tokens. 
In particular, this class hold the\n        attributes which can be used to directly access to these special tokens in a\n        model-independant manner and allow to set and update the special tokens.\n    \"\"\"\n\n    SPECIAL_TOKENS_ATTRIBUTES = [\n        \"bos_token\",\n        \"eos_token\",\n        \"unk_token\",\n        \"sep_token\",\n        \"pad_token\",\n        \"cls_token\",\n        \"mask_token\",\n        \"additional_special_tokens\",\n    ]\n\n    def __init__(self, **kwargs):\n        self._bos_token = None\n        self._eos_token = None\n        self._unk_token = None\n        self._sep_token = None\n        self._pad_token = None\n        self._cls_token = None\n        self._mask_token = None\n        self._pad_token_type_id = 0\n        self._additional_special_tokens = []\n\n        for key, value in kwargs.items():\n            if key in self.SPECIAL_TOKENS_ATTRIBUTES:\n                if key == \"additional_special_tokens\":\n                    assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                    setattr(self, key, value)\n                elif isinstance(value, AddedTokenFast):\n                    setattr(self, key, str(value))\n                elif isinstance(value, str):\n                    setattr(self, key, value)\n                else:\n                    raise TypeError(\n                        \"special token {} has to be either str or AddedTokenFast but got: {}\".format(key, type(value))\n                    )\n\n    @property\n    def bos_token(self):\n        \"\"\" Beginning of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._bos_token is None:\n            logger.error(\"Using bos_token, but it is not set yet.\")\n        return self._bos_token\n\n    @property\n    def eos_token(self):\n        \"\"\" End of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._eos_token is None:\n            logger.error(\"Using eos_token, but it is not set yet.\")\n        return self._eos_token\n\n    @property\n    def unk_token(self):\n        \"\"\" Unknown token (string). Log an error if used while not having been set. \"\"\"\n        if self._unk_token is None:\n            logger.error(\"Using unk_token, but it is not set yet.\")\n        return self._unk_token\n\n    @property\n    def sep_token(self):\n        \"\"\" Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        if self._sep_token is None:\n            logger.error(\"Using sep_token, but it is not set yet.\")\n        return self._sep_token\n\n    @property\n    def pad_token(self):\n        \"\"\" Padding token (string). Log an error if used while not having been set. \"\"\"\n        if self._pad_token is None:\n            logger.error(\"Using pad_token, but it is not set yet.\")\n        return self._pad_token\n\n    @property\n    def cls_token(self):\n        \"\"\" Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. \"\"\"\n        if self._cls_token is None:\n            logger.error(\"Using cls_token, but it is not set yet.\")\n        return self._cls_token\n\n    @property\n    def mask_token(self):\n        \"\"\" Mask token (string). E.g. when training a model with masked-language modeling. 
Log an error if used while not having been set. \"\"\"\n        if self._mask_token is None:\n            logger.error(\"Using mask_token, but it is not set yet.\")\n        return self._mask_token\n\n    @property\n    def additional_special_tokens(self):\n        \"\"\" All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. \"\"\"\n        if self._additional_special_tokens is None:\n            logger.error(\"Using additional_special_tokens, but it is not set yet.\")\n        return self._additional_special_tokens\n\n    def _maybe_update_backend(self, value):\n        \"\"\" To be overriden by derived class if a backend tokenizer has to be updated. \"\"\"\n        pass\n\n    @bos_token.setter\n    def bos_token(self, value):\n        self._bos_token = value\n        self._maybe_update_backend([value])\n\n    @eos_token.setter\n    def eos_token(self, value):\n        self._eos_token = value\n        self._maybe_update_backend([value])\n\n    @unk_token.setter\n    def unk_token(self, value):\n        self._unk_token = value\n        self._maybe_update_backend([value])\n\n    @sep_token.setter\n    def sep_token(self, value):\n        self._sep_token = value\n        self._maybe_update_backend([value])\n\n    @pad_token.setter\n    def pad_token(self, value):\n        self._pad_token = value\n        self._maybe_update_backend([value])\n\n    @cls_token.setter\n    def cls_token(self, value):\n        self._cls_token = value\n        self._maybe_update_backend([value])\n\n    @mask_token.setter\n    def mask_token(self, value):\n        self._mask_token = value\n        self._maybe_update_backend([value])\n\n    @additional_special_tokens.setter\n    def additional_special_tokens(self, value):\n        self._additional_special_tokens = value\n        self._maybe_update_backend(value)\n\n    @property\n    def bos_token_id(self):\n        \"\"\" Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.bos_token)\n\n    @property\n    def eos_token_id(self):\n        \"\"\" Id of the end of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.eos_token)\n\n    @property\n    def unk_token_id(self):\n        \"\"\" Id of the unknown token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.unk_token)\n\n    @property\n    def sep_token_id(self):\n        \"\"\" Id of the separation token in the vocabulary. E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.sep_token)\n\n    @property\n    def pad_token_id(self):\n        \"\"\" Id of the padding token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.pad_token)\n\n    @property\n    def pad_token_type_id(self):\n        \"\"\" Id of the padding token type in the vocabulary.\"\"\"\n        return self._pad_token_type_id\n\n    @property\n    def cls_token_id(self):\n        \"\"\" Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. 
\"\"\"\n        return self.convert_tokens_to_ids(self.cls_token)\n\n    @property\n    def mask_token_id(self):\n        \"\"\" Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.mask_token)\n\n    @property\n    def additional_special_tokens_ids(self):\n        \"\"\" Ids of all the additional special tokens in the vocabulary (list of integers). Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.additional_special_tokens)\n\n    @property\n    def special_tokens_map(self):\n        \"\"\" A dictionary mapping special token class attribute (cls_token, unk_token...) to their\n            values ('<unk>', '<cls>'...)\n        \"\"\"\n        set_attr = {}\n        for attr in self.SPECIAL_TOKENS_ATTRIBUTES:\n            attr_value = getattr(self, \"_\" + attr)\n            if attr_value:\n                set_attr[attr] = attr_value\n        return set_attr\n\n    @property\n    def all_special_tokens(self):\n        \"\"\" List all the special tokens ('<unk>', '<cls>'...) mapped to class attributes\n            (cls_token, unk_token...).\n        \"\"\"\n        all_toks = []\n        set_attr = self.special_tokens_map\n        for attr_value in set_attr.values():\n            all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])\n        all_toks = list(set(all_toks))\n        return all_toks\n\n    @property\n    def all_special_ids(self):\n        \"\"\" List the vocabulary indices of the special tokens ('<unk>', '<cls>'...) mapped to\n            class attributes (cls_token, unk_token...).\n        \"\"\"\n        all_toks = self.all_special_tokens\n        all_ids = self.convert_tokens_to_ids(all_toks)\n        return all_ids\n\n\nclass PreTrainedTokenizer(SpecialTokensMixin):\n    \"\"\" Base class for all tokenizers.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as 
associated values, a dictionnary of specific arguments to pass to the\n            ``__init__``method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, will default to VERY_LARGE_INTEGER (`int(1e30)`).\n            no associated max_length can be found in ``max_model_input_sizes``.\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensure they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    vocab_files_names: Dict[str, str] = {}\n    pretrained_vocab_files_map: Dict[str, Dict[str, str]] = {}\n    pretrained_init_configuration: Dict[str, Dict[str, Any]] = {}\n    max_model_input_sizes: Dict[str, int] = {}\n    model_input_names: List[str] = [\"token_type_ids\", \"attention_mask\"]\n\n    padding_side: str = \"right\"\n\n    NO_PAD_TOKEN_FOR_BATCH_MSG = (\n        \"No padding token is set for this model, therefore no batch can be made with uneven \"\n        \"sequences. Set a padding token or adjust the lengths of the sequences building the \"\n        \"batch so that every sequence is of the same length.\"\n    )\n\n    UNEVEN_SEQUENCES_FOR_BATCH_MSG = (\n        \"The sequences building the batch are not of the same size, no tensor \"\n        \"can be built. 
Set `pad_to_max_length=True` to pad the smaller sequences\"\n        \"up to the larger sequence's length.\"\n    )\n\n    @property\n    def vocab_size(self) -> int:\n        \"\"\" Size of the base vocabulary (without the added tokens) \"\"\"\n        raise NotImplementedError\n\n    @property\n    def is_fast(self) -> bool:\n        return False\n\n    @property\n    def max_len(self) -> int:\n        \"\"\" Kept here for backward compatibility.\n            Now renamed to `model_max_length` to avoid ambiguity.\n        \"\"\"\n        return self.model_max_length\n\n    @property\n    def max_len_single_sentence(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=False)\n\n    @property\n    def max_len_sentences_pair(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=True)\n\n    @max_len_single_sentence.setter\n    def max_len_single_sentence(self, value) -> int:\n        \"\"\" For backward compatibility, allow to try to setup 'max_len_single_sentence' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=False):\n            logger.warning(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    @max_len_sentences_pair.setter\n    def max_len_sentences_pair(self, value) -> int:\n        \"\"\" For backward compatibility, allow to try to setup 'max_len_sentences_pair' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=True):\n            logger.warning(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    def get_vocab(self):\n        \"\"\" Returns the vocabulary as a dict of {token: index} pairs. `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab. \"\"\"\n        raise NotImplementedError()\n\n    def __init__(self, model_max_length=None, **kwargs):\n\n        super().__init__(**kwargs)\n\n        # For backward compatibility we fallback to set model_max_length from max_len if provided\n        if \"max_len\" in kwargs:\n            warnings.warn(\n                \"Parameter max_len is deprecated and will be removed in a future release. \"\n                \"Use model_max_length instead.\",\n                category=FutureWarning,\n            )\n\n            model_max_length = kwargs.pop(\"max_len\")\n        self.model_max_length = model_max_length if model_max_length is not None else VERY_LARGE_INTEGER\n\n        # Padding side is right by default and overridden in subclasses. 
If specified in the kwargs, it is changed.\n        self.padding_side = kwargs.pop(\"padding_side\", self.padding_side)\n        assert self.padding_side in [\n            \"right\",\n            \"left\",\n        ], f\"Padding side should be selected between 'right' and 'left', current value: {self.padding_side}\"\n        self.model_input_names = kwargs.pop(\"model_input_names\", self.model_input_names)\n\n        # Added tokens\n        self.added_tokens_encoder = {}\n        self.unique_added_tokens_encoder = set()\n        self.added_tokens_decoder = {}\n\n        # inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``)\n        self.init_inputs = ()\n        self.init_kwargs = {}\n\n    def __len__(self):\n        \"\"\" Size of the full vocabulary with the added tokens \"\"\"\n        return self.vocab_size + len(self.added_tokens_encoder)\n\n    @classmethod\n    def from_pretrained(cls, *inputs, **kwargs):\n        r\"\"\"\n        Instantiate a :class:`~transformers1.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. 
See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')\n\n            # If the tokenizer uses a single vocabulary file, you can point directly to this file\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')\n\n            # You can link tokens to special vocabulary when instantiating\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')\n            # You should be sure '<unk>' is in the vocabulary when doing that.\n            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)\n            assert tokenizer.unk_token == '<unk>'\n\n        \"\"\"\n        return cls._from_pretrained(*inputs, **kwargs)\n\n    @classmethod\n    def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        s3_models = list(cls.max_model_input_sizes.keys())\n        vocab_files = {}\n        init_configuration = {}\n        if pretrained_model_name_or_path in s3_models:\n            # Get the vocabulary from AWS S3 bucket\n            for file_id, map_list in cls.pretrained_vocab_files_map.items():\n                vocab_files[file_id] = map_list[pretrained_model_name_or_path]\n            if (\n                cls.pretrained_init_configuration\n                and pretrained_model_name_or_path in cls.pretrained_init_configuration\n            ):\n                init_configuration = cls.pretrained_init_configuration[pretrained_model_name_or_path].copy()\n        else:\n            # Get the vocabulary from local files\n            logger.info(\n                \"Model name '{}' not found in model shortcut name list ({}). 
\"\n                \"Assuming '{}' is a path, a model identifier, or url to a directory containing tokenizer files.\".format(\n                    pretrained_model_name_or_path, \", \".join(s3_models), pretrained_model_name_or_path\n                )\n            )\n\n            if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                if len(cls.vocab_files_names) > 1:\n                    raise ValueError(\n                        f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not supported.\"\n                        \"Use a model identifier or the path to a directory instead.\"\n                    )\n                logger.warning(\n                    f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is deprecated\"\n                )\n                file_id = list(cls.vocab_files_names.keys())[0]\n                vocab_files[file_id] = pretrained_model_name_or_path\n            else:\n                # At this point pretrained_model_name_or_path is either a directory or a model identifier name\n                additional_files_names = {\n                    \"added_tokens_file\": ADDED_TOKENS_FILE,\n                    \"special_tokens_map_file\": SPECIAL_TOKENS_MAP_FILE,\n                    \"tokenizer_config_file\": TOKENIZER_CONFIG_FILE,\n                }\n                # Look for the tokenizer main vocabulary files + the additional tokens files\n                for file_id, file_name in {**cls.vocab_files_names, **additional_files_names}.items():\n                    if os.path.isdir(pretrained_model_name_or_path):\n                        full_file_name = os.path.join(pretrained_model_name_or_path, file_name)\n                        if not os.path.exists(full_file_name):\n                            logger.info(\"Didn't find file {}. We won't load it.\".format(full_file_name))\n                            full_file_name = None\n                    else:\n                        full_file_name = hf_bucket_url(\n                            pretrained_model_name_or_path, filename=file_name, use_cdn=False\n                        )\n\n                    vocab_files[file_id] = full_file_name\n\n        # Get files from url, cache, or disk depending on the case\n        try:\n            resolved_vocab_files = {}\n            for file_id, file_path in vocab_files.items():\n                if file_path is None:\n                    resolved_vocab_files[file_id] = None\n                else:\n                    resolved_vocab_files[file_id] = cached_path(\n                        file_path,\n                        cache_dir=cache_dir,\n                        force_download=force_download,\n                        proxies=proxies,\n                        resume_download=resume_download,\n                        local_files_only=local_files_only,\n                    )\n        except EnvironmentError:\n            if pretrained_model_name_or_path in s3_models:\n                msg = \"Couldn't reach server at '{}' to download vocabulary files.\"\n            else:\n                msg = (\n                    \"Model name '{}' was not found in tokenizers model name list ({}). 
\"\n                    \"We assumed '{}' was a path or url to a directory containing vocabulary files \"\n                    \"named {}, but couldn't find such vocabulary files at this path or url.\".format(\n                        pretrained_model_name_or_path,\n                        \", \".join(s3_models),\n                        pretrained_model_name_or_path,\n                        list(cls.vocab_files_names.values()),\n                    )\n                )\n\n            raise EnvironmentError(msg)\n\n        if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):\n            raise EnvironmentError(\n                \"Model name '{}' was not found in tokenizers model name list ({}). \"\n                \"We assumed '{}' was a path, a model identifier, or url to a directory containing vocabulary files \"\n                \"named {} but couldn't find such vocabulary files at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(s3_models),\n                    pretrained_model_name_or_path,\n                    list(cls.vocab_files_names.values()),\n                )\n            )\n\n        for file_id, file_path in vocab_files.items():\n            if file_path == resolved_vocab_files[file_id]:\n                logger.info(\"loading file {}\".format(file_path))\n            else:\n                logger.info(\"loading file {} from cache at {}\".format(file_path, resolved_vocab_files[file_id]))\n\n        # Prepare tokenizer initialization kwargs\n        # Did we saved some inputs and kwargs to reload ?\n        tokenizer_config_file = resolved_vocab_files.pop(\"tokenizer_config_file\", None)\n        if tokenizer_config_file is not None:\n            with open(tokenizer_config_file, encoding=\"utf-8\") as tokenizer_config_handle:\n                init_kwargs = json.load(tokenizer_config_handle)\n            saved_init_inputs = init_kwargs.pop(\"init_inputs\", ())\n            if not init_inputs:\n                init_inputs = saved_init_inputs\n        else:\n            init_kwargs = init_configuration\n\n        # Update with newly provided kwargs\n        init_kwargs.update(kwargs)\n\n        # Set max length if needed\n        if pretrained_model_name_or_path in cls.max_model_input_sizes:\n            # if we're using a pretrained model, ensure the tokenizer\n            # wont index sequences longer than the number of positional embeddings\n            model_max_length = cls.max_model_input_sizes[pretrained_model_name_or_path]\n            if model_max_length is not None and isinstance(model_max_length, (int, float)):\n                init_kwargs[\"model_max_length\"] = min(init_kwargs.get(\"model_max_length\", int(1e30)), model_max_length)\n\n        # Merge resolved_vocab_files arguments in init_kwargs.\n        added_tokens_file = resolved_vocab_files.pop(\"added_tokens_file\", None)\n        special_tokens_map_file = resolved_vocab_files.pop(\"special_tokens_map_file\", None)\n        for args_name, file_path in resolved_vocab_files.items():\n            if args_name not in init_kwargs:\n                init_kwargs[args_name] = file_path\n        if special_tokens_map_file is not None:\n            with open(special_tokens_map_file, encoding=\"utf-8\") as special_tokens_map_handle:\n                special_tokens_map = json.load(special_tokens_map_handle)\n            for key, value in special_tokens_map.items():\n                if key not in init_kwargs:\n                    
init_kwargs[key] = value\n\n        # Instantiate tokenizer.\n        try:\n            tokenizer = cls(*init_inputs, **init_kwargs)\n        except OSError:\n            raise OSError(\n                \"Unable to load vocabulary from file. \"\n                \"Please check that the provided vocabulary is accessible and not corrupted.\"\n            )\n\n        # Save inputs and kwargs for saving and re-loading with ``save_pretrained``\n        tokenizer.init_inputs = init_inputs\n        tokenizer.init_kwargs = init_kwargs\n\n        # update unique_added_tokens_encoder with special tokens for correct tokenization\n        tokenizer.unique_added_tokens_encoder.update(set(tokenizer.all_special_tokens))\n\n        # Add supplementary tokens.\n        if added_tokens_file is not None:\n            with open(added_tokens_file, encoding=\"utf-8\") as added_tokens_handle:\n                added_tok_encoder = json.load(added_tokens_handle)\n            added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n            tokenizer.added_tokens_encoder.update(added_tok_encoder)\n            tokenizer.added_tokens_decoder.update(added_tok_decoder)\n            tokenizer.unique_added_tokens_encoder.update(set(tokenizer.added_tokens_encoder.keys()))\n\n        return tokenizer\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save the tokenizer vocabulary files together with:\n                - added tokens,\n                - special-tokens-to-class-attributes-mapping,\n                - tokenizer instantiation positional and keywords inputs (e.g. do_lower_case for Bert).\n\n            Warning: This won't save modifications you may have applied to the tokenizer after the instantiation\n            (e.g. modifying tokenizer.do_lower_case after creation).\n\n            This method make sure the full tokenizer can then be re-loaded using the\n            :func:`~transformers1.PreTrainedTokenizer.from_pretrained` class method.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Saving directory ({}) should be a directory\".format(save_directory))\n            return\n\n        special_tokens_map_file = os.path.join(save_directory, SPECIAL_TOKENS_MAP_FILE)\n        added_tokens_file = os.path.join(save_directory, ADDED_TOKENS_FILE)\n        tokenizer_config_file = os.path.join(save_directory, TOKENIZER_CONFIG_FILE)\n\n        tokenizer_config = copy.deepcopy(self.init_kwargs)\n        if len(self.init_inputs) > 0:\n            tokenizer_config[\"init_inputs\"] = copy.deepcopy(self.init_inputs)\n        for file_id in self.vocab_files_names.keys():\n            tokenizer_config.pop(file_id, None)\n\n        with open(tokenizer_config_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(tokenizer_config, ensure_ascii=False))\n\n        with open(special_tokens_map_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.special_tokens_map, ensure_ascii=False))\n\n        if len(self.added_tokens_encoder) > 0:\n            with open(added_tokens_file, \"w\", encoding=\"utf-8\") as f:\n                out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)\n                f.write(out_str)\n\n        vocab_files = self.save_vocabulary(save_directory)\n\n        return vocab_files + (special_tokens_map_file, added_tokens_file)\n\n    def save_vocabulary(self, save_directory) -> Tuple[str]:\n        \"\"\" Save the tokenizer vocabulary to a directory. 
This method does *NOT* save added tokens\n            and special token mappings.\n\n            Please use :func:`~transformers1.PreTrainedTokenizer.save_pretrained` `()` to save the full\n            Tokenizer state if you want to reload it using the :func:`~transformers1.PreTrainedTokenizer.from_pretrained`\n            class method.\n        \"\"\"\n        raise NotImplementedError\n\n    def add_tokens(self, new_tokens: Union[str, List[str]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string. Each string is a token to add. Tokens are only added if they are not\n            already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.\n        \"\"\"\n        if not new_tokens:\n            return 0\n\n        if not isinstance(new_tokens, list):\n            new_tokens = [new_tokens]\n\n        tokens_to_add = []\n        for token in new_tokens:\n            assert isinstance(token, str)\n            if self.init_kwargs.get(\"do_lower_case\", False) and token not in self.all_special_tokens:\n                token = token.lower()\n            if (\n                token != self.unk_token\n                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)\n                and token not in tokens_to_add\n            ):\n                tokens_to_add.append(token)\n                logger.info(\"Adding %s to the vocabulary\", token)\n\n        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))\n        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n        self.added_tokens_encoder.update(added_tok_encoder)\n        self.unique_added_tokens_encoder = set(self.added_tokens_encoder.keys()).union(set(self.all_special_tokens))\n        self.added_tokens_decoder.update(added_tok_decoder)\n\n        return len(tokens_to_add)\n\n    def num_special_tokens_to_add(self, pair=False):\n        \"\"\"\n        Returns the number of added tokens when encoding a sequence with special tokens.\n\n        Note:\n            This encodes inputs and checks the number of added tokens, and is therefore not efficient. 
Do not put this\n            inside your training loop.\n\n        Args:\n            pair: Returns the number of added tokens in the case of a sequence pair if set to True, returns the\n                number of added tokens in the case of a single sequence if set to False.\n\n        Returns:\n            Number of tokens added to sequences\n        \"\"\"\n        token_ids_0 = []\n        token_ids_1 = []\n        return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))\n\n    def add_special_tokens(self, special_tokens_dict):\n        \"\"\"\n        Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them\n        to class attributes. If special tokens are NOT in the vocabulary, they are added\n        to it (indexed starting from the last index of the current vocabulary).\n\n        Using `add_special_tokens` will ensure your special tokens can be used in several ways:\n\n        - special tokens are carefully handled by the tokenizer (they are never split)\n        - you can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This makes it easy to develop model-agnostic training and fine-tuning scripts.\n\n        When possible, special tokens are already registered for provided pretrained models (ex: BertTokenizer cls_token is already registered to be '[CLS]' and XLM's one is also registered to be '</s>')\n\n        Args:\n            special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes:\n                [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``,\n                ``additional_special_tokens``].\n\n                Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to add a new classification token to GPT-2\n            tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n            model = GPT2Model.from_pretrained('gpt2')\n\n            special_tokens_dict = {'cls_token': '<CLS>'}\n\n            num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n\n            assert tokenizer.cls_token == '<CLS>'\n        \"\"\"\n        if not special_tokens_dict:\n            return 0\n\n        added_tokens = 0\n        for key, value in special_tokens_dict.items():\n            assert key in self.SPECIAL_TOKENS_ATTRIBUTES\n            if key == \"additional_special_tokens\":\n                assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                added_tokens += self.add_tokens(value)\n            else:\n                assert isinstance(value, str)\n                added_tokens += self.add_tokens([value])\n            logger.info(\"Assigning %s to the %s key of the tokenizer\", value, key)\n            setattr(self, key, value)\n\n        return added_tokens\n\n    def tokenize(self, text: TextInput, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Take care of added tokens.\n\n            Args:\n                text (:obj:`string`): The sequence to be encoded.\n                **kwargs (:obj: `dict`): Arguments passed to the model-specific `prepare_for_tokenization` preprocessing method.\n        \"\"\"\n        all_special_tokens = self.all_special_tokens\n        text = self.prepare_for_tokenization(text, **kwargs)\n\n        # TODO: should this be in the base class?\n        def lowercase_text(t):\n            # convert non-special tokens to lowercase\n            escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]\n            pattern = r\"(\" + r\"|\".join(escaped_special_toks) + r\")|\" + r\"(.+?)\"\n            return re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), t)\n\n        if self.init_kwargs.get(\"do_lower_case\", False):\n            text = lowercase_text(text)\n\n        def split_on_token(tok, text):\n            result = []\n            split_text = text.split(tok)\n            for i, sub_text in enumerate(split_text):\n                sub_text = sub_text.rstrip()\n                if i == 0 and not sub_text:\n                    result += [tok]\n                elif i == len(split_text) - 1:\n                    if sub_text:\n                        result += [sub_text]\n                    else:\n                        pass\n                else:\n                    if sub_text:\n                        result += [sub_text]\n                    result += [tok]\n            return result\n\n        def split_on_tokens(tok_list, text):\n            if not text.strip():\n                return []\n            if not tok_list:\n                return self._tokenize(text)\n\n            tokenized_text = []\n            text_list = [text]\n            for tok in tok_list:\n                tokenized_text = []\n                for sub_text in text_list:\n                    if sub_text not in self.unique_added_tokens_encoder:\n                        tokenized_text += split_on_token(tok, sub_text)\n                    else:\n                        tokenized_text += [sub_text]\n                text_list = tokenized_text\n\n            return list(\n                itertools.chain.from_iterable(\n                    (\n                        self._tokenize(token) if token not in self.unique_added_tokens_encoder else [token]\n                        for token in tokenized_text\n                    )\n           
     )\n            )\n\n        added_tokens = self.unique_added_tokens_encoder\n        tokenized_text = split_on_tokens(added_tokens, text)\n        return tokenized_text\n\n    def _tokenize(self, text, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Do NOT take care of added tokens.\n        \"\"\"\n        raise NotImplementedError\n\n    def convert_tokens_to_ids(self, tokens):\n        \"\"\" Converts a token string (or a sequence of tokens) in a single integer id\n            (or a sequence of ids), using the vocabulary.\n        \"\"\"\n        if tokens is None:\n            return None\n\n        if isinstance(tokens, str):\n            return self._convert_token_to_id_with_added_voc(tokens)\n\n        ids = []\n        for token in tokens:\n            ids.append(self._convert_token_to_id_with_added_voc(token))\n        return ids\n\n    def _convert_token_to_id_with_added_voc(self, token):\n        if token is None:\n            return None\n\n        if token in self.added_tokens_encoder:\n            return self.added_tokens_encoder[token]\n        return self._convert_token_to_id(token)\n\n    def _convert_token_to_id(self, token):\n        raise NotImplementedError\n\n    def encode(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        **kwargs\n    ):\n        \"\"\"\n        Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.\n\n        Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary.\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. 
The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            **kwargs: passed to the `self.tokenize()` method\n        \"\"\"\n        encoded_inputs = self.encode_plus(\n            text,\n            text_pair=text_pair,\n            max_length=max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            return_tensors=return_tensors,\n            **kwargs,\n        )\n\n        return encoded_inputs[\"input_ids\"]\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]` (the later only for not-fast tokenizers)):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_mask (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. 
If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_mask (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError.\n                This one is only available on fast tokenizers inheriting from PreTrainedTokenizerFast.\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    attention_mask: list[int] if return_attention_mask is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True``\n                    and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. 
Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. \"\n                \"In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` \"\n                \"or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        first_ids = get_input_ids(text)\n        second_ids = get_input_ids(text_pair) if text_pair is not None else None\n\n        return self.prepare_for_model(\n            first_ids,\n            pair_ids=second_ids,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            return_tensors=return_tensors,\n            return_attention_mask=return_attention_mask,\n            return_token_type_ids=return_token_type_ids,\n            return_overflowing_tokens=return_overflowing_tokens,\n            return_special_tokens_mask=return_special_tokens_mask,\n        )\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput],\n            List[TextInputPair],\n            List[PreTokenizedInput],\n            List[PreTokenizedInputPair],\n            List[EncodedInput],\n            List[EncodedInputPair],\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_masks: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_masks: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            batch_text_or_text_pairs (:obj:`List[str]`,  :obj:`List[Tuple[str, str]]`,\n                                      :obj:`List[List[str]]`,  :obj:`List[Tuple[List[str], List[str]]]`,\n                                      and for not-fast tokenizers, also:\n                                      :obj:`List[List[int]]`,  :obj:`List[Tuple[List[int], List[int]]]`):\n                Batch of sequences or pair of sequences to be encoded.\n                This can be a list of 
string/string-sequences/int-sequences or a list of pair of\n                string/string-sequences/int-sequence (see details in encode_plus)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_masks (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? 
<../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_masks (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError. This one is only available on\n                Rust-based tokenizers inheriting from PreTrainedTokenizerFast.\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[List[int]],\n                    token_type_ids: list[List[int]] if return_token_type_ids is True (default)\n                    attention_mask: list[List[int]] if return_attention_mask is True (default)\n                    overflowing_tokens: list[List[int]] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: List[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[List[int]] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. 
In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        input_ids = []\n        for ids_or_pair_ids in batch_text_or_text_pairs:\n            if isinstance(ids_or_pair_ids, (list, tuple)) and len(ids_or_pair_ids) == 2 and not is_pretokenized:\n                ids, pair_ids = ids_or_pair_ids\n            else:\n                ids, pair_ids = ids_or_pair_ids, None\n\n            first_ids = get_input_ids(ids)\n            second_ids = get_input_ids(pair_ids) if pair_ids is not None else None\n            input_ids.append((first_ids, second_ids))\n\n        if max_length is None and pad_to_max_length:\n\n            def total_sequence_length(input_pairs):\n                first_ids, second_ids = input_pairs\n                return len(first_ids) + (\n                    self.num_special_tokens_to_add()\n                    if second_ids is None\n                    else (len(second_ids) + self.num_special_tokens_to_add(pair=True))\n                )\n\n            max_length = max([total_sequence_length(ids) for ids in input_ids])\n\n        batch_outputs = {}\n        for first_ids, second_ids in input_ids:\n            # Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by\n            # the model. 
It adds special tokens, truncates sequences if overflowing while taking into account\n            # the special tokens and manages a window stride for overflowing tokens\n            outputs = self.prepare_for_model(\n                first_ids,\n                pair_ids=second_ids,\n                max_length=max_length,\n                pad_to_max_length=pad_to_max_length,\n                add_special_tokens=add_special_tokens,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_attention_mask=return_attention_masks,\n                return_token_type_ids=return_token_type_ids,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_masks,\n                return_lengths=return_lengths,\n                return_tensors=None,  # We will convert the whole batch to tensors at the end\n            )\n\n            for key, value in outputs.items():\n                if key not in batch_outputs:\n                    batch_outputs[key] = []\n                batch_outputs[key].append(value)\n\n        if return_tensors is not None:\n\n            self.convert_to_tensors_(batch_outputs, return_tensors)\n        return BatchEncoding(batch_outputs)\n\n    def convert_to_tensors_(self, batch_outputs: dict, return_tensors: str) -> None:\n        # Do the tensor conversion in batch\n        for key, value in batch_outputs.items():\n            if return_tensors == \"tf\" and is_tf_available():\n                try:\n                    batch_outputs[key] = tf.constant(value)\n                except ValueError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n            elif return_tensors == \"pt\" and is_torch_available():\n                try:\n                    batch_outputs[key] = torch.tensor(value)\n                except ValueError:\n                    raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n                except RuntimeError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise\n\n            elif return_tensors is not None:\n                logger.warning(\n                    \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                        return_tensors\n                    )\n                )\n\n    def prepare_for_model(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        max_length: Optional[int] = None,\n        add_special_tokens: bool = True,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_lengths: bool = False,\n    ) -> BatchEncoding:\n        \"\"\" Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.\n        It adds special tokens, truncates sequences if 
overflowing while taking into account the special tokens and\n        manages a moving window (with user defined stride) for overflowing tokens\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            max_length: maximum length of the returned list. Will truncate by taking into account the special tokens.\n            add_special_tokens: if set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            stride: window stride for overflowing tokens. Can be useful to remove edge effect when using sequential\n                list of inputs. The overflowing token will contains a part of the previous window of tokens.\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length: if set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the model's max length.\n                The tokenizer padding sides are handled by the following strings:\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant\n                or PyTorch torch.Tensor instead of a list of python integers.\n            return_token_type_ids: (optional) Set to False to avoid returning token_type_ids (default: set to model specifics).\n            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)\n            return_overflowing_tokens: (optional) Set to True to return overflowing token information (default False).\n            return_special_tokens_mask: (optional) Set to True to return special tokens mask information (default False).\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                    length: int if 
return_lengths is True\n                }\n\n            With the fields:\n                - ``input_ids``: list of token ids to be fed to a model\n                - ``token_type_ids``: list of token type ids to be fed to a model\n\n                - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n                - ``num_truncated_tokens``: number of overflowing tokens when a ``max_length`` is specified\n                - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n                    tokens and 1 specifying sequence tokens.\n                - ``length``: this is the length of ``input_ids``\n        \"\"\"\n        pair = bool(pair_ids is not None)\n        len_ids = len(ids)\n        len_pair_ids = len(pair_ids) if pair else 0\n\n        # Load from model defaults\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        encoded_inputs = {}\n\n        # Truncation: Handle max sequence length\n        total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)\n        if max_length and total_len > max_length:\n            ids, pair_ids, overflowing_tokens = self.truncate_sequences(\n                ids,\n                pair_ids=pair_ids,\n                num_tokens_to_remove=total_len - max_length,\n                truncation_strategy=truncation_strategy,\n                stride=stride,\n            )\n            if return_overflowing_tokens:\n                encoded_inputs[\"overflowing_tokens\"] = overflowing_tokens\n                encoded_inputs[\"num_truncated_tokens\"] = total_len - max_length\n\n        # Add special tokens\n        if add_special_tokens:\n            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)\n            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)\n        else:\n            sequence = ids + pair_ids if pair else ids\n            token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])\n\n        # Build output dictionary\n        encoded_inputs[\"input_ids\"] = sequence\n        if return_token_type_ids:\n            encoded_inputs[\"token_type_ids\"] = token_type_ids\n        if return_special_tokens_mask:\n            if add_special_tokens:\n                encoded_inputs[\"special_tokens_mask\"] = self.get_special_tokens_mask(ids, pair_ids)\n            else:\n                encoded_inputs[\"special_tokens_mask\"] = [0] * len(sequence)\n\n        # Check lengths\n        assert max_length is None or len(encoded_inputs[\"input_ids\"]) <= max_length\n        if max_length is None and len(encoded_inputs[\"input_ids\"]) > self.model_max_length:\n            logger.warning(\n                \"Token indices sequence length is longer than the specified maximum sequence length \"\n                \"for this model ({} > {}). 
Running this sequence through the model will result in \"\n                \"indexing errors\".format(len(ids), self.model_max_length)\n            )\n\n        # Padding\n        needs_to_be_padded = pad_to_max_length and (\n            max_length\n            and len(encoded_inputs[\"input_ids\"]) < max_length\n            or max_length is None\n            and len(encoded_inputs[\"input_ids\"]) < self.model_max_length\n            and self.model_max_length <= LARGE_INTEGER\n        )\n\n        if pad_to_max_length and max_length is None and self.model_max_length > LARGE_INTEGER:\n            logger.warning(\n                \"Sequence can't be padded as no maximum length is specified and the model maximum length is too high.\"\n            )\n\n        if needs_to_be_padded:\n            difference = (max_length if max_length is not None else self.model_max_length) - len(\n                encoded_inputs[\"input_ids\"]\n            )\n            if self.padding_side == \"right\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"]) + [0] * difference\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = (\n                        encoded_inputs[\"token_type_ids\"] + [self.pad_token_type_id] * difference\n                    )\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = encoded_inputs[\"special_tokens_mask\"] + [1] * difference\n                encoded_inputs[\"input_ids\"] = encoded_inputs[\"input_ids\"] + [self.pad_token_id] * difference\n            elif self.padding_side == \"left\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [0] * difference + [1] * len(encoded_inputs[\"input_ids\"])\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = [self.pad_token_type_id] * difference + encoded_inputs[\n                        \"token_type_ids\"\n                    ]\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = [1] * difference + encoded_inputs[\"special_tokens_mask\"]\n                encoded_inputs[\"input_ids\"] = [self.pad_token_id] * difference + encoded_inputs[\"input_ids\"]\n            else:\n                raise ValueError(\"Invalid padding strategy:\" + str(self.padding_side))\n        else:\n            if return_attention_mask:\n                encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"])\n\n        if return_lengths:\n            encoded_inputs[\"length\"] = len(encoded_inputs[\"input_ids\"])\n\n        # Prepare model inputs as tensors if asked\n        if return_tensors == \"tf\" and is_tf_available():\n            encoded_inputs[\"input_ids\"] = tf.constant([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = tf.constant([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = tf.constant([encoded_inputs[\"attention_mask\"]])\n\n        elif return_tensors == \"pt\" and is_torch_available():\n            encoded_inputs[\"input_ids\"] = torch.tensor([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = 
torch.tensor([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = torch.tensor([encoded_inputs[\"attention_mask\"]])\n        elif return_tensors is not None:\n            logger.warning(\n                \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                    return_tensors\n                )\n            )\n\n        return BatchEncoding(encoded_inputs)\n\n    def prepare_for_tokenization(self, text: str, **kwargs) -> str:\n        \"\"\" Performs any necessary transformations before tokenization \"\"\"\n        return text\n\n    def truncate_sequences(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        num_tokens_to_remove: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        stride: int = 0,\n    ) -> Tuple[List[int], List[int], List[int]]:\n        \"\"\" Truncates a sequence pair in place to the maximum length.\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):\n                number of tokens to remove using the truncation strategy\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences).\n                    Overflowing tokens only contains overflow from the first sequence.\n                - 'only_first': Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. 
The value of this argument defines the number of additional tokens.\n        \"\"\"\n        if num_tokens_to_remove <= 0:\n            return ids, pair_ids, []\n\n        if truncation_strategy == \"longest_first\":\n            overflowing_tokens = []\n            for _ in range(num_tokens_to_remove):\n                if pair_ids is None or len(ids) > len(pair_ids):\n                    overflowing_tokens = [ids[-1]] + overflowing_tokens\n                    ids = ids[:-1]\n                else:\n                    pair_ids = pair_ids[:-1]\n            window_len = min(len(ids), stride)\n            if window_len > 0:\n                overflowing_tokens = ids[-window_len:] + overflowing_tokens\n        elif truncation_strategy == \"only_first\":\n            assert len(ids) > num_tokens_to_remove\n            window_len = min(len(ids), stride + num_tokens_to_remove)\n            overflowing_tokens = ids[-window_len:]\n            ids = ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"only_second\":\n            assert pair_ids is not None and len(pair_ids) > num_tokens_to_remove\n            window_len = min(len(pair_ids), stride + num_tokens_to_remove)\n            overflowing_tokens = pair_ids[-window_len:]\n            pair_ids = pair_ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"do_not_truncate\":\n            raise ValueError(\"Input sequence is too long for max_length. Please select a truncation strategy.\")\n        else:\n            raise ValueError(\n                \"Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']\"\n            )\n        return (ids, pair_ids, overflowing_tokens)\n\n    def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:\n        if token_ids_1 is None:\n            return len(token_ids_0) * [0]\n        return [0] * len(token_ids_0) + [1] * len(token_ids_1)\n\n    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequences for sequence classification tasks\n        by concatenating and adding special tokens. This implementation does not add special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return token_ids_0\n        return token_ids_0 + token_ids_1\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0: list of ids (must not contain special tokens)\n            token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids\n                for sequence pairs\n            already_has_special_tokens: (default False) Set to True if the token list is already formated with\n                special tokens for the model\n\n        Returns:\n            A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))\n\n    def convert_ids_to_tokens(\n        self, ids: Union[int, List[int]], skip_special_tokens: bool = False\n    ) -> Union[int, List[int]]:\n        \"\"\" Converts a single index or a sequence of indices (integers) in a token \"\n            (resp.) a sequence of tokens (str), using the vocabulary and added tokens.\n\n            Args:\n                skip_special_tokens: Don't decode special tokens (self.all_special_tokens). Default: False\n        \"\"\"\n        if isinstance(ids, int):\n            if ids in self.added_tokens_decoder:\n                return self.added_tokens_decoder[ids]\n            else:\n                return self._convert_id_to_token(ids)\n        tokens = []\n        for index in ids:\n            index = int(index)\n            if skip_special_tokens and index in self.all_special_ids:\n                continue\n            if index in self.added_tokens_decoder:\n                tokens.append(self.added_tokens_decoder[index])\n            else:\n                tokens.append(self._convert_id_to_token(index))\n        return tokens\n\n    def _convert_id_to_token(self, index: int) -> str:\n        raise NotImplementedError\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\" Converts a sequence of tokens (string) in a single string.\n            The most simple way to do it is ' '.join(self.convert_ids_to_tokens(token_ids))\n            but we often want to remove sub-word tokenization artifacts at the same time.\n        \"\"\"\n        return \" \".join(self.convert_ids_to_tokens(tokens))\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        \"\"\"\n        Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary\n        with options to remove special tokens and clean up tokenization spaces.\n        Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.\n\n        Args:\n            token_ids: list of tokenized input ids. Can be obtained using the `encode` or `encode_plus` methods.\n            skip_special_tokens: if set to True, will replace special tokens.\n            clean_up_tokenization_spaces: if set to True, will clean up the tokenization spaces.\n        \"\"\"\n        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)\n\n        # To avoid mixing byte-level and unicode for byte-level BPT\n        # we need to build string separatly for added tokens and byte-level tokens\n        # cf. 
https://github.com/huggingface/transformers/issues/1133\n        sub_texts = []\n        current_sub_text = []\n        for token in filtered_tokens:\n            if skip_special_tokens and token in self.all_special_ids:\n                continue\n            if token in self.added_tokens_encoder:\n                if current_sub_text:\n                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n                    current_sub_text = []\n                sub_texts.append(token)\n            else:\n                current_sub_text.append(token)\n        if current_sub_text:\n            sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n        text = \" \".join(sub_texts)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:\n        return [self.decode(seq, **kwargs) for seq in sequences]\n\n    @staticmethod\n    def clean_up_tokenization(out_string: str) -> str:\n        \"\"\" Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.\n        \"\"\"\n        out_string = (\n            out_string.replace(\" .\", \".\")\n            .replace(\" ?\", \"?\")\n            .replace(\" !\", \"!\")\n            .replace(\" ,\", \",\")\n            .replace(\" ' \", \"'\")\n            .replace(\" n't\", \"n't\")\n            .replace(\" 'm\", \"'m\")\n            .replace(\" 's\", \"'s\")\n            .replace(\" 've\", \"'ve\")\n            .replace(\" 're\", \"'re\")\n        )\n        return out_string\n\n\nclass PreTrainedTokenizerFast(PreTrainedTokenizer):\n    \"\"\" Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).\n\n    Inherit from PreTrainedTokenizer.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as associated values, a dictionnary of specific arguments to pass to the\n            
``__init__``method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``tokenizer`` (`BaseTokenizerFast`): A Fast tokenizer from the HuggingFace tokenizer library (in low level Rust language)\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, will default to VERY_LARGE_INTEGER (`int(1e30)`).\n            no associated max_length can be found in ``max_model_input_sizes``.\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). 
Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensure they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    def __init__(self, tokenizer: BaseTokenizerFast, **kwargs):\n        if not isinstance(tokenizer, BaseTokenizerFast):\n            raise ValueError(\n                \"Tokenizer should be an instance of a Tokenizer \" \"provided by HuggingFace tokenizers library.\"\n            )\n        self._tokenizer: BaseTokenizerFast = tokenizer\n\n        # Initialize all the rest of the kwargs\n        super().__init__(**kwargs)\n\n    @property\n    def backend_tokenizer(self) -> BaseTokenizerFast:\n        return self._tokenizer\n\n    @property\n    def decoder(self) -> DecoderFast:\n        return self._tokenizer._tokenizer.decoder\n\n    @property\n    def is_fast(self) -> bool:\n        return True\n\n    @property\n    def vocab_size(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=False)\n\n    def __len__(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=True)\n\n    def _maybe_update_backend(self, value):\n        \"\"\" Update the backend fast tokenizer.\n            Override method from base class SpecialTokensMixin \"\"\"\n        self._tokenizer.add_special_tokens(value)\n\n    def _convert_encoding(\n        self,\n        encoding: EncodingFast,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n    ) -> Dict[str, Any]:\n        \"\"\" Convert the encoding representation (from low-level HuggingFace tokenizer output) to a python Dict.\n\n            Overflowing tokens are converted to additional examples (like batches) so the output values of\n            the dict are lists (overflows) of lists (tokens).\n\n            If return_tensors is not None, these lists of lists are converted to 2-D tensors\n            for input_ids, token_type_ids and attention_mask.\n            Output shape: (overflows, sequence length)\n        \"\"\"\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        if return_overflowing_tokens and encoding.overflowing is not None:\n            encodings = [encoding] + encoding.overflowing\n        else:\n            encodings = [encoding]\n\n        encoding_dict = defaultdict(list)\n        for e in encodings:\n            encoding_dict[\"input_ids\"].append(e.ids)\n\n            if return_token_type_ids:\n                encoding_dict[\"token_type_ids\"].append(e.type_ids)\n            if return_attention_mask:\n                encoding_dict[\"attention_mask\"].append(e.attention_mask)\n            if return_special_tokens_mask:\n                encoding_dict[\"special_tokens_mask\"].append(e.special_tokens_mask)\n            if return_offsets_mapping:\n                encoding_dict[\"offset_mapping\"].append(e.offsets)\n\n        if return_tensors is not 
None:\n            for key, value in encoding_dict.items():\n                if return_tensors == \"tf\" and is_tf_available():\n                    encoding_dict[key] = tf.constant(value)\n                elif return_tensors == \"pt\" and is_torch_available():\n                    encoding_dict[key] = torch.tensor(value)\n                elif return_tensors is not None:\n                    logger.warning(\n                        \"Unable to convert output to tensors format {}, \"\n                        \"PyTorch or TensorFlow is not available.\".format(return_tensors)\n                    )\n\n        return encoding_dict\n\n    def _convert_token_to_id_with_added_voc(self, token: int) -> str:\n        index = self._tokenizer.token_to_id(token)\n        if index is None:\n            return self.unk_token_id\n        return index\n\n    def _convert_id_to_token(self, index: int) -> Optional[str]:\n        return self._tokenizer.id_to_token(int(index))\n\n    def get_vocab(self):\n        return self._tokenizer.get_vocab(True)\n\n    def convert_tokens_to_string(self, tokens: List[int], skip_special_tokens: bool = False) -> str:\n        return self._tokenizer.decode(tokens, skip_special_tokens)\n\n    def add_tokens(self, new_tokens: List[Union[str, AddedTokenFast]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string or AddedTokenFast. Each string is a token to add.\n            Tokens are only added if they are not already in the vocabulary. AddedTokenFast wrap a string token to let you personnalize it's behavior (Whether this token should only match against single word, whether this token should strip all potential whitespaces on the left side, Whether this token should strip all potential whitespaces on the right side...).\n            See details for AddedToken in HuggingFace tokenizers library.\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n        \"\"\"\n        if isinstance(new_tokens, str):\n            new_tokens = [new_tokens]\n        return self._tokenizer.add_tokens(new_tokens)\n\n    def add_special_tokens(self, special_tokens_dict: dict) -> int:\n        # Map special tokens to class attributes (self.pad_token...)\n        super().add_special_tokens(special_tokens_dict)\n\n        # If the backend tokenizer the only specificities of special tokens are that\n        #    - they will never be processed by the model, and\n        #    - they will be removed while decoding.\n        # But they are not mapped to special attributes in the backend so we can just\n        # send a list.\n        tokens = []\n        for token in special_tokens_dict.values():\n            if isinstance(token, list):\n                tokens += token\n            else:\n                tokens += [token]\n        num_added_tokens = self._tokenizer.add_special_tokens(tokens)\n\n        return num_added_tokens\n\n    def num_special_tokens_to_add(self, pair: bool = False) -> int:\n        return self._tokenizer.num_special_tokens_to_add(pair)\n\n    def tokenize(\n        self, text: TextInput, pair: Optional[TextInput] = None, add_special_tokens: bool = False\n    ) -> List[str]:\n        return self._tokenizer.encode(text, pair, add_special_tokens).tokens\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        if not isinstance(batch_text_or_text_pairs, list):\n            raise ValueError(\n                \"batch_text_or_text_pairs has to be a list (got {})\".format(type(batch_text_or_text_pairs))\n            )\n\n        # Needed if we have to return a tensor\n        pad_to_max_length = pad_to_max_length or (return_tensors is not None and len(batch_text_or_text_pairs) > 1)\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\"Unable to set proper padding strategy as the tokenizer does not have a padding token\")\n\n        # Set the truncation and padding strategy and restore the initial configuration\n        with truncate_and_pad(\n            tokenizer=self._tokenizer,\n            max_length=max_length,\n            stride=stride,\n            strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            padding_side=self.padding_side,\n            pad_token_id=self.pad_token_id,\n            pad_token_type_id=self.pad_token_type_id,\n            pad_token=self._pad_token,\n        ):\n\n            # Check for the pretokenized path\n            if is_pretokenized:\n                encodings = []\n\n                # Iterate over each sample (we don't know yet if they are pairs or 
simple input\n                for i, sample in enumerate(batch_text_or_text_pairs):\n\n                    if not isinstance(sample, (list, tuple)):\n                        raise TypeError(\n                            \"batch_encode_plus(..., is_pretokenized=True) requires batch_text_or_text_pairs \"\n                            \"to be either List[List[str]] or List[Tuple[List[str], List[str]]] but sample at \"\n                            \"index {} is of type {}\".format(i, type(sample))\n                        )\n\n                    # Test if we have a pair of sentences by checking the depth of nesting\n                    is_pair = bool(len(sample) > 0 and isinstance(sample[0], (list, tuple)))\n\n                    # Take care of the first sequence - we multi-thread over the words\n                    encodings_text = EncodingFast.merge(\n                        self._tokenizer.encode_batch(sample[0] if is_pair else sample, add_special_tokens=False),\n                        growing_offsets=True,\n                    )\n\n                    # Take care of the second sequence if we have a pair\n                    if is_pair:\n                        encodings_pair = EncodingFast.merge(\n                            self._tokenizer.encode_batch([(\"\", s) for s in sample[1]], add_special_tokens=False),\n                            growing_offsets=True,\n                        )\n                    else:\n                        encodings_pair = None\n\n                    # Post-process - truncate/pad and add special tokens\n                    encoding = self._tokenizer.post_process(encodings_text, encodings_pair, add_special_tokens)\n                    encodings.append(encoding)\n\n            # Classical path with strings input\n            else:\n                # Avoid thread overhead if only one example.\n                if len(batch_text_or_text_pairs) == 1:\n                    if isinstance(batch_text_or_text_pairs[0], (tuple, list)):\n                        encodings = self._tokenizer.encode(\n                            *batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    else:\n                        encodings = self._tokenizer.encode(\n                            batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    encodings = [encodings]\n                else:\n                    encodings = self._tokenizer.encode_batch(\n                        batch_text_or_text_pairs, add_special_tokens=add_special_tokens\n                    )\n\n        # Convert encoding to dict\n        # `Tokens` has type: List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]]\n        # with nested dimensions corresponding to batch, overflows, sequence length\n        tokens = [\n            self._convert_encoding(\n                encoding=encoding,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n            )\n            for encoding in encodings\n        ]\n\n        # Sanitize the output to have dict[list] from list[dict]\n        sanitized = {}\n        for key in tokens[0].keys():\n            # To List[List[List[int]]] of 
shape (batch, overflows, sequence length)\n            stack = [e for item in tokens for e in item[key]]\n            if return_tensors == \"tf\":\n                stack = tf.stack(stack, axis=0)\n            elif return_tensors == \"pt\":\n                stack = torch.stack(stack, dim=0)\n            # elif not return_tensors and len(stack) == 1:\n            #     stack = stack[0]\n\n            sanitized[key] = stack\n\n        # If returning overflowing tokens, we need to return a mapping\n        # from the batch idx to the original sample\n        if return_overflowing_tokens:\n            overflow_to_sample_mapping = flatten([[i] * len(enc[\"input_ids\"]) for i, enc in enumerate(tokens)])\n            sanitized[\"overflow_to_sample_mapping\"] = overflow_to_sample_mapping\n\n        return BatchEncoding(sanitized, encodings)\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = False,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        is_pretokenized: bool = False,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        # Check for pretokenized path (ie [token1, token2, ..., tokenN] -> [id1, id2, ..., idN]\n        if is_pretokenized:\n            if isinstance(text, list) and len(text) > 0:\n\n                # Encode through encode_batch with sequence of only one word which will be merged after hand\n                encoding = self._tokenizer.encode_batch(text, add_special_tokens=False)\n                encoding = EncodingFast.merge(encoding, growing_offsets=True)\n\n                # Let's do the same for pairs if provided\n                if isinstance(text_pair, list):\n                    # We prepend empty string before each word so that encoding is aware content is a pair\n                    encoding_pair = self._tokenizer.encode_batch(\n                        [(\"\", p) for p in text_pair], add_special_tokens=False\n                    )\n                    encoding_pair = EncodingFast.merge(encoding_pair, growing_offsets=True)\n                elif text_pair is None:\n                    encoding_pair = None\n                else:\n                    raise TypeError(\n                        \"encode_plus(..., is_pretokenized=True) requires text and text_pair to be List[str] \"\n                        \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                    )\n\n                # Post process and if asked to do so, insert special tokens where needed\n                encoding = self._tokenizer.post_process(encoding, encoding_pair, add_special_tokens)\n\n                batched_output = BatchEncoding(\n                    self._convert_encoding(\n                        encoding,\n                        return_tensors=return_tensors,\n                        return_token_type_ids=return_token_type_ids,\n                        return_attention_mask=return_attention_mask,\n                        return_overflowing_tokens=return_overflowing_tokens,\n                       
 return_special_tokens_mask=return_special_tokens_mask,\n                        return_offsets_mapping=return_offsets_mapping,\n                    ),\n                    encoding,\n                )\n            else:\n                raise TypeError(\n                    \"encode_plus(..., is_pretokenized=True) requires text to be List[str] \"\n                    \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                )\n        else:\n            batched_input = [(text, text_pair)] if text_pair else [text]\n            batched_output = self.batch_encode_plus(\n                batched_input,\n                add_special_tokens=add_special_tokens,\n                max_length=max_length,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n                pad_to_max_length=pad_to_max_length,\n                **kwargs,\n            )\n\n        # Return tensor is None, then we can remove the leading batch axis\n        if not return_tensors:\n            batched_output = BatchEncoding(\n                {\n                    key: value[0] if len(value) > 0 and isinstance(value[0], list) else value\n                    for key, value in batched_output.items()\n                },\n                batched_output.encodings,\n            )\n\n        return batched_output\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        text = self._tokenizer.decode(token_ids, skip_special_tokens)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        if os.path.isdir(save_directory):\n            files = self._tokenizer.save(save_directory)\n        else:\n            folder, file = os.path.split(os.path.abspath(save_directory))\n            files = self._tokenizer.save(folder, name=file)\n\n        return tuple(files)\n\n\ndef trim_batch(\n    input_ids, pad_token_id, attention_mask=None,\n):\n    \"\"\"Remove columns that are populated exclusively by pad_token_id\"\"\"\n    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)\n    if attention_mask is None:\n        return input_ids[:, keep_column_mask]\n    else:\n        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for XLM.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\nimport sys\nimport unicodedata\nfrom typing import List, Optional\n\nimport sacremoses as sm\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-vocab.json\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-vocab.json\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-vocab.json\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-vocab.json\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-vocab.json\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-vocab.json\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-vocab.json\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-vocab.json\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-vocab.json\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-vocab.json\",\n    },\n    \"merges_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-merges.txt\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-merges.txt\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-merges.txt\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-merges.txt\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-merges.txt\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-merges.txt\",\n    
},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-mlm-en-2048\": 512,\n    \"xlm-mlm-ende-1024\": 512,\n    \"xlm-mlm-enfr-1024\": 512,\n    \"xlm-mlm-enro-1024\": 512,\n    \"xlm-mlm-tlm-xnli15-1024\": 512,\n    \"xlm-mlm-xnli15-1024\": 512,\n    \"xlm-clm-enfr-1024\": 512,\n    \"xlm-clm-ende-1024\": 512,\n    \"xlm-mlm-17-1280\": 512,\n    \"xlm-mlm-100-1280\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"xlm-mlm-en-2048\": {\"do_lowercase_and_remove_accent\": True},\n    \"xlm-mlm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-mlm-enro-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"ro\"},\n        \"lang2id\": {\"en\": 0, \"ro\": 1},\n    },\n    \"xlm-mlm-tlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-mlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-clm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-clm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-17-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"de\",\n            \"2\": \"en\",\n            
\"3\": \"es\",\n            \"4\": \"fr\",\n            \"5\": \"hi\",\n            \"6\": \"it\",\n            \"7\": \"ja\",\n            \"8\": \"ko\",\n            \"9\": \"nl\",\n            \"10\": \"pl\",\n            \"11\": \"pt\",\n            \"12\": \"ru\",\n            \"13\": \"sv\",\n            \"14\": \"tr\",\n            \"15\": \"vi\",\n            \"16\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"de\": 1,\n            \"en\": 2,\n            \"es\": 3,\n            \"fr\": 4,\n            \"hi\": 5,\n            \"it\": 6,\n            \"ja\": 7,\n            \"ko\": 8,\n            \"nl\": 9,\n            \"pl\": 10,\n            \"pt\": 11,\n            \"ru\": 12,\n            \"sv\": 13,\n            \"tr\": 14,\n            \"vi\": 15,\n            \"zh\": 16,\n        },\n    },\n    \"xlm-mlm-100-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"af\",\n            \"1\": \"als\",\n            \"2\": \"am\",\n            \"3\": \"an\",\n            \"4\": \"ang\",\n            \"5\": \"ar\",\n            \"6\": \"arz\",\n            \"7\": \"ast\",\n            \"8\": \"az\",\n            \"9\": \"bar\",\n            \"10\": \"be\",\n            \"11\": \"bg\",\n            \"12\": \"bn\",\n            \"13\": \"br\",\n            \"14\": \"bs\",\n            \"15\": \"ca\",\n            \"16\": \"ceb\",\n            \"17\": \"ckb\",\n            \"18\": \"cs\",\n            \"19\": \"cy\",\n            \"20\": \"da\",\n            \"21\": \"de\",\n            \"22\": \"el\",\n            \"23\": \"en\",\n            \"24\": \"eo\",\n            \"25\": \"es\",\n            \"26\": \"et\",\n            \"27\": \"eu\",\n            \"28\": \"fa\",\n            \"29\": \"fi\",\n            \"30\": \"fr\",\n            \"31\": \"fy\",\n            \"32\": \"ga\",\n            \"33\": \"gan\",\n            \"34\": \"gl\",\n            \"35\": \"gu\",\n            \"36\": \"he\",\n            \"37\": \"hi\",\n            \"38\": \"hr\",\n            \"39\": \"hu\",\n            \"40\": \"hy\",\n            \"41\": \"ia\",\n            \"42\": \"id\",\n            \"43\": \"is\",\n            \"44\": \"it\",\n            \"45\": \"ja\",\n            \"46\": \"jv\",\n            \"47\": \"ka\",\n            \"48\": \"kk\",\n            \"49\": \"kn\",\n            \"50\": \"ko\",\n            \"51\": \"ku\",\n            \"52\": \"la\",\n            \"53\": \"lb\",\n            \"54\": \"lt\",\n            \"55\": \"lv\",\n            \"56\": \"mk\",\n            \"57\": \"ml\",\n            \"58\": \"mn\",\n            \"59\": \"mr\",\n            \"60\": \"ms\",\n            \"61\": \"my\",\n            \"62\": \"nds\",\n            \"63\": \"ne\",\n            \"64\": \"nl\",\n            \"65\": \"nn\",\n            \"66\": \"no\",\n            \"67\": \"oc\",\n            \"68\": \"pl\",\n            \"69\": \"pt\",\n            \"70\": \"ro\",\n            \"71\": \"ru\",\n            \"72\": \"scn\",\n            \"73\": \"sco\",\n            \"74\": \"sh\",\n            \"75\": \"si\",\n            \"76\": \"simple\",\n            \"77\": \"sk\",\n            \"78\": \"sl\",\n            \"79\": \"sq\",\n            \"80\": \"sr\",\n            \"81\": \"sv\",\n            \"82\": \"sw\",\n            \"83\": \"ta\",\n            \"84\": \"te\",\n            \"85\": \"th\",\n            \"86\": \"tl\",\n            \"87\": \"tr\",\n            \"88\": \"tt\",\n      
      \"89\": \"uk\",\n            \"90\": \"ur\",\n            \"91\": \"uz\",\n            \"92\": \"vi\",\n            \"93\": \"war\",\n            \"94\": \"wuu\",\n            \"95\": \"yi\",\n            \"96\": \"zh\",\n            \"97\": \"zh_classical\",\n            \"98\": \"zh_min_nan\",\n            \"99\": \"zh_yue\",\n        },\n        \"lang2id\": {\n            \"af\": 0,\n            \"als\": 1,\n            \"am\": 2,\n            \"an\": 3,\n            \"ang\": 4,\n            \"ar\": 5,\n            \"arz\": 6,\n            \"ast\": 7,\n            \"az\": 8,\n            \"bar\": 9,\n            \"be\": 10,\n            \"bg\": 11,\n            \"bn\": 12,\n            \"br\": 13,\n            \"bs\": 14,\n            \"ca\": 15,\n            \"ceb\": 16,\n            \"ckb\": 17,\n            \"cs\": 18,\n            \"cy\": 19,\n            \"da\": 20,\n            \"de\": 21,\n            \"el\": 22,\n            \"en\": 23,\n            \"eo\": 24,\n            \"es\": 25,\n            \"et\": 26,\n            \"eu\": 27,\n            \"fa\": 28,\n            \"fi\": 29,\n            \"fr\": 30,\n            \"fy\": 31,\n            \"ga\": 32,\n            \"gan\": 33,\n            \"gl\": 34,\n            \"gu\": 35,\n            \"he\": 36,\n            \"hi\": 37,\n            \"hr\": 38,\n            \"hu\": 39,\n            \"hy\": 40,\n            \"ia\": 41,\n            \"id\": 42,\n            \"is\": 43,\n            \"it\": 44,\n            \"ja\": 45,\n            \"jv\": 46,\n            \"ka\": 47,\n            \"kk\": 48,\n            \"kn\": 49,\n            \"ko\": 50,\n            \"ku\": 51,\n            \"la\": 52,\n            \"lb\": 53,\n            \"lt\": 54,\n            \"lv\": 55,\n            \"mk\": 56,\n            \"ml\": 57,\n            \"mn\": 58,\n            \"mr\": 59,\n            \"ms\": 60,\n            \"my\": 61,\n            \"nds\": 62,\n            \"ne\": 63,\n            \"nl\": 64,\n            \"nn\": 65,\n            \"no\": 66,\n            \"oc\": 67,\n            \"pl\": 68,\n            \"pt\": 69,\n            \"ro\": 70,\n            \"ru\": 71,\n            \"scn\": 72,\n            \"sco\": 73,\n            \"sh\": 74,\n            \"si\": 75,\n            \"simple\": 76,\n            \"sk\": 77,\n            \"sl\": 78,\n            \"sq\": 79,\n            \"sr\": 80,\n            \"sv\": 81,\n            \"sw\": 82,\n            \"ta\": 83,\n            \"te\": 84,\n            \"th\": 85,\n            \"tl\": 86,\n            \"tr\": 87,\n            \"tt\": 88,\n            \"uk\": 89,\n            \"ur\": 90,\n            \"uz\": 91,\n            \"vi\": 92,\n            \"war\": 93,\n            \"wuu\": 94,\n            \"yi\": 95,\n            \"zh\": 96,\n            \"zh_classical\": 97,\n            \"zh_min_nan\": 98,\n            \"zh_yue\": 99,\n        },\n    },\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef lowercase_and_remove_accent(text):\n    \"\"\"\n    Lowercase and strips accents from a piece of text based on\n    https://github.com/facebookresearch/XLM/blob/master/tools/lowercase_and_remove_accent.py\n    \"\"\"\n    text = \" \".join(text)\n    text = text.lower()\n    text = 
unicodedata.normalize(\"NFD\", text)\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat == \"Mn\":\n            continue\n        output.append(char)\n    return \"\".join(output).lower().split(\" \")\n\n\ndef replace_unicode_punct(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/replace-unicode-punctuation.perl\n    \"\"\"\n    text = text.replace(\"，\", \",\")\n    text = re.sub(r\"。\\s*\", \". \", text)\n    text = text.replace(\"、\", \",\")\n    text = text.replace(\"”\", '\"')\n    text = text.replace(\"“\", '\"')\n    text = text.replace(\"∶\", \":\")\n    text = text.replace(\"：\", \":\")\n    text = text.replace(\"？\", \"?\")\n    text = text.replace(\"《\", '\"')\n    text = text.replace(\"》\", '\"')\n    text = text.replace(\"）\", \")\")\n    text = text.replace(\"！\", \"!\")\n    text = text.replace(\"（\", \"(\")\n    text = text.replace(\"；\", \";\")\n    text = text.replace(\"１\", \"1\")\n    text = text.replace(\"」\", '\"')\n    text = text.replace(\"「\", '\"')\n    text = text.replace(\"０\", \"0\")\n    text = text.replace(\"３\", \"3\")\n    text = text.replace(\"２\", \"2\")\n    text = text.replace(\"５\", \"5\")\n    text = text.replace(\"６\", \"6\")\n    text = text.replace(\"９\", \"9\")\n    text = text.replace(\"７\", \"7\")\n    text = text.replace(\"８\", \"8\")\n    text = text.replace(\"４\", \"4\")\n    text = re.sub(r\"．\\s*\", \". \", text)\n    text = text.replace(\"～\", \"~\")\n    text = text.replace(\"’\", \"'\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"━\", \"-\")\n    text = text.replace(\"〈\", \"<\")\n    text = text.replace(\"〉\", \">\")\n    text = text.replace(\"【\", \"[\")\n    text = text.replace(\"】\", \"]\")\n    text = text.replace(\"％\", \"%\")\n    return text\n\n\ndef remove_non_printing_char(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/remove-non-printing-char.perl\n    \"\"\"\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat.startswith(\"C\"):\n            continue\n        output.append(char)\n    return \"\".join(output)\n\n\ndef romanian_preprocessing(text):\n    \"\"\"Sennrich's WMT16 scripts for Romanian preprocessing, used by model `xlm-mlm-enro-1024`\"\"\"\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/normalise-romanian.py\n    text = text.replace(\"\\u015e\", \"\\u0218\").replace(\"\\u015f\", \"\\u0219\")\n    text = text.replace(\"\\u0162\", \"\\u021a\").replace(\"\\u0163\", \"\\u021b\")\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/remove-diacritics.py\n    text = text.replace(\"\\u0218\", \"S\").replace(\"\\u0219\", \"s\")  # s-comma\n    text = text.replace(\"\\u021a\", \"T\").replace(\"\\u021b\", \"t\")  # t-comma\n    text = text.replace(\"\\u0102\", \"A\").replace(\"\\u0103\", \"a\")\n    text = text.replace(\"\\u00C2\", \"A\").replace(\"\\u00E2\", \"a\")\n    text = text.replace(\"\\u00CE\", \"I\").replace(\"\\u00EE\", \"i\")\n    return text\n\n\nclass XLMTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer for XLM\n\n    - Moses preprocessing & tokenization for most supported languages\n    - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP)\n    - (optionally) lower case & normalize all inputs text\n    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add 
additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `lang2id` attribute maps the languages supported by the model with their ids if provided (automatically set for pretrained vocabularies)\n    - `id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            Vocabulary file.\n        merges_file (:obj:`string`):\n            Merges file.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<special1>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<special0>\",\"<special1>\",\"<special2>\",\"<special3>\",\"<special4>\",\"<special5>\",\"<special6>\",\"<special7>\",\"<special8>\",\"<special9>\"]`):\n            List of additional special tokens.\n        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping languages string identifiers to their IDs.\n        id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping language IDs to their string identifiers.\n        do_lowercase_and_remove_accent (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase and remove accents when tokenizing.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<unk>\",\n        bos_token=\"<s>\",\n        sep_token=\"</s>\",\n        pad_token=\"<pad>\",\n        cls_token=\"</s>\",\n        mask_token=\"<special1>\",\n        additional_special_tokens=[\n            \"<special0>\",\n            \"<special1>\",\n            \"<special2>\",\n            \"<special3>\",\n            \"<special4>\",\n            \"<special5>\",\n            \"<special6>\",\n            \"<special7>\",\n            \"<special8>\",\n            \"<special9>\",\n        ],\n        lang2id=None,\n        id2lang=None,\n        do_lowercase_and_remove_accent=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            bos_token=bos_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        # cache of sm.MosesPunctNormalizer instance\n        self.cache_moses_punct_normalizer = dict()\n        # cache of sm.MosesTokenizer instance\n        self.cache_moses_tokenizer = dict()\n        self.lang_with_custom_tokenizer = set([\"zh\", \"th\", \"ja\"])\n        # True for current supported model (v1.2.0), False for XLM-17 & 100\n        self.do_lowercase_and_remove_accent = do_lowercase_and_remove_accent\n        self.lang2id = lang2id\n        self.id2lang = id2lang\n        if lang2id is not None and id2lang is not None:\n            assert len(lang2id) == len(id2lang)\n\n        self.ja_word_tokenizer = None\n        self.zh_word_tokenizer = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[:-1]\n        merges = [tuple(merge.split()[:2]) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    def moses_punct_norm(self, text, lang):\n        if lang not in self.cache_moses_punct_normalizer:\n            punct_normalizer = sm.MosesPunctNormalizer(lang=lang)\n            self.cache_moses_punct_normalizer[lang] = punct_normalizer\n        else:\n            punct_normalizer 
= self.cache_moses_punct_normalizer[lang]\n        return punct_normalizer.normalize(text)\n\n    def moses_tokenize(self, text, lang):\n        if lang not in self.cache_moses_tokenizer:\n            moses_tokenizer = sm.MosesTokenizer(lang=lang)\n            self.cache_moses_tokenizer[lang] = moses_tokenizer\n        else:\n            moses_tokenizer = self.cache_moses_tokenizer[lang]\n        return moses_tokenizer.tokenize(text, return_str=False, escape=False)\n\n    def moses_pipeline(self, text, lang):\n        text = replace_unicode_punct(text)\n        text = self.moses_punct_norm(text, lang)\n        text = remove_non_printing_char(text)\n        return text\n\n    def ja_tokenize(self, text):\n        if self.ja_word_tokenizer is None:\n            try:\n                import Mykytea\n\n                self.ja_word_tokenizer = Mykytea.Mykytea(\n                    \"-model %s/local/share/kytea/model.bin\" % os.path.expanduser(\"~\")\n                )\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install KyTea (https://github.com/neubig/kytea) and it's python wrapper (https://github.com/chezou/Mykytea-python) with the following steps\"\n                )\n                logger.error(\"1. git clone git@github.com:neubig/kytea.git && cd kytea\")\n                logger.error(\"2. autoreconf -i\")\n                logger.error(\"3. ./configure --prefix=$HOME/local\")\n                logger.error(\"4. make && make install\")\n                logger.error(\"5. pip install kytea\")\n                raise\n        return list(self.ja_word_tokenizer.getWS(text))\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text, lang=\"en\", bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code. For Chinese, Japanese and Thai, we use a language specific tokenizerself. 
Otherwise, we use Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n        - [pythainlp](https://github.com/PyThaiNLP/pythainlp): Thai tokenizer\n            - Install with `pip install pythainlp`\n        - [kytea](https://github.com/chezou/Mykytea-python): Japanese tokenizer, wrapper of [KyTea](https://github.com/neubig/kytea)\n            - Install with the following steps:\n            ```\n            git clone git@github.com:neubig/kytea.git && cd kytea\n            autoreconf -i\n            ./configure --prefix=$HOME/local\n            make && make install\n            pip install kytea\n            ```\n        - [jieba](https://github.com/fxsjy/jieba): Chinese tokenizer (*)\n            - Install with `pip install jieba`\n\n        (*) The original XLM used [Stanford Segmenter](https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip).\n        However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead, and it will be deprecated.\n        Jieba is a lot faster and pip-installable. Note there is some mismatch with the Stanford Segmenter. It should be fine\n        if you fine-tune the model with Chinese supervisionself. If you want the same exact behaviour, use the original XLM\n        [preprocessing script](https://github.com/facebookresearch/XLM/tree/master/tools) to tokenize the sentence externally,\n        and set `bypass_tokenizer=True` to bypass the tokenizer.\n\n        Args:\n            - lang: ISO language code (default = 'en') (string). Languages should belong of the model supported languages. However, we don't enforce it.\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n        if bypass_tokenizer:\n            text = text.split()\n        elif lang not in self.lang_with_custom_tokenizer:\n            text = self.moses_pipeline(text, lang=lang)\n            # TODO: make sure we are using `xlm-mlm-enro-1024`, since XLM-100 doesn't have this step\n            if lang == \"ro\":\n                text = romanian_preprocessing(text)\n            text = self.moses_tokenize(text, lang=lang)\n        elif lang == \"th\":\n            text = self.moses_pipeline(text, lang=lang)\n            try:\n                if \"pythainlp\" not in sys.modules:\n                    from pythainlp.tokenize import word_tokenize as th_word_tokenize\n                else:\n                    th_word_tokenize = sys.modules[\"pythainlp\"].word_tokenize\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install PyThaiNLP (https://github.com/PyThaiNLP/pythainlp) with the following steps\"\n                )\n                logger.error(\"1. 
pip install pythainlp\")\n                raise\n            text = th_word_tokenize(text)\n        elif lang == \"zh\":\n            try:\n                if \"jieba\" not in sys.modules:\n                    import jieba\n                else:\n                    jieba = sys.modules[\"jieba\"]\n            except (AttributeError, ImportError):\n                logger.error(\"Make sure you install Jieba (https://github.com/fxsjy/jieba) with the following steps\")\n                logger.error(\"1. pip install jieba\")\n                raise\n            text = \" \".join(jieba.cut(text))\n            text = self.moses_pipeline(text, lang=lang)\n            text = text.split()\n        elif lang == \"ja\":\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.ja_tokenize(text)\n        else:\n            raise ValueError(\"It should not reach here\")\n\n        if self.do_lowercase_and_remove_accent and not bypass_tokenizer:\n            text = lowercase_and_remove_accent(text)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n\n        \"\"\"\n        bos = [self.bos_token_id]\n        sep = [self.sep_token_id]\n\n        if token_ids_1 is None:\n            return bos + token_ids_0 + sep\n        return bos + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0,))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLM sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != 
token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for XLM-RoBERTa model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-sentencepiece.bpe.model\",\n        \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-roberta-base\": 512,\n    \"xlm-roberta-large\": 512,\n    \"xlm-roberta-large-finetuned-conll02-dutch\": 512,\n    \"xlm-roberta-large-finetuned-conll02-spanish\": 512,\n    \"xlm-roberta-large-finetuned-conll03-english\": 512,\n    \"xlm-roberta-large-finetuned-conll03-german\": 512,\n}\n\n\nclass XLMRobertaTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. 
note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n\n        # Original fairseq vocab and spm vocab must be \"aligned\":\n        # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9\n        # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----\n        # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' 
| '▁' | 's'   | '▁de' | '-'\n        # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'\n\n        # Mimic fairseq token-to-id alignment for the first 4 token\n        self.fairseq_tokens_to_ids = {\"<s>\": 0, \"<pad>\": 1, \"</s>\": 2, \"<unk>\": 3}\n\n        # The first \"real\" token \",\" has position 4 in the original fairseq vocab and position 3 in the spm vocab\n        self.fairseq_offset = 1\n\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + self.fairseq_offset\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM-R sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        XLM-R does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        spm_id = self.sp_model.PieceToId(token)\n\n        # Need to return unknown token if the SP model returned 0\n        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/tokenization_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for XLNet model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model\",\n        \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlnet-base-cased\": None,\n    \"xlnet-large-cased\": None,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n# Segments (not really needed)\nSEG_ID_A = 0\nSEG_ID_B = 1\nSEG_ID_CLS = 2\nSEG_ID_SEP = 3\nSEG_ID_PAD = 4\n\n\nclass XLNetTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"<sep>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"<cls>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<eop>\", \"<eod>\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    padding_side = \"left\"\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        sep_token=\"<sep>\",\n        pad_token=\"<pad>\",\n        cls_token=\"<cls>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<eop>\", \"<eod>\"],\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        self._pad_token_type_id = 3\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = 
self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. \"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An XLNet sequence has the following format:\n\n        - single sequence: ``X <sep> <cls>``\n        - pair of sequences: ``A <sep> B <sep> <cls>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return token_ids_0 + sep + cls\n        return token_ids_0 + sep + token_ids_1 + sep + cls\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1, 1]\n        return ([0] * len(token_ids_0)) + [1, 1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLNet sequence pair mask has the following format:\n        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2\n        | first sequence    | second sequence     | CLS segment ID\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls_segment_id = [2]\n\n        if token_ids_1 is None:\n            return len(token_ids_0 + sep) * [0] + cls_segment_id\n        return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/trainer.py",
    "content": "import json\nimport logging\nimport math\nimport os\nimport random\nimport re\nimport shutil\nfrom contextlib import contextmanager\nfrom pathlib import Path\nfrom typing import Callable, Dict, List, Optional, Tuple\nimport time\nimport numpy as np\nimport torch\nfrom packaging import version\nfrom torch import nn\nfrom torch.utils.data.dataloader import DataLoader\nfrom torch.utils.data.dataset import Dataset\nfrom torch.utils.data.distributed import DistributedSampler\nfrom torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler\nfrom tqdm.auto import tqdm, trange\n\nfrom .data.data_collator import DataCollator, DefaultDataCollator\nfrom transformers.modeling_utils import PreTrainedModel\nfrom .optimization import AdamW\nfrom transformers import get_polynomial_decay_schedule_with_warmup#需要新版才有\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput\nfrom .training_args import TrainingArguments, is_tpu_available\n\n\ntry:\n    from apex import amp\n\n    _has_apex = True\nexcept ImportError:\n    _has_apex = False\n\n\ndef is_apex_available():\n    return _has_apex\n\n\nif is_tpu_available():\n    import torch_xla.core.xla_model as xm\n    import torch_xla.debug.metrics as met\n    import torch_xla.distributed.parallel_loader as pl\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\n\n    _has_tensorboard = True\nexcept ImportError:\n    try:\n        from tensorboardX import SummaryWriter\n\n        _has_tensorboard = True\n    except ImportError:\n        _has_tensorboard = False\n\n\ndef is_tensorboard_available():\n    return _has_tensorboard\n\n\ntry:\n    import wandb\n\n    wandb.ensure_configured()\n    if wandb.api.api_key is None:\n        _has_wandb = False\n        wandb.termwarn(\"W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.\")\n    else:\n        _has_wandb = False if os.getenv(\"WANDB_DISABLED\") else True\nexcept ImportError:\n    _has_wandb = False\n\n\ndef is_wandb_available():\n    return _has_wandb\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef set_seed(seed: int):\n    random.seed(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    torch.cuda.manual_seed_all(seed)\n    # ^^ safe to call this function even if cuda is not available\n\n\n@contextmanager\ndef torch_distributed_zero_first(local_rank: int):\n    \"\"\"\n    Decorator to make all processes in distributed training wait for each local_master to do something.\n    \"\"\"\n    if local_rank not in [-1, 0]:\n        torch.distributed.barrier()\n    yield\n    if local_rank == 0:\n        torch.distributed.barrier()\n\n\nclass SequentialDistributedSampler(Sampler):\n    \"\"\"\n    Distributed Sampler that subsamples indicies sequentially,\n    making it easier to collate all results at the end.\n\n    Even though we only use this sampler for eval and predict (no training),\n    which means that the model params won't have to be synced (i.e. 
will not hang\n    for synchronization even if varied number of forward passes), we still add extra\n    samples to the sampler to make it evenly divisible (like in `DistributedSampler`)\n    to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.\n    \"\"\"\n\n    def __init__(self, dataset, num_replicas=None, rank=None):\n        if num_replicas is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            num_replicas = torch.distributed.get_world_size()\n        if rank is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            rank = torch.distributed.get_rank()\n        self.dataset = dataset\n        self.num_replicas = num_replicas\n        self.rank = rank\n        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))\n        self.total_size = self.num_samples * self.num_replicas\n\n    def __iter__(self):\n        indices = list(range(len(self.dataset)))\n\n        # add extra samples to make it evenly divisible\n        indices += indices[: (self.total_size - len(indices))]\n        assert len(indices) == self.total_size\n\n        # subsample\n        indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]\n        assert len(indices) == self.num_samples\n\n        return iter(indices)\n\n    def __len__(self):\n        return self.num_samples\n\n\ndef get_tpu_sampler(dataset: Dataset):\n    if xm.xrt_world_size() <= 1:\n        return RandomSampler(dataset)\n    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())\n\n\nclass Trainer:\n    \"\"\"\n    Trainer is a simple but feature-complete training and eval loop for PyTorch,\n    optimized for Transformers.\n    \"\"\"\n\n    model: PreTrainedModel\n    args: TrainingArguments\n    train_dataset: Optional[Dataset]\n    eval_dataset: Optional[Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n    tb_writer: Optional[\"SummaryWriter\"] = None\n    optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None\n    global_step: Optional[int] = None\n    epoch: Optional[float] = None\n\n    def __init__(\n        self,\n        model: PreTrainedModel,\n        args: TrainingArguments,\n        train_dataLoader: Optional[DataLoader] = None,\n        eval_dataLoader: Optional[DataLoader] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n        tb_writer: Optional[\"SummaryWriter\"] = None,\n        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,\n    ):\n        \"\"\"\n        Trainer is a simple but feature-complete training and eval loop for PyTorch,\n        optimized for Transformers.\n\n        Args:\n            prediction_loss_only:\n                (Optional) in evaluation and prediction, only return the loss\n        \"\"\"\n        self.model = model.to(args.device)\n        self.args = args\n\n        self.train_dataLoader = train_dataLoader\n        self.eval_dataLoader = eval_dataLoader\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.optimizers = optimizers\n        if tb_writer is not None:\n            self.tb_writer = tb_writer\n        
elif is_tensorboard_available() and self.is_world_master():\n            self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)\n        if not is_tensorboard_available():\n            logger.warning(\n                \"You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.\"\n            )\n        if is_wandb_available():\n            self._setup_wandb()\n        else:\n            logger.info(\n                \"You are instantiating a Trainer but W&B is not installed. To use wandb logging, \"\n                \"run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.\"\n            )\n        set_seed(self.args.seed)\n        # Create output directory if needed\n        if self.is_world_master():\n            os.makedirs(self.args.output_dir, exist_ok=True)\n        if is_tpu_available():\n            # Set an xla_device flag on the model's config.\n            # We'll find a more elegant and not need to do this in the future.\n            self.model.config.xla_device = True\n\n    def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:\n        # We use the same batch_size as for eval.\n        if is_tpu_available():\n            sampler = SequentialDistributedSampler(\n                test_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()\n            )\n        elif self.args.local_rank != -1:\n            sampler = SequentialDistributedSampler(test_dataset)\n        else:\n            sampler = SequentialSampler(test_dataset)\n\n        data_loader = DataLoader(\n            test_dataset,\n            sampler=sampler,\n            batch_size=self.args.eval_batch_size,\n\n        )\n\n        return data_loader\n\n    def get_optimizers(\n        self, num_training_steps: int\n    ) -> Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]:\n        \"\"\"\n        Setup the optimizer and the learning rate scheduler.\n\n        We provide a reasonable default that works well.\n        If you want to use something else, you can pass a tuple in the Trainer's init,\n        or override this method in a subclass.\n        \"\"\"\n        if self.optimizers is not None:\n            return self.optimizers\n        # Prepare optimizer and schedule (linear warmup and decay)\n        no_decay = [\"bias\", \"LayerNorm.weight\"]\n        optimizer_grouped_parameters = [\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],\n                \"weight_decay\": self.args.weight_decay,\n            },\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],\n                \"weight_decay\": 0.0,\n            },\n        ]\n\n        optimizer = AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon)\n        scheduler = get_polynomial_decay_schedule_with_warmup(\n            optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=num_training_steps,lr_end=self.args.lr_end\n        )\n        return optimizer, scheduler\n\n    def _setup_wandb(self):\n        \"\"\"\n        Setup the optional Weights & Biases (`wandb`) integration.\n\n        One can override this method to customize the setup if needed.  
Find more information at https://docs.wandb.com/huggingface\n        You can also override the following environment variables:\n\n        Environment:\n            WANDB_WATCH:\n                (Optional, [\"gradients\", \"all\", \"false\"]) \"gradients\" by default, set to \"false\" to disable gradient logging\n                or \"all\" to log gradients and parameters\n            WANDB_PROJECT:\n                (Optional): str - \"huggingface\" by default, set this to a custom string to store results in a different project\n            WANDB_DISABLED:\n                (Optional): boolean - defaults to false, set to \"true\" to disable wandb entirely\n        \"\"\"\n        logger.info('Automatic Weights & Biases logging enabled, to disable set os.environ[\"WANDB_DISABLED\"] = \"true\"')\n        wandb.init(project=os.getenv(\"WANDB_PROJECT\", \"huggingface\"), config=vars(self.args))\n        # keep track of model topology and gradients\n        if os.getenv(\"WANDB_WATCH\") != \"false\":\n            wandb.watch(\n                self.model, log=os.getenv(\"WANDB_WATCH\", \"gradients\"), log_freq=max(100, self.args.logging_steps)\n            )\n\n    def num_examples(self, dataloader: DataLoader) -> int:\n        \"\"\"\n        Helper to get num of examples from a DataLoader, by accessing its Dataset.\n        \"\"\"\n        return len(dataloader.dataset)\n\n    def train(self, model_path: Optional[str] = None):\n        \"\"\"\n        Main training entry point.\n\n        Args:\n            model_path:\n                (Optional) Local path to model if model to train has been instantiated from a local path\n                If present, we will try reloading the optimizer/scheduler states from there.\n        \"\"\"\n        train_dataloader = self.train_dataLoader\n        if self.args.max_steps > 0:\n            t_total = self.args.max_steps\n            num_train_epochs = (\n                self.args.max_steps // (len(train_dataloader) // self.args.gradient_accumulation_steps) + 1\n            )\n        else:\n            t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)\n            num_train_epochs = self.args.num_train_epochs\n\n        optimizer, scheduler = self.get_optimizers(num_training_steps=t_total)\n\n        # Check if saved optimizer or scheduler states exist\n        if (\n            model_path is not None\n            and os.path.isfile(os.path.join(model_path, \"optimizer.pt\"))\n            and os.path.isfile(os.path.join(model_path, \"scheduler.pt\"))\n        ):\n            # Load in optimizer and scheduler states\n            optimizer.load_state_dict(\n                torch.load(os.path.join(model_path, \"optimizer.pt\"), map_location=self.args.device)\n            )\n            scheduler.load_state_dict(torch.load(os.path.join(model_path, \"scheduler.pt\")))\n\n        model = self.model\n        if self.args.fp16:\n            if not is_apex_available():\n                raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n            model, optimizer = amp.initialize(model, optimizer, opt_level=self.args.fp16_opt_level)\n\n        # multi-gpu training (should be after apex fp16 initialization)\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n\n        # Distributed training (should be after apex fp16 initialization)\n        if self.args.local_rank != -1:\n            model = 
torch.nn.parallel.DistributedDataParallel(\n                model,\n                device_ids=[self.args.local_rank],\n                output_device=self.args.local_rank,\n                find_unused_parameters=True,\n            )\n\n        if self.tb_writer is not None:\n            self.tb_writer.add_text(\"args\", self.args.to_json_string())\n            self.tb_writer.add_hparams(self.args.to_sanitized_dict(), metric_dict={})\n\n        # Train!\n        if is_tpu_available():\n            total_train_batch_size = self.args.train_batch_size * xm.xrt_world_size()\n        else:\n            total_train_batch_size = (\n                self.args.train_batch_size\n                * self.args.gradient_accumulation_steps\n                * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)\n            )\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_examples(train_dataloader))\n        logger.info(\"  Num Epochs = %d\", num_train_epochs)\n        logger.info(\"  Instantaneous batch size per device = %d\", self.args.per_device_train_batch_size)\n        logger.info(\"  Total train batch size (w. parallel, distributed & accumulation) = %d\", total_train_batch_size)\n        logger.info(\"  Gradient Accumulation steps = %d\", self.args.gradient_accumulation_steps)\n        logger.info(\"  Total optimization steps = %d\", t_total)\n\n        self.global_step = 0\n        self.epoch = 0\n        epochs_trained = 0\n        steps_trained_in_current_epoch = 0\n        # Check if continuing training from a checkpoint\n        if model_path is not None:\n            # set global_step to global_step of last saved checkpoint from model path\n            try:\n                self.global_step = int(model_path.split(\"-\")[-1].split(\"/\")[0])\n                epochs_trained = self.global_step // (len(train_dataloader) // self.args.gradient_accumulation_steps)\n                steps_trained_in_current_epoch = self.global_step % (\n                    len(train_dataloader) // self.args.gradient_accumulation_steps\n                )\n\n                logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n                logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n                logger.info(\"  Continuing training from global step %d\", self.global_step)\n                logger.info(\"  Will skip the first %d steps in the first epoch\", steps_trained_in_current_epoch)\n            except ValueError:\n                self.global_step = 0\n                logger.info(\"  Starting fine-tuning.\")\n\n        tr_loss = 0.0\n        logging_loss = 0.0\n        tqdmLoss=0  # the loss shown on the progress bar uses a moving average\n        beta_exp=1\n        model.zero_grad()\n        train_iterator = trange(\n            epochs_trained, int(num_train_epochs), desc=\"Epoch\", disable=True\n        )\n        for epoch in train_iterator:\n            last=time.time()\n            if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):\n                train_dataloader.sampler.set_epoch(epoch)\n\n            if is_tpu_available():\n                parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(\n                    self.args.device\n                )\n                epoch_iterator = tqdm(parallel_loader, desc=\"Iteration\", disable=not self.is_local_master())\n
            else:\n                epoch_iterator = tqdm(train_dataloader, desc=\"Iteration\", disable=True,ncols=70)  # fix the bar width, otherwise it wraps to a new line\n\n            for step, inputs in enumerate(epoch_iterator):\n\n                # Skip past any already trained steps if resuming training\n                if steps_trained_in_current_epoch > 0:\n                    steps_trained_in_current_epoch -= 1\n                    continue\n                now_loss=self._training_step(model, inputs, optimizer)\n                tr_loss += now_loss\n                # enrich the progress bar display\n                tqdmLoss=tqdmLoss*0.99+(1-0.99)*now_loss  # exponential moving average of the loss\n                beta_exp*=0.99  # bias-correction factor for the moving average\n\n                epoch_iterator.set_description_str(f\"epoch：{epoch+1}\")\n                epoch_iterator.set_postfix_str(f\"loss：{round(tqdmLoss/(1-beta_exp),4)}\")\n                if (step + 1) % self.args.gradient_accumulation_steps == 0 or (\n                    # last step in epoch but step is always smaller than gradient_accumulation_steps\n                    len(epoch_iterator) <= self.args.gradient_accumulation_steps\n                    and (step + 1) == len(epoch_iterator)\n                ):\n                    if self.args.fp16:\n                        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), self.args.max_grad_norm)\n                    else:\n                        torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)\n\n                    if is_tpu_available():\n                        xm.optimizer_step(optimizer)\n                    else:\n                        optimizer.step()\n\n                    scheduler.step()\n                    model.zero_grad()\n                    self.global_step += 1\n                    self.epoch = epoch + (step + 1) / len(epoch_iterator)\n\n                    if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (\n                        self.global_step == 1 and self.args.logging_first_step\n                    ):\n                        logs: Dict[str, float] = {}\n                        logs[\"loss\"] = (tr_loss - logging_loss) / self.args.logging_steps\n                        # backward compatibility for pytorch schedulers\n                        logs[\"learning_rate\"] = (\n                            scheduler.get_last_lr()[0]\n                            if version.parse(torch.__version__) >= version.parse(\"1.4\")\n                            else scheduler.get_lr()[0]\n                        )\n                        logging_loss = tr_loss\n                        print()  # newline before logging so the output does not collide with the progress bar\n                        self._log(logs)\n                        print()\n                        if self.args.evaluate_during_training:\n                            self.evaluate()\n\n                    if self.args.save_steps > 0 and self.global_step % self.args.save_steps==0:\n                        # In all cases (even distributed/parallel), self.model is always a reference\n                        # to the model we want to save.\n                        if hasattr(model, \"module\"):\n                            assert model.module is self.model\n                        else:\n                            assert model is self.model\n                        # Save model checkpoint\n                        output_dir = os.path.join(self.args.output_dir, f\"{PREFIX_CHECKPOINT_DIR}-{self.global_step}-epoch-{int(self.epoch)}\")\n\n                        self.save_model(output_dir)\n\n                        if self.is_world_master():\n                            self._rotate_checkpoints()
\n\n                        if is_tpu_available():\n                            xm.rendezvous(\"saving_optimizer_states\")\n                            xm.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            xm.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n                        elif self.is_world_master():\n                            torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n\n                if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                    epoch_iterator.close()\n                    break\n            print(f\"Time taken for pre-training epoch {epoch}: \",time.time()-last)\n            if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                train_iterator.close()\n                break\n            if self.args.tpu_metrics_debug:\n                # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n                xm.master_print(met.metrics_report())\n        if self.tb_writer:\n            self.tb_writer.close()\n\n        logger.info(\"\\n\\nTraining completed. Do not forget to share your model on huggingface.co/models =)\\n\\n\")\n        return TrainOutput(self.global_step, tr_loss / self.global_step)\n\n    def _log(self, logs: Dict[str, float], iterator: Optional[tqdm] = None) -> None:\n        if self.epoch is not None:\n            logs[\"epoch\"] = self.epoch\n        if self.tb_writer:\n            for k, v in logs.items():\n                self.tb_writer.add_scalar(k, v, self.global_step)\n        if is_wandb_available():\n            wandb.log(logs, step=self.global_step)\n        output = json.dumps({**logs, **{\"step\": self.global_step}})\n        if iterator is not None:\n            iterator.write(output)\n        else:\n            print(output)\n\n    def _training_step(\n        self, model: nn.Module, inputs: Dict[str, torch.Tensor], optimizer: torch.optim.Optimizer\n    ) -> float:\n        model.train()\n        for k, v in inputs.items():\n            inputs[k] = v.to(self.args.device)\n\n        outputs = model(**inputs)\n        loss = outputs[0]  # model outputs are always tuple in transformers1 (see doc)\n\n        if self.args.n_gpu > 1:\n            loss = loss.mean()  # mean() to average on multi-gpu parallel training\n        if self.args.gradient_accumulation_steps > 1:\n            loss = loss / self.args.gradient_accumulation_steps\n\n        if self.args.fp16:\n            with amp.scale_loss(loss, optimizer) as scaled_loss:\n                scaled_loss.backward()\n        else:\n            loss.backward()\n\n        return loss.item()\n\n    def is_local_master(self) -> bool:\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=True)\n        else:\n            return self.args.local_rank in [-1, 0]\n\n    def is_world_master(self) -> bool:\n        \"\"\"\n        This will be True only in one process, even in distributed mode,\n        even when training on multiple machines.\n        \"\"\"\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=False)\n        else:\n            return self.args.local_rank == -1 or torch.distributed.get_rank() == 0\n\n    def save_model(self, output_dir: Optional[str] = None):\n        \"\"\"\n        Saving best-practices: if you use 
default names for the model,\n        you can reload it using from_pretrained().\n\n        Will only save from the world_master process (unless in TPUs).\n        \"\"\"\n\n        if is_tpu_available():\n            self._save_tpu(output_dir)\n        elif self.is_world_master():\n            self._save(output_dir)\n\n    def _save_tpu(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n        if xm.is_master_ordinal():\n            os.makedirs(output_dir, exist_ok=True)\n            torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n\n        xm.rendezvous(\"saving_checkpoint\")\n        self.model.save_pretrained(output_dir)\n\n    def _save(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        os.makedirs(output_dir, exist_ok=True)\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n        self.model.save_pretrained(output_dir)\n\n        # Good practice: save your training arguments together with the trained model\n        torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n    def _sorted_checkpoints(self, checkpoint_prefix=PREFIX_CHECKPOINT_DIR, use_mtime=False) -> List[str]:\n        ordering_and_checkpoint_path = []\n\n        glob_checkpoints = [str(x) for x in Path(self.args.output_dir).glob(f\"{checkpoint_prefix}-*\")]\n\n        for path in glob_checkpoints:\n            if use_mtime:\n                ordering_and_checkpoint_path.append((os.path.getmtime(path), path))\n            else:\n                regex_match = re.match(f\".*{checkpoint_prefix}-([0-9]+)\", path)\n                if regex_match and regex_match.groups():\n                    ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))\n\n        checkpoints_sorted = sorted(ordering_and_checkpoint_path)\n        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]\n        return checkpoints_sorted\n\n    def _rotate_checkpoints(self, use_mtime=False) -> None:\n        if self.args.save_total_limit is None or self.args.save_total_limit <= 0:\n            return\n\n        # Check if we should delete older checkpoint(s)\n        checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime)\n        if len(checkpoints_sorted) <= self.args.save_total_limit:\n            return\n\n        number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - self.args.save_total_limit)\n        checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]\n        for checkpoint in checkpoints_to_be_deleted:\n            curEpoch = checkpoint.split('-')[-1]\n            print(checkpoint,curEpoch)\n            if int(curEpoch) % 50 == 0:\n                continue\n            logger.info(\"Deleting older checkpoint [{}] 
due to args.save_total_limit\".format(checkpoint))\n            shutil.rmtree(checkpoint)\n\n    def evaluate(\n        self, eval_dataset: Optional[Dataset] = None, prediction_loss_only: Optional[bool] = None,\n    ) -> Dict[str, float]:\n        \"\"\"\n        Run evaluation and return metrics.\n\n        The calling script will be responsible for providing a method to compute metrics, as they are\n        task-dependent.\n\n        Args:\n            eval_dataset: (Optional) Pass a dataset if you wish to override\n            the one on the instance.\n        Returns:\n            A dict containing:\n                - the eval loss\n                - the potential metrics computed from the predictions\n        \"\"\"\n        eval_dataloader = self.eval_dataLoader\n\n        output = self._prediction_loop(eval_dataloader, description=\"Evaluation\")\n\n        self._log(output.metrics)\n\n        if self.args.tpu_metrics_debug:\n            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n            xm.master_print(met.metrics_report())\n\n        return output.metrics\n\n    def predict(self, test_dataset: Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        \"\"\"\n        test_dataloader = self.get_test_dataloader(test_dataset)\n\n        return self._prediction_loop(test_dataloader, description=\"Prediction\")\n\n    def _prediction_loop(\n        self, dataloader: DataLoader, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n\n        Works both with or without labels.\n        \"\"\"\n\n        prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else self.prediction_loss_only\n\n        model = self.model\n        # multi-gpu eval\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n        else:\n            model = self.model\n        # Note: in torch.distributed mode, there's no point in wrapping the model\n        # inside a DistributedDataParallel as we'll be under `no_grad` anyways.\n\n        batch_size = dataloader.batch_size\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Num examples = %d\", self.num_examples(dataloader))\n        logger.info(\"  Batch size = %d\", batch_size)\n        eval_losses: List[float] = []\n        preds: torch.Tensor = None\n        label_ids: torch.Tensor = None\n        model.eval()\n\n        if is_tpu_available():\n            dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)\n\n        for inputs in tqdm(dataloader, desc=description):\n            has_labels = any(inputs.get(k) is not None for k in [\"labels\", \"lm_labels\", \"masked_lm_labels\"])\n\n            for k, v in inputs.items():\n                inputs[k] = v.to(self.args.device)\n\n            with torch.no_grad():\n                outputs = model(**inputs)\n                if has_labels:\n                    step_eval_loss, logits = outputs[:2]\n                    eval_losses += [step_eval_loss.mean().item()]\n                else:\n                    logits = outputs[0]\n\n            if not 
prediction_loss_only:\n                if preds is None:\n                    preds = logits.detach()\n                else:\n                    preds = torch.cat((preds, logits.detach()), dim=0)\n                if inputs.get(\"labels\") is not None:\n                    if label_ids is None:\n                        label_ids = inputs[\"labels\"].detach()\n                    else:\n                        label_ids = torch.cat((label_ids, inputs[\"labels\"].detach()), dim=0)\n\n        if self.args.local_rank != -1:\n            # In distributed mode, concatenate all results from all nodes:\n            if preds is not None:\n                preds = self.distributed_concat(preds, num_total_examples=self.num_examples(dataloader))\n            if label_ids is not None:\n                label_ids = self.distributed_concat(label_ids, num_total_examples=self.num_examples(dataloader))\n        elif is_tpu_available():\n            # tpu-comment: Get all predictions and labels from all worker shards of eval dataset\n            if preds is not None:\n                preds = xm.mesh_reduce(\"eval_preds\", preds, torch.cat)\n            if label_ids is not None:\n                label_ids = xm.mesh_reduce(\"eval_label_ids\", label_ids, torch.cat)\n\n        # Finally, turn the aggregated tensors into numpy arrays.\n        if preds is not None:\n            preds = preds.cpu().numpy()\n        if label_ids is not None:\n            label_ids = label_ids.cpu().numpy()\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n        if len(eval_losses) > 0:\n            metrics[\"eval_loss\"] = np.mean(eval_losses)\n\n        # Prefix all keys with eval_\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def distributed_concat(self, tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:\n        assert self.args.local_rank != -1\n\n        output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]\n        torch.distributed.all_gather(output_tensors, tensor)\n\n        concat = torch.cat(output_tensors, dim=0)\n\n        # truncate the dummy elements added by SequentialDistributedSampler\n        output = concat[:num_total_examples]\n        return output\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/trainer_tf.py",
    "content": "\"\"\"Tensorflow trainer class.\"\"\"\n\nimport logging\nimport math\nimport os\nfrom typing import Callable, Dict, Optional\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import TFPreTrainedModel, shape_list\nfrom .optimization_tf import GradientAccumulator, create_optimizer\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFTrainer:\n    model: TFPreTrainedModel\n    args: TFTrainingArguments\n    # something similar to a PT Dataset.\n    # This is just temporary before to have\n    # a framework-agnostic approach for datasets.\n    train_dataset: Optional[tf.data.Dataset]\n    eval_dataset: Optional[tf.data.Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n\n    def __init__(\n        self,\n        model: TFPreTrainedModel,\n        args: TFTrainingArguments,\n        train_dataset: Optional[tf.data.Dataset] = None,\n        eval_dataset: Optional[tf.data.Dataset] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n    ):\n        self.model = model\n        self.args = args\n        self.train_dataset = train_dataset\n        self.eval_dataset = eval_dataset\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.gradient_accumulator = GradientAccumulator()\n\n        self._setup_training()\n\n    def _setup_training(self) -> None:\n        \"\"\"\n        Setup the different steps to train a model:\n          - check if all the data are given\n          - create the proper strategy\n          - create the features\n          - prepare the model settings\n        \"\"\"\n        self._prepare_dataset()\n\n        with self.args.strategy.scope():\n            self._create_optimizer()\n            _ = self.optimizer.iterations\n            self._set_loss_and_metric()\n            self._create_checkpoint_manager()\n            self._create_summary_writer()\n\n    def _set_loss_and_metric(self) -> None:\n        \"\"\"\n        Create the training loss and metric with their name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        try:\n            self.loss = tf.keras.losses.get(\n                {\n                    \"class_name\": self.args.loss_name,\n                    \"config\": {\"from_logits\": True, \"reduction\": tf.keras.losses.Reduction.NONE},\n                }\n            )\n        except TypeError:\n            self.loss = tf.keras.losses.get(\n                {\"class_name\": self.args.loss_name, \"config\": {\"reduction\": tf.keras.losses.Reduction.NONE}}\n            )\n\n    def _create_summary_writer(self) -> None:\n        \"\"\"\n        Create a summary writer to be able to read the logs in Tensorboard.\n        \"\"\"\n        self.writer = tf.summary.create_file_writer(self.args.logging_dir)\n\n    def _prepare_dataset(self) -> None:\n        \"\"\"\n        Prepare the training, validation and test data.\n        \"\"\"\n        if self.train_dataset is not None:\n            self.num_train_examples = self.train_dataset.reduce(tf.constant(0), lambda x, _: x + 1).numpy()\n\n            if self.args.max_steps > 0:\n                self.train_steps = self.args.max_steps\n            else:\n                self.train_steps: int = math.ceil(self.num_train_examples / self.args.train_batch_size)\n\n            self.train_dataset = (\n                self.train_dataset.cache()\n                .shuffle(self.num_train_examples)\n                .batch(self.args.train_batch_size)\n                .prefetch(tf.data.experimental.AUTOTUNE)\n            )\n\n            if self.args.max_steps > 0:\n                self.train_dataset = self.train_dataset.repeat(-1)\n\n            self.train_dataset = self.args.strategy.experimental_distribute_dataset(self.train_dataset)\n        else:\n            self.train_steps = 0\n\n        if self.eval_dataset is not None:\n            self.eval_dataset = (\n                self.eval_dataset.batch(self.args.eval_batch_size).cache().prefetch(tf.data.experimental.AUTOTUNE)\n            )\n            self.eval_dataset = self.args.strategy.experimental_distribute_dataset(self.eval_dataset)\n\n    def _create_optimizer(self) -> None:\n        \"\"\"\n        Create the training optimizer with its name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        if self.args.optimizer_name == \"adamw\":\n            self.optimizer = create_optimizer(\n                self.args.learning_rate, self.train_steps, self.args.warmup_steps, self.args.end_lr\n            )\n        else:\n            try:\n                self.optimizer = tf.keras.optimizers.get(\n                    {\n                        \"class_name\": self.args.optimizer_name,\n                        \"config\": {\"learning_rate\": self.args.learning_rate, \"epsilon\": self.args.adam_epsilon},\n                    }\n                )\n            except TypeError:\n                # This is for the case where the optimizer is not Adam-like such as SGD\n                self.optimizer = tf.keras.optimizers.get(\n                    {\"class_name\": self.args.optimizer_name, \"config\": {\"learning_rate\": self.args.learning_rate}}\n                )\n        logger.info(\"Created an/a {} optimizer\".format(self.args.optimizer_name))\n\n    def _create_checkpoint_manager(self, max_to_keep: int = 5, load_model: bool = True) -> None:\n        \"\"\"\n        Create a checkpoint manager in order to be able to make the training\n        fault-tolerant.\n        Args:\n          max_to_keep: the maximum number of checkpoints to keep in the checkpoint path.\n          load_model: if we want to start the training from the latest checkpoint.\n        \"\"\"\n        ckpt = tf.train.Checkpoint(optimizer=self.optimizer, model=self.model)\n\n        self.model.ckpt_manager = tf.train.CheckpointManager(ckpt, PREFIX_CHECKPOINT_DIR, max_to_keep=max_to_keep)\n\n        if load_model:\n            ckpt.restore(self.model.ckpt_manager.latest_checkpoint).expect_partial()\n\n    @tf.function\n    def _evaluate_steps(self, per_replica_features, per_replica_labels):\n        \"\"\"\n        One step evaluation across replica.\n        Args:\n          per_replica_features: the batched features.\n          per_replica_labels: the batched labels.\n        Returns:\n          The loss corresponding to the given batch.\n        \"\"\"\n        per_replica_loss, per_replica_logits = self.args.strategy.experimental_run_v2(\n            self._run_model, args=(per_replica_features, per_replica_labels, False)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss, per_replica_logits\n\n    def _prediction_loop(\n        self, dataset: tf.data.Dataset, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Batch size = %d\", self.args.eval_batch_size)\n\n        label_ids: np.ndarray = None\n        preds: np.ndarray = None\n\n        step: int = 1\n\n        for features, labels in dataset:\n            step = tf.convert_to_tensor(step, dtype=tf.int64)\n            loss, logits = self._evaluate_steps(features, labels)\n            loss = tf.reduce_mean(loss)\n\n            if not prediction_loss_only:\n                if self.args.n_gpu > 1:\n                    for val in logits.values:\n                        if preds is None:\n                            preds = val.numpy()\n                        
else:\n                            preds = np.append(preds, val.numpy(), axis=0)\n\n                    for val in labels.values:\n                        if label_ids is None:\n                            label_ids = val.numpy()\n                        else:\n                            label_ids = np.append(label_ids, val.numpy(), axis=0)\n                else:\n                    if preds is None:\n                        preds = logits.numpy()\n                    else:\n                        preds = np.append(preds, logits.numpy(), axis=0)\n\n                    if label_ids is None:\n                        label_ids = labels.numpy()\n                    else:\n                        label_ids = np.append(label_ids, labels.numpy(), axis=0)\n\n            step += 1\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n\n        metrics[\"eval_loss\"] = loss.numpy()\n\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def evaluate(\n        self, eval_dataset: Optional[tf.data.Dataset] = None, prediction_loss_only: Optional[bool] = None\n    ) -> Dict[str, float]:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n        \"\"\"\n        if eval_dataset is None:\n            eval_dataset = self.eval_dataset\n\n        output = self._prediction_loop(eval_dataset, description=\"Evaluation\")\n\n        return output.metrics\n\n    def train(self) -> None:\n        \"\"\"\n        Train method to train the model.\n        \"\"\"\n        if self.args.debug:\n            tf.summary.trace_on(graph=True, profiler=True)\n\n        self.gradient_accumulator.reset()\n\n        iterations = self.optimizer.iterations\n\n        if iterations.numpy() > 0:\n            logger.info(\"Start the training from the last checkpoint\")\n            start_epoch = (iterations.numpy() // self.train_steps) + 1\n        else:\n            start_epoch = 1\n\n        tf.summary.experimental.set_step(iterations)\n\n        epochs = 1 if self.args.max_steps > 0 else self.args.num_train_epochs\n\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_train_examples)\n        logger.info(\"  Num Epochs = %d\", epochs)\n        logger.info(\"  Total optimization steps = %d\", self.train_steps)\n\n        for epoch in range(start_epoch, int(epochs + 1)):\n            for training_loss in self._training_steps():\n                step = iterations.numpy()\n\n                if self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.scalar(\"loss\", training_loss, step=step)\n\n                if step == 1 and self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.trace_export(name=\"training\", step=step, profiler_outdir=self.args.logging_dir)\n\n                if self.args.evaluate_during_training and step % self.args.eval_steps == 0:\n                    logs = {}\n                    results = self.evaluate()\n\n                    for key, value in results.items():\n                        eval_key = \"eval_{}\".format(key)\n                   
     logs[eval_key] = value\n\n                    if callable(self.optimizer.learning_rate):\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate(step).numpy()\n                    else:\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate.numpy()\n\n                    logger.info(\"Epoch {} Step {} Validation Metrics {}\".format(epoch, step, logs))\n\n                    with self.writer.as_default():\n                        for k, v in logs.items():\n                            tf.summary.scalar(k, v, step=step)\n\n                if step % self.args.logging_steps == 0:\n                    logger.info(\"Epoch {} Step {} Train Loss {:.4f}\".format(epoch, step, training_loss.numpy()))\n\n                if step % self.args.save_steps == 0:\n                    ckpt_save_path = self.model.ckpt_manager.save()\n                    logger.info(\"Saving checkpoint for step {} at {}\".format(step, ckpt_save_path))\n\n                if step % self.train_steps == 0:\n                    break\n\n    def _training_steps(self):\n        \"\"\"\n        Returns a generator over training steps (i.e. parameters update).\n        \"\"\"\n        for i, loss in enumerate(self._accumulate_next_gradients()):\n            if i % self.args.gradient_accumulation_steps == 0:\n                self._apply_gradients()\n                yield loss\n\n    @tf.function\n    def _apply_gradients(self):\n        \"\"\"Applies the gradients (cross-replica).\"\"\"\n        self.args.strategy.experimental_run_v2(self._step)\n\n    def _step(self):\n        \"\"\"Applies gradients and resets accumulation.\"\"\"\n        gradient_scale = self.gradient_accumulator.step * self.args.strategy.num_replicas_in_sync\n        gradients = [\n            gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients\n        ]\n        gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]\n\n        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))\n        self.gradient_accumulator.reset()\n\n    def _accumulate_next_gradients(self):\n        \"\"\"Accumulates the gradients from the next element in dataset.\"\"\"\n        iterator = iter(self.train_dataset)\n\n        @tf.function\n        def _accumulate_next():\n            per_replica_features, per_replica_labels = next(iterator)\n\n            return self._accumulate_gradients(per_replica_features, per_replica_labels)\n\n        while True:\n            try:\n                yield _accumulate_next()\n            except tf.errors.OutOfRangeError:\n                break\n\n    def _accumulate_gradients(self, per_replica_features, per_replica_labels):\n        \"\"\"Accumulates the gradients across all the replica.\"\"\"\n        per_replica_loss = self.args.strategy.experimental_run_v2(\n            self._forward, args=(per_replica_features, per_replica_labels)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss\n\n    def _forward(self, features, labels):\n        \"\"\"Forwards a training example and accumulates the gradients.\"\"\"\n        per_example_loss, _ = self._run_model(features, labels, True)\n        gradients = 
tf.gradients(per_example_loss, self.model.trainable_variables)\n        gradients = [\n            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)\n        ]\n\n        self.gradient_accumulator(gradients)\n\n        return per_example_loss\n\n    def _run_model(self, features, labels, training):\n        \"\"\"\n        Computes the loss of the given features and labels pair.\n        Args:\n          features: the batched features.\n          labels: the batched labels.\n          training: run the model in training mode or not\n        \"\"\"\n        if self.args.mode == \"text-classification\" or self.args.mode == \"token-classification\":\n            logits = self.model(features, training=training)[0]\n        else:\n            logits = self.model(features, training=training)\n\n        if self.args.mode == \"token-classification\":\n            active_loss = tf.reshape(labels, (-1,)) != -1\n            reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)\n            labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active_loss)\n            loss = self.loss(labels, reduced_logits)\n        elif self.args.mode == \"question-answering\":\n            start_loss = self.loss(labels[\"start_position\"], logits[0])\n            end_loss = self.loss(labels[\"end_position\"], logits[1])\n            loss = (start_loss + end_loss) / 2.0\n        else:\n            loss = self.loss(labels, logits)\n\n        loss += sum(self.model.losses) * (1.0 / self.args.n_gpu)\n\n        return loss, logits\n\n    def predict(self, test_dataset: tf.data.Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        Args:\n          test_dataset: something similar to a PT Dataset. This is just\n            temporary before to have a framework-agnostic approach for datasets.\n        \"\"\"\n        test_dataset = test_dataset.batch(self.args.eval_batch_size)\n        test_dataset = self.args.strategy.experimental_distribute_dataset(test_dataset)\n\n        return self._prediction_loop(test_dataset, description=\"Prediction\")\n\n    def save_model(self) -> None:\n        \"\"\"\n        Save the pretrained model and create a Tensorflow saved model.\n        \"\"\"\n        logger.info(\"Saving model in {}\".format(self.args.output_dir))\n\n        path = os.path.join(self.args.output_dir, \"saved_model\")\n\n        logger.info(\"Saving model in {}\".format(path))\n        os.makedirs(path, exist_ok=True)\n        self.model.save_pretrained(self.args.output_dir)\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/trainer_utils.py",
    "content": "from typing import Dict, NamedTuple, Optional\n\nimport numpy as np\n\n\nclass EvalPrediction(NamedTuple):\n    \"\"\"\n    Evaluation output (always contains labels), to be used\n    to compute metrics.\n    \"\"\"\n\n    predictions: np.ndarray\n    label_ids: np.ndarray\n\n\nclass PredictionOutput(NamedTuple):\n    predictions: np.ndarray\n    label_ids: Optional[np.ndarray]\n    metrics: Optional[Dict[str, float]]\n\n\nclass TrainOutput(NamedTuple):\n    global_step: int\n    training_loss: float\n\n\nPREFIX_CHECKPOINT_DIR = \"checkpoint\"\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/training_args.py",
    "content": "import dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Any, Dict, Optional, Tuple\n\nfrom .file_utils import cached_property, is_torch_available, torch_required\n\n\nif is_torch_available():\n    import torch\n\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass TrainingArguments:\n    \"\"\"\n    TrainingArguments is the subset of the arguments we use in our example scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    output_dir: str = field(\n        metadata={\"help\": \"The output directory where the model predictions and checkpoints will be written.\"}\n    )\n    overwrite_output_dir: bool = field(\n        default=False,\n        metadata={\n            \"help\": (\n                \"Overwrite the content of the output directory.\"\n                \"Use this to continue training if output_dir points to a checkpoint directory.\"\n            )\n        },\n    )\n\n    do_train: bool = field(default=False, metadata={\"help\": \"Whether to run training.\"})\n    do_eval: bool = field(default=False, metadata={\"help\": \"Whether to run eval on the dev set.\"})\n    do_predict: bool = field(default=False, metadata={\"help\": \"Whether to run predictions on the test set.\"})\n    evaluate_during_training: bool = field(\n        default=False, metadata={\"help\": \"Run evaluation during training at each logging step.\"},\n    )\n\n    per_device_train_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for training.\"}\n    )\n    per_device_eval_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for evaluation.\"}\n    )\n\n    per_gpu_train_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_train_batch_size` is preferred. 
\"\n            \"Batch size per GPU/TPU core/CPU for training.\"\n        },\n    )\n    per_gpu_eval_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_eval_batch_size` is preferred.\"\n            \"Batch size per GPU/TPU core/CPU for evaluation.\"\n        },\n    )\n\n    gradient_accumulation_steps: int = field(\n        default=1,\n        metadata={\"help\": \"Number of updates steps to accumulate before performing a backward/update pass.\"},\n    )\n\n    learning_rate: float = field(default=5e-5, metadata={\"help\": \"The initial learning rate for Adam.\"})\n    lr_end: float = field(default=1e-5, metadata={\"help\": \"学习率最后衰减到多少.\"})\n    weight_decay: float = field(default=0.0, metadata={\"help\": \"Weight decay if we apply some.\"})\n    adam_epsilon: float = field(default=1e-8, metadata={\"help\": \"Epsilon for Adam optimizer.\"})\n    max_grad_norm: float = field(default=1.0, metadata={\"help\": \"Max gradient norm.\"})\n\n    num_train_epochs: float = field(default=3.0, metadata={\"help\": \"Total number of training epochs to perform.\"})\n    max_steps: int = field(\n        default=-1,\n        metadata={\"help\": \"If > 0: set total number of training steps to perform. Override num_train_epochs.\"},\n    )\n    warmup_steps: int = field(default=0, metadata={\"help\": \"Linear warmup over warmup_steps.\"})\n\n    logging_dir: Optional[str] = field(default=None, metadata={\"help\": \"Tensorboard log dir.\"})\n    logging_first_step: bool = field(default=False, metadata={\"help\": \"Log and eval the first global_step\"})\n    logging_steps: int = field(default=500, metadata={\"help\": \"Log every X updates steps.\"})\n    save_steps: int = field(default=500, metadata={\"help\": \"Save checkpoint every X updates steps.\"})\n    save_total_limit: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": (\n                \"Limit the total amount of checkpoints.\"\n                \"Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints\"\n            )\n        },\n    )\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Do not use CUDA even when it is available\"})\n    seed: int = field(default=42, metadata={\"help\": \"random seed for initialization\"})\n\n    fp16: bool = field(\n        default=False,\n        metadata={\"help\": \"Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit\"},\n    )\n    fp16_opt_level: str = field(\n        default=\"O1\",\n        metadata={\n            \"help\": (\n                \"For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3'].\"\n                \"See details at https://nvidia.github.io/apex/amp.html\"\n            )\n        },\n    )\n    local_rank: int = field(default=-1, metadata={\"help\": \"For distributed training: local_rank\"})\n\n    tpu_num_cores: Optional[int] = field(\n        default=None, metadata={\"help\": \"TPU: Number of TPU cores (automatically passed by launcher script)\"}\n    )\n    tpu_metrics_debug: bool = field(default=False, metadata={\"help\": \"TPU: Whether to print debug metrics\"})\n\n    @property\n    def train_batch_size(self) -> int:\n        if self.per_gpu_train_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future \"\n                \"version. 
Using `--per_device_train_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @property\n    def eval_batch_size(self) -> int:\n        if self.per_gpu_eval_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future \"\n                \"version. Using `--per_device_eval_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        elif self.local_rank == -1:\n            # if n_gpu is > 1 we'll use nn.DataParallel.\n            # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        else:\n            # Here, we'll use torch.distributed.\n            # Initializes the distributed backend which will take care of sychronizing nodes/GPUs\n            torch.distributed.init_process_group(backend=\"nccl\")\n            device = torch.device(\"cuda\", self.local_rank)\n            n_gpu = 1\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    def to_sanitized_dict(self) -> Dict[str, Any]:\n        \"\"\"\n        Sanitized serialization to use with TensorBoard’s hparams\n        \"\"\"\n        d = dataclasses.asdict(self)\n        valid_types = [bool, int, float, str]\n        if is_torch_available():\n            valid_types.append(torch.Tensor)\n        return {k: v if type(v) in valid_types else str(v) for k, v in d.items()}\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/training_args_tf.py",
    "content": "import logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom .file_utils import cached_property, is_tf_available, tf_required\nfrom .training_args import TrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\nif is_tf_available():\n    import tensorflow as tf\n\n\n@dataclass\nclass TFTrainingArguments(TrainingArguments):\n    optimizer_name: str = field(\n        default=\"adam\",\n        metadata={\n            \"help\": 'Name of a Tensorflow optimizer among \"adadelta, adagrad, adam, adamax, ftrl, nadam, rmsprop, sgd, adamw\"'\n        },\n    )\n    mode: str = field(\n        default=\"text-classification\",\n        metadata={\"help\": 'Type of task, one of \"text-classification\", \"token-classification\", \"question-answering\"'},\n    )\n    loss_name: str = field(\n        default=\"SparseCategoricalCrossentropy\",\n        metadata={\n            \"help\": \"Name of a Tensorflow loss. For the list see: https://www.tensorflow.org/api_docs/python/tf/keras/losses\"\n        },\n    )\n    tpu_name: str = field(\n        default=None, metadata={\"help\": \"Name of TPU\"},\n    )\n    end_lr: float = field(\n        default=0, metadata={\"help\": \"End learning rate for optimizer\"},\n    )\n    eval_steps: int = field(default=1000, metadata={\"help\": \"Run an evaluation every X steps.\"})\n    debug: bool = field(\n        default=False, metadata={\"help\": \"Activate the trace to record computation graphs and profiling information\"}\n    )\n\n    @cached_property\n    @tf_required\n    def _setup_strategy(self) -> Tuple[\"tf.distribute.Strategy\", int]:\n        logger.info(\"Tensorflow: setting up strategy\")\n        gpus = tf.config.list_physical_devices(\"GPU\")\n\n        if self.no_cuda:\n            strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n        else:\n            try:\n                if self.tpu_name:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(self.tpu_name)\n                else:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()\n            except ValueError:\n                tpu = None\n\n            if tpu:\n                tf.config.experimental_connect_to_cluster(tpu)\n                tf.tpu.experimental.initialize_tpu_system(tpu)\n\n                strategy = tf.distribute.experimental.TPUStrategy(tpu)\n            elif len(gpus) == 0:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n            elif len(gpus) == 1:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/gpu:0\")\n            elif len(gpus) > 1:\n                # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n                strategy = tf.distribute.MirroredStrategy()\n            else:\n                raise ValueError(\"Cannot find the proper strategy please check your environment properties.\")\n\n        return strategy\n\n    @property\n    @tf_required\n    def strategy(self) -> \"tf.distribute.Strategy\":\n        return self._setup_strategy\n\n    @property\n    @tf_required\n    def n_gpu(self) -> int:\n        return self._setup_strategy.num_replicas_in_sync\n"
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/try.py",
    "content": "from transformers import TFAlbertForMaskedLM, TFAlbertModel, TFAlbertForSequenceClassification, AlbertForMaskedLM\nimport os\n\ncheckpoint = \"albert-base-v1\"\n\nmodel = AlbertForMaskedLM.from_pretrained(checkpoint)\n\nif not os.path.exists(\"~/saved/\" + checkpoint):\n    os.makedirs(\"~/saved/\" + checkpoint)\n    \n\nmodel.save_pretrained(\"~/saved/\" + checkpoint)\nmodel = TFAlbertForMaskedLM.from_pretrained('~/saved/' + checkpoint, from_pt=True)\nmodel.save_pretrained(\"~/saved/\" + checkpoint)\nmodel = TFAlbertModel.from_pretrained('~/saved/' + checkpoint)\nmodel = TFAlbertForMaskedLM.from_pretrained('~/saved/' + checkpoint)\nmodel = TFAlbertForSequenceClassification.from_pretrained('~/saved/' + checkpoint)\n\n\nprint(\"nice model\") "
  },
  {
    "path": "code/bert-base-count3/pretrain/transformers1/utils_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\ndef prepare_encoder_decoder_model_kwargs(**kwargs):\n    \"\"\" Prepare the encoder and decoder's keyword arguments.\n\n    Keyword arguments come in 3 flavors:\n    - encoder-specific (prefixed by `encoder_`)\n    - decoder-specific (prefixed by `decoder_`)\n    - those that apply to the model as whole.\n\n    We let the specific kwargs override the common ones in case of\n    conflict.\n    \"\"\"\n\n    kwargs_common = {\n        argument: value\n        for argument, value in kwargs.items()\n        if not argument.startswith(\"encoder_\") and not argument.startswith(\"decoder_\")\n    }\n    if \"input_ids\" in kwargs_common:\n        kwargs[\"encoder_input_ids\"] = kwargs_common.pop(\"input_ids\")\n\n    decoder_kwargs = kwargs_common.copy()\n    encoder_kwargs = kwargs_common.copy()\n    encoder_kwargs.update(\n        {argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")}\n    )\n    decoder_kwargs.update(\n        {argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")}\n    )\n    decoder_kwargs[\"encoder_attention_mask\"] = encoder_kwargs.get(\"attention_mask\", None)\n    return encoder_kwargs, decoder_kwargs\n"
  },
  {
    "path": "code/bert-base-count3-len100/finetuning/.ipynb_checkpoints/PyTorch_Bert-Squad_OnnxRuntime_GPU-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright (c) Microsoft Corporation. All rights reserved.  \\n\",\n    \"Licensed under the MIT License.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Inference PyTorch Bert Model with ONNX Runtime on GPU\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime and NVIDIA GPU. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.\\n\",\n    \"\\n\",\n    \"This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Prerequisites ##\\n\",\n    \"It requires your machine to have a GPU, and a python environment with [PyTorch](https://pytorch.org/) installed before running this notebook.\\n\",\n    \"\\n\",\n    \"#### GPU Environment Setup using AnaConda\\n\",\n    \"\\n\",\n    \"First, we install [AnaConda](https://www.anaconda.com/distribution/) in a target machine and open an AnaConda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 1.5.0 and OnnxRuntime 1.3.0.\\n\",\n    \"\\n\",\n    \"```console\\n\",\n    \"conda create -n gpu_env python=3.7\\n\",\n    \"conda activate gpu_env\\n\",\n    \"conda install pytorch torchvision cudatoolkit=10.1 -c pytorch\\n\",\n    \"conda install -c anaconda ipykernel\\n\",\n    \"conda install -c conda-forge ipywidgets\\n\",\n    \"python -m ipykernel install --user --name=gpu_env_py37\\n\",\n    \"jupyter notebook\\n\",\n    \"```\\n\",\n    \"Finally, launch Jupyter Notebook and you can choose gpu_env_py37 as kernel to run this notebook.\\n\",\n    \"\\n\",\n    \"Onnxruntime-gpu need specified version of CUDA and cuDNN. You can find the corresponding version in [requirements](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements). 
If the version is different from above cudatoolkit version, you have to install them separately, and add their bin directories to PATH environment variable (See [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARNING: Skipping onnxruntime-gpu as it is not installed.\\u001b[0m\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import sys\\n\",\n    \"!{sys.executable} -m pip uninstall --quiet --yes onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade transformers\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxconverter_common\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools\\n\",\n    \"!{sys.executable} -m pip install --quiet wget netron pandas\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Load Pretrained Bert model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We begin by downloading the SQuAD data file and store them in the specified location. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"cache_dir = \\\"./squad\\\"\\n\",\n    \"if not os.path.exists(cache_dir):\\n\",\n    \"    os.makedirs(cache_dir)\\n\",\n    \"\\n\",\n    \"predict_file_url = \\\"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\\\"\\n\",\n    \"predict_file = os.path.join(cache_dir, \\\"dev-v1.1.json\\\")\\n\",\n    \"if not os.path.exists(predict_file):\\n\",\n    \"    import wget\\n\",\n    \"    print(\\\"Start downloading predict file.\\\")\\n\",\n    \"    wget.download(predict_file_url, predict_file)\\n\",\n    \"    print(\\\"Predict file downloaded.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's first define some constant variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Whether allow overwriting existing ONNX model and download the latest script from GitHub\\n\",\n    \"enable_overwrite = True\\n\",\n    \"\\n\",\n    \"# Total samples to inference, so that we can get average latency\\n\",\n    \"total_samples = 1000\\n\",\n    \"\\n\",\n    \"# ONNX opset version\\n\",\n    \"opset_version=11\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Specify some model configuration variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# For fine-tuned large model, the model name is \\\"bert-large-uncased-whole-word-masking-finetuned-squad\\\". Here we use bert-base for demo.\\n\",\n    \"model_name_or_path = \\\"bert-base-cased\\\"\\n\",\n    \"max_seq_length = 128\\n\",\n    \"doc_stride = 128\\n\",\n    \"max_query_length = 64\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start to load model from pretrained. This step could take a few minutes. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 48/48 [00:04<00:00, 11.28it/s]\\n\",\n      \"convert squad examples to features: 100%|██████████| 1000/1000 [00:09<00:00, 102.15it/s]\\n\",\n      \"add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 161306.98it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# The following code is adapted from HuggingFace transformers\\n\",\n    \"# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\\n\",\n    \"\\n\",\n    \"from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"\\n\",\n    \"# Load pretrained model and tokenizer\\n\",\n    \"config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\\n\",\n    \"tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\\n\",\n    \"model = model_class.from_pretrained(model_name_or_path,\\n\",\n    \"                                    from_tf=False,\\n\",\n    \"                                    config=config,\\n\",\n    \"                                    cache_dir=cache_dir)\\n\",\n    \"# load some examples\\n\",\n    \"from transformers.data.processors.squad import SquadV1Processor\\n\",\n    \"\\n\",\n    \"processor = SquadV1Processor()\\n\",\n    \"examples = processor.get_dev_examples(None, filename=predict_file)\\n\",\n    \"\\n\",\n    \"from transformers import squad_convert_examples_to_features\\n\",\n    \"features, dataset = squad_convert_examples_to_features( \\n\",\n    \"            examples=examples[:total_samples], # convert enough examples for this notebook\\n\",\n    \"            tokenizer=tokenizer,\\n\",\n    \"            max_seq_length=max_seq_length,\\n\",\n    \"            doc_stride=doc_stride,\\n\",\n    \"            max_query_length=max_query_length,\\n\",\n    \"            is_training=False,\\n\",\n    \"            return_dataset='pt'\\n\",\n    \"        )\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. 
Export the loaded model ##\\n\",\n    \"Once the model is loaded, we can export the loaded PyTorch model to ONNX.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Model exported at  ./onnx/bert-base-cased-squad_opset11.onnx\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"output_dir = \\\"./onnx\\\"\\n\",\n    \"if not os.path.exists(output_dir):\\n\",\n    \"    os.makedirs(output_dir)   \\n\",\n    \"export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))\\n\",\n    \"\\n\",\n    \"import torch\\n\",\n    \"use_gpu = torch.cuda.is_available()\\n\",\n    \"device = torch.device(\\\"cuda\\\" if use_gpu else \\\"cpu\\\")\\n\",\n    \"\\n\",\n    \"# Get the first example data to run the model and export it to ONNX\\n\",\n    \"data = dataset[0]\\n\",\n    \"inputs = {\\n\",\n    \"    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"# Set model to inference mode, which is required before exporting the model because some operators behave differently in \\n\",\n    \"# inference and training mode.\\n\",\n    \"model.eval()\\n\",\n    \"model.to(device)\\n\",\n    \"\\n\",\n    \"if enable_overwrite or not os.path.exists(export_model_path):\\n\",\n    \"    with torch.no_grad():\\n\",\n    \"        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\\n\",\n    \"        torch.onnx.export(model,                                            # model being run\\n\",\n    \"                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\\n\",\n    \"                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\\n\",\n    \"                          opset_version=opset_version,                      # the ONNX version to export the model to\\n\",\n    \"                          do_constant_folding=True,                         # whether to execute constant folding for optimization\\n\",\n    \"                          input_names=['input_ids',                         # the model's input names\\n\",\n    \"                                       'input_mask', \\n\",\n    \"                                       'segment_ids'],\\n\",\n    \"                          output_names=['start', 'end'],                    # the model's output names\\n\",\n    \"                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\\n\",\n    \"                                        'input_mask' : symbolic_names,\\n\",\n    \"                                        'segment_ids' : symbolic_names,\\n\",\n    \"                                        'start' : symbolic_names,\\n\",\n    \"                                        'end' : symbolic_names})\\n\",\n    \"        print(\\\"Model exported at \\\", export_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. 
PyTorch Inference ##\\n\",\n    \"Use PyTorch to evaluate an example input for comparison purpose.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"PyTorch cuda Inference time = 16.57 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\\n\",\n    \"latency = []\\n\",\n    \"with torch.no_grad():\\n\",\n    \"    for i in range(total_samples):\\n\",\n    \"        data = dataset[i]\\n\",\n    \"        inputs = {\\n\",\n    \"            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"        }\\n\",\n    \"        start = time.time()\\n\",\n    \"        outputs = model(**inputs)\\n\",\n    \"        latency.append(time.time() - start)\\n\",\n    \"print(\\\"PyTorch {} Inference time = {} ms\\\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Inference ONNX Model with ONNX Runtime ##\\n\",\n    \"\\n\",\n    \"### CUDA and cuDNN Path\\n\",\n    \"onnxruntime-gpu has dependency on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn):\\n\",\n    \"\\n\",\n    \"* [onnxruntime-gpu v1.3.0](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"* [onnxruntime-gpu v1.2.0](https://github.com/microsoft/onnxruntime/releases/tag/v1.2.0) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"\\n\",\n    \"During installing PyTorch 1.5, we installed cudatoolkit 10.1.243 in this conda environment. That shall be good for onnxruntime-gpu 1.3.0 in Jupyter Notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Change to True when onnxruntime (like onnxruntime-gpu 1.0.0 ~ 1.1.2) cannot be imported.\\n\",\n    \"add_cuda_path = False\\n\",\n    \"\\n\",\n    \"if add_cuda_path:\\n\",\n    \"    # Add path of CUDA 10.0 and CUDNN 7.6 for onnxruntime-gpu 1.0.0 ~ 1.1.2\\n\",\n    \"    cuda_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    cudnn_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):\\n\",\n    \"        raise ValueError(\\\"Please specify correct path for CUDA and cuDNN. Otherwise onnxruntime cannot be imported.\\\")\\n\",\n    \"    else:\\n\",\n    \"        if cuda_dir == cudnn_dir:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + os.environ[\\\"PATH\\\"]\\n\",\n    \"        else:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + cudnn_dir + ';' + os.environ[\\\"PATH\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### OpenMP Environment Variable\\n\",\n    \"\\n\",\n    \"OpenMP environment variables are optional for GPU inference of standard Bert model. It has little performance impact on Bert model since most nodes are executed in GPU. 
\\n\",\n    \"\\n\",\n    \"You can find the best setting based on [Performance Test Tool](#Performance-Test-Tool) result in later part of this notebook.\\n\",\n    \"\\n\",\n    \"**Attention: Setting environment variables shall be done before importing onnxruntime**. Otherwise, they might not take effect.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Optional. You can change them according to Performance Test Tool result.\\n\",\n    \"#os.environ[\\\"OMP_NUM_THREADS\\\"] = '1'\\n\",\n    \"#os.environ[\\\"OMP_WAIT_POLICY\\\"] = 'PASSIVE'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are ready to inference the model with ONNX Runtime.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"OnnxRuntime gpu Inference time = 4.43 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import psutil\\n\",\n    \"import onnxruntime\\n\",\n    \"import numpy\\n\",\n    \"\\n\",\n    \"assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\\n\",\n    \"device_name = 'gpu'\\n\",\n    \"\\n\",\n    \"sess_options = onnxruntime.SessionOptions()\\n\",\n    \"\\n\",\n    \"# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\\n\",\n    \"# Note that this will increase session creation time so enable it for debugging only.\\n\",\n    \"sess_options.optimized_model_filepath = os.path.join(output_dir, \\\"optimized_model_{}.onnx\\\".format(device_name))\\n\",\n    \"\\n\",\n    \"# Please change the value according to best setting in Performance Test Tool result.\\n\",\n    \"sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)\\n\",\n    \"\\n\",\n    \"session = onnxruntime.InferenceSession(export_model_path, sess_options)\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # TODO: use IO Binding (see https://github.com/microsoft/onnxruntime/pull/4206) to improve performance.\\n\",\n    \"    ort_inputs = {\\n\",\n    \"        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    ort_outputs = session.run(None, ort_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"    \\n\",\n    \"print(\\\"OnnxRuntime {} Inference time = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare the output of PyTorch and ONNX Runtime. We can see some results are not close. It is because ONNX Runtime uses some approximation in CUDA optimization. 
Based on our evaluation on SQuAD data set, F1 score is on par for models before and after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Verifying correctness *****\\n\",\n      \"PyTorch and ONNX Runtime output 0 are close: True\\n\",\n      \"maximum_diff=9.499490261077881e-07 average_diff=1.4225952327251434e-07\\n\",\n      \"PyTorch and ONNX Runtime output 1 are close: True\\n\",\n      \"maximum_diff=6.92903995513916e-07 average_diff=1.2441887520253658e-07\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Verifying correctness *****\\\")\\n\",\n    \"for i in range(2):    \\n\",\n    \"    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))\\n\",\n    \"    diff = ort_outputs[i] - outputs[i].cpu().numpy()\\n\",\n    \"    max_diff = numpy.max(numpy.abs(diff))\\n\",\n    \"    avg_diff = numpy.average(numpy.abs(diff))\\n\",\n    \"    print(f'maximum_diff={max_diff} average_diff={avg_diff}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Inference with Actual Sequence Length\\n\",\n    \"Note that ONNX model is exported using dynamic length axis. It is recommended to use actual sequence input without padding instead of fixed length input for best performance. Let's see how it can be applied to this model.\\n\",\n    \"\\n\",\n    \"From an example input below, we can see zero padding at the end of each sequence.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'input_ids': tensor([[  101,  1293,  1242,  2557,  1127,  1226,  1104,  1103,  3613, 16429,\\n\",\n       \"           5235,   136,   102,  3613, 16429,  5988,   170,   107,  1353,  1671,\\n\",\n       \"           1992,  1342,   107,  5235,   117,  1107,  1134,  1473,  3683,  3538,\\n\",\n       \"           1125,   170,  1476,   118,  1248,  2595,  4086,  1714,  1104,  2965,\\n\",\n       \"          15897,  1104,  3613, 16429,   119,  1473,  3683,  3538,  3222,  1149,\\n\",\n       \"           2551,  1168, 23759,  1116,  1121,  1506,  1103, 10280,  2231,  1111,\\n\",\n       \"           1103,  1714, 16355,   119,   102,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0]],\\n\",\n       \"        device='cuda:0'),\\n\",\n       \" 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),\\n\",\n       \" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# An example input (we can see padding). From attention_mask, we can deduce the actual length.\\n\",\n    \"inputs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The original sequence length is 128. After removing paddings, the sequence length is reduced. Input with smaller sequence length need less computation, thus we can see there is improvement on inference latency. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Average length 101\\n\",\n      \"OnnxRuntime gpu Inference time with actual sequence length = 4.23 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import statistics\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"lengths = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.\\n\",\n    \"    actual_sequence_length = sum(data[1].numpy())\\n\",\n    \"    lengths.append(actual_sequence_length)\\n\",\n    \"    opt_inputs = {\\n\",\n    \"        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    opt_outputs = session.run(None, opt_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"print(\\\"Average length\\\", statistics.mean(lengths))\\n\",\n    \"print(\\\"OnnxRuntime {} Inference time with actual sequence length = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's compare the output and see whether the results are close.\\n\",\n    \"\\n\",\n    \"**Note**: Need end-to-end evaluation on performance and accuracy if you use this strategy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Comparing results with/without paddings *****\\n\",\n      \"Output 0 are close: True\\n\",\n      
\"Output 1 are close: True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Comparing results with/without paddings *****\\\")\\n\",\n    \"for i in range(2):\\n\",\n    \"    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Offline Optimization and Test Tools\\n\",\n    \"\\n\",\n    \"It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Transformer Optimizer\\n\",\n    \"\\n\",\n    \"Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\\n\",\n    \"* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \\n\",\n    \"* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\\n\",\n    \"* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\\n\",\n    \"\\n\",\n    \"We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\\n\",\n    \"\\n\",\n    \"In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\\n\",\n    \"\\n\",\n    \"It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\\n\",\n    \"\\n\",\n    \"Example Usage:\\n\",\n    \"```\\n\",\n    \"from onnxruntime_tools import optimizer\\n\",\n    \"optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\\n\",\n    \"optimized_model.save_model_to_file(optimized_model_path)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You can also use optimizer_cli like the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Float32 Model\\n\",\n    \"Let us optimize the ONNX model using the script. The first example will output model with float32 to store weights. 
This is the choice for most GPUs without Tensor Core.\\n\",\n    \"\\n\",\n    \"If your GPU (like V100 or T4) has Tensor Core, jump to [Float16 Model](#6.-Model-Optimization-with-Float16) section since that will give you better performance than Float32 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp32_model_path\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Optimized Graph\\n\",\n    \"We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.\\n\",\n    \"\\n\",\n    \"The graph is like the following:\\n\",\n    \"<img src='images/optimized_bert_gpu.png'>\\n\",\n    \"\\n\",\n    \"Sometime, optimized graph is slightly different. 
For example, FastGelu is replaced by BiasGelu for CPU inference; When the option --input_int32 is used, Cast nodes for inputs are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import netron\\n\",\n    \"\\n\",\n    \"# change it to True if want to view the optimized model in browser\\n\",\n    \"enable_netron = False\\n\",\n    \"if enable_netron:\\n\",\n    \"    # If you encounter error \\\"access a socket in a way forbidden by its access permissions\\\", install Netron as standalone application instead.\\n\",\n    \"    netron.start(optimized_fp32_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Performance Test Tool\\n\",\n    \"\\n\",\n    \"The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.\\n\",\n    \"\\n\",\n    \"Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.\\n\",\n    \"\\n\",\n    \"**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Attional Info](#7.-Additional-Info) for more info.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.92 ms, Throughput = 203.24 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.90 ms, Throughput = 203.88 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 5.07 ms, Throughput = 197.16 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.82 ms, Throughput = 207.33 QPS\\n\",\n      \"skip duplicated test: 
model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.93 ms, Throughput = 202.92 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.91 ms, Throughput = 203.55 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.88 ms, Throughput = 204.90 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's load the summary file and take a look. 
Note that blank value in OMP_NUM_THREADS or OMP_WAIT_POLICY means the environment variable does not exist.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.82</td>\\n\",\n       \"      <td>4.53</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>5.15</td>\\n\",\n       \"      <td>7.25</td>\\n\",\n       \"      <td>8.75</td>\\n\",\n       \"      <td>207.33</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.88</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.58</td>\\n\",\n       \"      <td>6.47</td>\\n\",\n       \"      <td>7.13</td>\\n\",\n       \"      <td>8.68</td>\\n\",\n       \"      <td>204.90</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.90</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>6.16</td>\\n\",\n       \"      <td>7.64</td>\\n\",\n       \"      <td>8.82</td>\\n\",\n       \"      <td>203.88</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>4.91</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.70</td>\\n\",\n       \"      <td>7.43</td>\\n\",\n       \"      <td>8.78</td>\\n\",\n      
 \"      <td>203.55</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4.92</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>4.60</td>\\n\",\n       \"      <td>6.50</td>\\n\",\n       \"      <td>7.82</td>\\n\",\n       \"      <td>8.90</td>\\n\",\n       \"      <td>203.24</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>4.93</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.57</td>\\n\",\n       \"      <td>8.80</td>\\n\",\n       \"      <td>202.92</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>5.07</td>\\n\",\n       \"      <td>4.56</td>\\n\",\n       \"      <td>4.61</td>\\n\",\n       \"      <td>7.19</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>9.01</td>\\n\",\n       \"      <td>197.16</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         4.82         4.53         4.57         5.15         7.25   \\n\",\n       \"1         4.88         4.54         4.58         6.47         7.13   \\n\",\n       \"2         4.90         4.54         4.57         6.16         7.64   \\n\",\n       \"3         4.91         4.55         4.59         6.70         7.43   \\n\",\n       \"4         4.92         4.57         4.60         6.50         7.82   \\n\",\n       \"5         4.93         4.55         4.59         6.66         7.57   \\n\",\n       \"6         5.07         4.56         4.61         7.19         8.11   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         8.75           207.33                     1              12   \\n\",\n       \"1         8.68           204.90                    12              12   \\n\",\n       \"2         8.82           203.88                     1              12   \\n\",\n       \"3         8.78           203.55                    12              12   \\n\",\n       \"4         8.90           203.24                     0                   \\n\",\n       \"5         8.80           202.92                    12               1   \\n\",\n       \"6         9.01           197.16                    12               1   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1         PASSIVE       None    True  \\n\",\n       \"2         PASSIVE       None    True  \\n\",\n       \"3     
     ACTIVE       None    True  \\n\",\n       \"4                       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6          ACTIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"From above result, we can see that latency is very close for different settings. The default setting (intra_op_num_threads=0, OMP_NUM_THREADS and OMP_WAIT_POLICY does not exist) performs the best. \\n\",\n    \"\\n\",\n    \"### Model Results Comparison Tool\\n\",\n    \"\\n\",\n    \"When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare the inference outputs of the original and optimized models. If outputs are all close, it is safe to use the optimized model.\\n\",\n    \"\\n\",\n    \"For GPU inference, the absolute or relative difference is larger than those numbers of CPU inference. Note that slight difference in output will not impact final result. We did end-to-end evaluation using SQuAD data set using a fine-tuned squad model, and F1 score is almost the same before/after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).\\r\\n\",\n      \"maximum absolute difference=1.9222497940063477e-06\\r\\n\",\n      \"maximum relative difference=0.05027933046221733\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!python -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Model Optimization with Float16\\n\",\n    \"\\n\",\n    \"The optimizer.py script have an option **--float16** to convert model to use float16 to store weights. After the conversion, it could be faster to run in GPU with tensor cores like V100 or T4.\\n\",\n    \"\\n\",\n    \"Let's run tools to measure the performance on V100. 
The results show significant performance improvement: latency is about 3.4 ms for float32 model, and 1.8 ms for float16 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp16_model_path --float16\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.90 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.12 ms, Throughput = 320.00 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.02 ms, Throughput = 
331.39 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 332.53 QPS\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 328.67 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.72 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 329.32 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      
<th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>5.08</td>\\n\",\n       \"      <td>7.16</td>\\n\",\n       \"      <td>332.53</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.88</td>\\n\",\n       \"      <td>4.52</td>\\n\",\n       \"      <td>7.05</td>\\n\",\n       \"      <td>331.90</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.78</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>5.01</td>\\n\",\n       \"      <td>7.02</td>\\n\",\n       \"      <td>331.72</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3.02</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.85</td>\\n\",\n       \"      <td>6.34</td>\\n\",\n       \"      <td>7.04</td>\\n\",\n       \"      <td>331.39</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.93</td>\\n\",\n       \"      <td>5.56</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>329.32</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>6.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>328.67</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>6</th>\\n\",\n       \"      <td>3.12</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.96</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.20</td>\\n\",\n       \"      <td>320.00</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.01         2.79         2.81         2.86         5.08   \\n\",\n       \"1         3.01         2.80         2.81         2.88         4.52   \\n\",\n       \"2         3.01         2.78         2.80         2.92         5.01   \\n\",\n       \"3         3.02         2.79         2.80         2.85         6.34   \\n\",\n       \"4         3.04         2.80         2.82         2.93         5.56   \\n\",\n       \"5         3.04         2.79         2.81         2.92         6.37   \\n\",\n       \"6         3.12         2.79         2.82         2.96         6.66   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         7.16           332.53                     1              12   \\n\",\n       \"1         7.05           331.90                     0                   \\n\",\n       \"2         7.02           331.72                    12              12   \\n\",\n       \"3         7.04           331.39                    12               1   \\n\",\n       \"4         7.08           329.32                    12              12   \\n\",\n       \"5         7.08           328.67                    12               1   \\n\",\n       \"6         7.20           320.00                     1              12   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1                       None    True  \\n\",\n       \"2          ACTIVE       None    True  \\n\",\n       \"3          ACTIVE       None    True  \\n\",\n       \"4         PASSIVE       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6         PASSIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Throughput Tuning\\n\",\n    \"\\n\",\n    \"Some application need best throughput under some constraint on latency. 
This can be done by testing performance of different batch sizes. The tool could help on this.\\n\",\n    \"\\n\",\n    \"Here is an example that check the performance of multiple batch sizes (1, 2, 4, 8, 16, 32 and 64) using default settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=32, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=32 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 16.17 ms, Throughput = 1979.41 QPS\\n\",\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.00 ms, Throughput = 333.83 QPS\\n\",\n      \"test setting TestSetting(batch_size=2, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=2 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.59 ms, Throughput = 557.32 QPS\\n\",\n      \"test setting TestSetting(batch_size=64, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=64 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=64,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 29.26 ms, Throughput = 2187.15 QPS\\n\",\n      \"test setting TestSetting(batch_size=4, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      
\"Generating 1000 samples for batch_size=4 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=4,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.32 ms, Throughput = 926.92 QPS\\n\",\n      \"test setting TestSetting(batch_size=8, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=8 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=8,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 6.32 ms, Throughput = 1266.63 QPS\\n\",\n      \"test setting TestSetting(batch_size=16, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=16 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=16,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 9.60 ms, Throughput = 1666.05 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 64 --sequence_length 128 --samples 1000 --test_times 1 --inclusive $THREAD_SETTING $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float16 model summary from ./onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      
<th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>batch_size</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.00</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>4.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>333.83</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.59</td>\\n\",\n       \"      <td>3.33</td>\\n\",\n       \"      <td>3.35</td>\\n\",\n       \"      <td>3.42</td>\\n\",\n       \"      <td>6.60</td>\\n\",\n       \"      <td>7.54</td>\\n\",\n       \"      <td>557.32</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.32</td>\\n\",\n       \"      <td>3.98</td>\\n\",\n       \"      <td>4.01</td>\\n\",\n       \"      <td>4.64</td>\\n\",\n       \"      <td>7.23</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>926.92</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>6.32</td>\\n\",\n       \"      <td>5.94</td>\\n\",\n       \"      <td>5.97</td>\\n\",\n       \"      <td>7.61</td>\\n\",\n       \"      <td>8.96</td>\\n\",\n       \"      <td>10.12</td>\\n\",\n       \"      <td>1266.63</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>9.60</td>\\n\",\n       \"      <td>9.22</td>\\n\",\n       \"      <td>9.25</td>\\n\",\n       \"      <td>11.32</td>\\n\",\n       \"      <td>12.33</td>\\n\",\n       \"      <td>13.34</td>\\n\",\n       \"      <td>1666.05</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>16.17</td>\\n\",\n       \"      <td>15.80</td>\\n\",\n       \"      <td>15.90</td>\\n\",\n       \"      <td>17.38</td>\\n\",\n       \"      <td>18.80</td>\\n\",\n       \"      <td>19.93</td>\\n\",\n       \"      <td>1979.41</td>\\n\",\n       \"      <td>32</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>29.26</td>\\n\",\n       \"      <td>28.89</td>\\n\",\n       \"      <td>29.01</td>\\n\",\n       \"      <td>30.63</td>\\n\",\n       \"      <td>32.53</td>\\n\",\n       \"      <td>33.28</td>\\n\",\n       \"      <td>2187.15</td>\\n\",\n       \"      <td>64</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.00         2.79         2.81         2.86         4.37   \\n\",\n       \"1         3.59         3.33         3.35         3.42         6.60   \\n\",\n       \"2         4.32         3.98         4.01         4.64         7.23   \\n\",\n       \"3         6.32         5.94         5.97         7.61         8.96   \\n\",\n       \"4         9.60         9.22         9.25        11.32        
12.33   \\n\",\n       \"5        16.17        15.80        15.90        17.38        18.80   \\n\",\n       \"6        29.26        28.89        29.01        30.63        32.53   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  batch_size  \\n\",\n       \"0         7.08           333.83           1  \\n\",\n       \"1         7.54           557.32           2  \\n\",\n       \"2         8.11           926.92           4  \\n\",\n       \"3        10.12          1266.63           8  \\n\",\n       \"4        13.34          1666.05          16  \\n\",\n       \"5        19.93          1979.41          32  \\n\",\n       \"6        33.28          2187.15          64  \"\n      ]\n     },\n     \"execution_count\": 26,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float16 model summary from\\\", latest_result_file)\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'test_cases', 'test_times', 'use_gpu', 'warmup', 'sequence_length']\\n\",\n    \"columns_to_remove.extend(['intra_op_num_threads', 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', 'contiguous'])\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Additional Info\\n\",\n    \"\\n\",\n    \"Note that running Jupyter Notebook has significant impact on performance result. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate performance numbers.\\n\",\n    \"\\n\",\n    \"We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it measure inference speed of OnnxRuntime.\\n\",\n    \"\\n\",\n    \"[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\\n\",\n    \"\\n\",\n    \"Here is the machine configuration that generated the above results. 
You might get slower or faster result according to your hardware.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\r\\n\",\n      \"  \\\"gpu\\\": {\\r\\n\",\n      \"    \\\"driver_version\\\": \\\"440.64.00\\\",\\r\\n\",\n      \"    \\\"devices\\\": [\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 14110883840,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      },\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 16932601856,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      }\\r\\n\",\n      \"    ]\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"cpu\\\": {\\r\\n\",\n      \"    \\\"brand\\\": \\\"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\\\",\\r\\n\",\n      \"    \\\"cores\\\": 12,\\r\\n\",\n      \"    \\\"logical_cores\\\": 12,\\r\\n\",\n      \"    \\\"hz\\\": \\\"2.5940 GHz\\\",\\r\\n\",\n      \"    \\\"l2_cache\\\": \\\"256 KB\\\",\\r\\n\",\n      \"    \\\"l3_cache\\\": \\\"35840 KB\\\",\\r\\n\",\n      \"    \\\"processor\\\": \\\"x86_64\\\"\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"memory\\\": {\\r\\n\",\n      \"    \\\"total\\\": 236645588992,\\r\\n\",\n      \"    \\\"available\\\": 222567559168\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"python\\\": \\\"3.7.7.final.0 (64 bit)\\\",\\r\\n\",\n      \"  \\\"os\\\": \\\"Linux-4.15.0-1089-azure-x86_64-with-debian-stretch-sid\\\",\\r\\n\",\n      \"  \\\"onnxruntime\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.3.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"pytorch\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.5.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"tensorflow\\\": null\\r\\n\",\n      \"}\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"PyCharm (ccks_ner-master)\",\n   \"language\": \"python\",\n   \"name\": \"pycharm-de4c0941\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
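The notebook above sweeps batch sizes 1–64 for the exported float16 ONNX model and reports latency percentiles plus throughput (QPS). As a rough illustration of how such numbers can be reproduced outside Jupyter, here is a minimal, hedged sketch using `onnxruntime` directly; the model path, input names, and shapes are placeholders that depend on how the model was actually exported, not the competition's real artifacts.

```python
# Minimal latency/throughput sketch (assumptions: model path, input names and
# shapes are placeholders; adjust them to the actual ONNX export).
import time
import numpy as np
import onnxruntime as ort

MODEL_PATH = "./onnx/model_fp16.onnx"   # hypothetical path
BATCH_SIZE, SEQ_LEN, RUNS = 16, 64, 100

session = ort.InferenceSession(MODEL_PATH)
inputs = {
    "input_ids": np.random.randint(1, 1000, (BATCH_SIZE, SEQ_LEN), dtype=np.int64),
    "attention_mask": np.ones((BATCH_SIZE, SEQ_LEN), dtype=np.int64),
    "token_type_ids": np.zeros((BATCH_SIZE, SEQ_LEN), dtype=np.int64),
}

latencies_ms = []
for _ in range(RUNS):
    start = time.perf_counter()
    session.run(None, inputs)                          # one full forward pass
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
qps = BATCH_SIZE * RUNS / (sum(latencies_ms) / 1000)   # queries per second
print(f"P50={p50:.2f}ms P90={p90:.2f}ms P99={p99:.2f}ms QPS={qps:.1f}")
```

As the notebook's "Additional Info" cell points out, measurements taken inside Jupyter are noisy; running a script like this from a console, or using the official `run_benchmark.sh`, gives more reliable numbers.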
  {
    "path": "code/bert-base-count3-len100/finetuning/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
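`Config.py` is a simple registry: each fine-tuning model name is mapped to the backbone, tokenizer, and config class it is built on, so the training code can pick all three from a single string. Below is a hedged usage sketch, assuming the module is imported from the `finetuning` directory and that a local checkpoint directory exists; the directory path and the example query pair are placeholders.

```python
# Hedged usage sketch for the MODELS / TOKENIZERS / CONFIGS registries above.
from Config import MODELS, TOKENIZERS, CONFIGS

model_type = "BertForClass"                 # any key present in all three registries
pretrained_dir = "./pretrained/bert-base"   # hypothetical local checkpoint directory

config = CONFIGS[model_type].from_pretrained(pretrained_dir)
tokenizer = TOKENIZERS[model_type].from_pretrained(pretrained_dir)
backbone = MODELS[model_type].from_pretrained(pretrained_dir, config=config)

# Encode one (placeholder) query pair as [CLS] query_a [SEP] query_b [SEP]
encoded = tokenizer("query a", "query b", return_tensors="pt")
outputs = backbone(**encoded)
print(outputs[0].shape)  # (batch_size, seq_len, hidden_size)
```

Keeping the three dictionaries keyed by the same names is what makes it cheap to swap backbones (Bert, NEZHA, XLNet, Electra) behind the various classification heads.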
  {
    "path": "code/bert-base-count3-len100/finetuning/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
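`NeZhaConfig` keeps the standard `PretrainedConfig` interface but adds `use_relative_position` and `max_relative_position`, which drive NEZHA's relative position encoding in the attention layers. A minimal construction sketch follows; the sizes and `vocab_size` below are illustrative BERT-base-like values, not necessarily the ones used in the competition.

```python
# Hedged sketch: building a base-sized NeZhaConfig.
# All numeric values here are placeholder assumptions.
from NEZHA.configuration_nezha import NeZhaConfig

config = NeZhaConfig(
    vocab_size=21128,            # placeholder; set to the task vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    use_relative_position=True,  # NEZHA-style relative position encoding
    max_relative_position=64,
)
print(config.to_json_string())
```

Note that the class defaults (and its docstring, which was copied from `AlbertConfig`) describe an ALBERT-like architecture, so base-sized values are normally passed explicitly when the config is created.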
  {
    "path": "code/bert-base-count3-len100/finetuning/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport logging\nimport torch\n\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_nezha import NeZhaConfig\nfrom transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward\nfrom transformers.modeling_utils import PreTrainedModel, prune_linear_layer\nfrom transformers.models.bert.modeling_bert import (\n    BertOutput,\n    BertPooler,\n    BertSelfOutput,\n    BertIntermediate,\n    BertOnlyMLMHead,\n    BertOnlyNSPHead,\n    BertPreTrainingHeads,\n    BERT_START_DOCSTRING,\n    BERT_INPUTS_DOCSTRING,\n)\n\nlogger = logging.getLogger(__name__)\n\n_CONFIG_FOR_DOC = \"NeZhaConfig\"\n_TOKENIZER_FOR_DOC = \"NeZhaTokenizer\"\n\nNEZHA_PRETRAINED_MODEL_ARCHIVE_LIST = []\nNEZHA_PRETRAINED_MODEL_ARCHIVE_MAP = {}\n\n\ndef load_tf_weights_in_nezha(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        # logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n                n in [\"adam_v\", \"adam_m\", \"lamb_m\", \"lamb_v\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\",\n                      \"global_step\", \"good_steps\", \"loss_scale\", 'bad_steps']\n                for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer 
= getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                    pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass NeZhaEmbeddings(nn.Module):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.use_relative_position = config.use_relative_position\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=127):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\n\nclass 
NeZhaSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=self.attention_head_size,\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n        relations_keys = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_keys.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = 
key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        relations_values = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_values)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass NeZhaAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = NeZhaSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = 
self.pruned_heads.union(heads)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass NeZhaLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = NeZhaAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = NeZhaAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([NeZhaLayer(config) for _ in range(config.num_hidden_layers)])\n\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass NeZhaPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n    config_class = NeZhaConfig\n    pretrained_model_archive_map = NEZHA_PRETRAINED_MODEL_ARCHIVE_MAP\n    load_tf_weights = load_tf_weights_in_nezha\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(NeZhaPreTrainedModel):\n    \"\"\"\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        self.embeddings = NeZhaEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n        self.pooler = BertPooler(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(\n            attention_mask, input_shape, self.device\n        )\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            
encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n                                                      1:\n                                                      ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForPreTraining(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n        # add hidden states and attention if they are here\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[2:]\n\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, seq_relationship_score, (hidden_states), 
(attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyMLMHead(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n            labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers import BertTokenizer, 
BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        masked_lm_labels = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass NeZhaForNextSentencePrediction(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        seq_relationship_scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.cls(pooled_output)\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n        
    loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForSequenceClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            position_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForMultipleChoice(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForTokenClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep 
active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForQuestionAnswering(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            position_ids=None,\n            start_positions=None,\n            end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the 
output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n"
  },
  {
    "path": "code/bert-base-count3-len100/finetuning/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        if self.isDropout:\n            output = self.dropout(pooler_output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][0]\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = hid_avg\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, 
self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = hid_avg\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        #self.bert_model = MODELS[config.model](config=self.bert_config)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = 
self.classifier(concat_out)\n        return logit\n\n\n"
  },
  {
    "path": "code/bert-base-count3-len100/finetuning/models/gitkeep",
    "content": ""
  },
  {
    "path": "code/bert-base-count3-len100/finetuning/multi_gpu_QA.py",
    "content": "from tqdm import tqdm, trange\nimport numpy as np\nimport pandas as pd\nimport logging\nimport torch\nimport random\nimport os\nfrom torch import nn, optim\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nfrom transformers.optimization import get_linear_schedule_with_warmup\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.metrics import mean_absolute_error, accuracy_score, f1_score, roc_auc_score\nfrom model import *\nfrom utils import *\nimport time\nimport logging\nlogging.basicConfig(level=logging.DEBUG, filename=\"train.log\",filemode='a')\n\n\nfrom NEZHA.modeling_nezha import *\n\nMODEL_CLASSES = {\n    'BertForClass': BertForClass,\n    'BertLastCls': BertLastCls,\n    'BertLastTwoCls': BertLastTwoCls,\n    'BertLastTwoClsPooler': BertLastTwoClsPooler,\n    'BertLastTwoEmbeddings': BertLastTwoEmbeddings,\n    'BertLastTwoEmbeddingsPooler': BertLastTwoEmbeddingsPooler,\n    'BertLastFourCls': BertLastFourCls,\n    'BertLastFourClsPooler': BertLastFourClsPooler,\n    'BertLastFourEmbeddings': BertLastFourEmbeddings,\n    'BertLastFourEmbeddingsPooler': BertLastFourEmbeddingsPooler,\n    'BertDynCls': BertDynCls,\n    'BertDynEmbeddings': BertDynEmbeddings,\n    'BertRNN': BertRNN,\n    'BertCNN': BertCNN,\n    'BertRCNN': BertRCNN,\n    'XLNet': XLNet,\n    'Electra': Electra,\n    'NEZHA': NEZHA,\n\n}\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"BertLastCls\"\n        self.Stratification = False\n        self.model_path = '../../bert-base-count3/pretrain/bert_model/'\n\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 100\n        self.epoch = 3\n        self.learn_rate = 4e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 32\n        self.k_fold = 10\n        self.seed = 42\n\n        self.device = torch.device('cuda')\n        # self.device = torch.device('cpu')\n\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n\nconfig = Config()\nos.environ['PYTHONHASHSEED']='0'#消除hash算法的随机性\nrandom.seed(config.seed)\nnp.random.seed(config.seed)\ntorch.manual_seed(config.seed)\ntorch.cuda.manual_seed_all(config.seed)\n\n\nfile_path = './log/'\n# 创建一个logger\nlogger = logging.getLogger('mylogger')\nlogger.setLevel(logging.DEBUG)\n\n\ntrain = pd.read_csv('/tcdata/gaiic_track3_round1_train_20210228.tsv',sep='\\t',header=None)\nsemi = pd.read_csv('/tcdata/gaiic_track3_round2_train_20210407.tsv',sep='\\t',header=None)\ntrain = pd.concat([train, semi], sort=False)\ntrain.columns=['q1','q2','label']\n\n\ntrain_query1 = train['q1'].values.astype(str)\ntrain_query2 = train['q2'].values.astype(str)\ntrain_label = train['label'].values.astype(int)\n\n\noof_train = np.zeros((len(train), config.num_class), dtype=np.float32)\n\n\n#kf = StratifiedKFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\nkf = KFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\n\nfor fold, (train_index, valid_index) in enumerate(kf.split(train_query1, train_label)):\n\n    print('\\n\\n------------fold:{}------------\\n'.format(fold))\n\n    '''\n    q1 = train_query1[train_index]\n    q2 = train_query2[train_index]\n    y = train_label[train_index]\n    '''\n    q1 = train_query1\n    q2 = train_query2\n    y = train_label\n\n\n    val_q1 = train_query1[valid_index]\n    val_q2 = 
train_query2[valid_index]\n    val_y = train_label[valid_index]\n\n    train_D = data_generator([q1, q2, y], config, shuffle=True)\n    val_D = data_generator([val_q1, val_q2, val_y], config)\n\n    model = MODEL_CLASSES[config.model](config).to(config.device)\n\n    if torch.cuda.device_count() > 1:\n        print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n        model = torch.nn.DataParallel(model)\n\n\n    if config.pgd:\n        pgd = PGD(model)\n        K = 3\n\n    elif config.fgm:\n        fgm = FGM(model)\n\n    if config.focalloss:\n        loss_fn = FocalLoss(config.num_class)\n    else:\n        loss_fn = nn.CrossEntropyLoss()  # BCEWithLogitsLoss就是把Sigmoid-BCELoss合成一步\n\n\n    num_train_steps = int(len(train) / config.batch_size * config.epoch)\n    param_optimizer = list(model.named_parameters())\n\n    no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n\n    if config.Stratification:\n        bert_params = [x for x in param_optimizer if 'bert' in x[0]]\n        normal_params = [p for n, p in param_optimizer if 'bert' not in n]\n        optimizer_parameters = [\n            {'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n            {'params': normal_params, 'lr': config.normal_lr},\n        ]\n    else:\n        optimizer_parameters = [\n            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n        ]\n\n    optimizer = AdamW(optimizer_parameters, lr=config.learn_rate) # lr为全局学习率\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer,\n        num_warmup_steps=int(len(train) / config.batch_size / 2),\n        num_training_steps=num_train_steps\n    )\n\n    best_auc = 0\n    PATH = './models/bert_{}.pth'.format(fold)\n    save_model_path = './models/'\n    if not os.path.exists(save_model_path):\n        os.makedirs(save_model_path)\n\n    for e in range(config.epoch):\n        print('\\n------------epoch:{}------------'.format(e))\n        model.train()\n        acc = 0\n        train_len = 0\n        loss_num = 0\n        tq = tqdm(train_D,ncols=70,disable=True)\n        last=time.time()\n        for input_ids, input_masks, segment_ids, labels in tq:\n            label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n            y_pred = model(input_ids, input_masks, segment_ids)\n\n            loss = loss_fn(y_pred, label_t)\n            loss = loss.mean()\n            loss.backward()\n\n            if config.pgd:\n                pgd.backup_grad()\n                # 对抗训练\n                for t in range(K):\n                    pgd.attack(is_first_attack=(t == 0))  # 在embedding上添加对抗扰动, first attack时备份param.data\n                    if t != K - 1:\n                        model.zero_grad()\n                    else:\n                        pgd.restore_grad()\n                    y_pred = model(input_ids, input_masks, segment_ids)\n\n                    loss_adv = loss_fn(y_pred, label_t)\n                    loss_adv = loss_adv.mean()\n                    loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                pgd.restore()  # 恢复embedding参数\n\n            elif config.fgm:\n                # 对抗训练\n                fgm.attack()  # 在embedding上添加对抗扰动\n                y_pred = 
model(input_ids, input_masks, segment_ids)\n                loss_adv = loss_fn(y_pred, label_t)\n                loss_adv = loss_adv.mean()\n                loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                fgm.restore()  # 恢复embedding参数\n\n\n            # 梯度下降，更新参数\n            optimizer.step()\n            scheduler.step()  # Update learning rate schedule\n            model.zero_grad()\n\n            y_pred = np.argmax(y_pred.detach().to(\"cpu\").numpy(), axis=1)\n            acc += sum(y_pred == labels)\n            loss_num += loss.item()\n            train_len += len(labels)\n            tq.set_postfix(fold=fold, epoch=e, loss=loss_num / train_len, acc=acc / train_len)\n        print(f\"微调第{e}轮耗时：{time.time()-last}\")\n        model.eval()\n        with torch.no_grad():\n            y_p = []\n            y_l = []\n            train_logit = None\n            for input_ids, input_masks, segment_ids, labels in tqdm(val_D,disable=True):\n                label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n                y_pred = model(input_ids, input_masks, segment_ids)\n                y_pred = F.softmax(y_pred)\n                y_pred = y_pred.detach().to(\"cpu\").numpy()\n                if train_logit is None:\n                    train_logit = y_pred\n                else:\n                    train_logit = np.vstack((train_logit, y_pred))\n\n                y_p += list(y_pred[:,1])\n\n                y_pred = np.argmax(y_pred, axis=1)\n                y_l += list(y_pred)\n\n\n            f1 = f1_score(val_y, y_l, average=\"macro\")\n            auc_score = roc_auc_score(val_y, y_p)\n            print(\"best_auc:{}  auc_score:{}  f1:{}\\n\".format(best_auc, auc_score, f1))\n            if auc_score >= best_auc:\n                best_auc = auc_score\n                oof_train[valid_index] = np.array(train_logit)\n                #torch.save(model.module.state_dict() if hasattr(model, \"module\") else model.state_dict(), PATH)\n                torch.save(model.module if hasattr(model, \"module\") else model, PATH)\n\n    optimizer.zero_grad()\n\n    del model\n    torch.cuda.empty_cache()\n\n    break\n\n"
  },
  {
    "path": "code/bert-base-count3-len100/finetuning/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\nclass data_generator:\n    def __init__(self, data, config, shuffle=False):\n        self.data = data\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n        self.steps = len(self.data[0]) // self.batch_size\n        if len(self.data[0]) % self.batch_size != 0:\n            self.steps += 1\n\n    def __len__(self):\n        return self.steps\n\n    def __iter__(self):\n        q1, q2, y = self.data\n        idxs = list(range(len(self.data[0])))\n        if self.shuffle:\n            np.random.shuffle(idxs)\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        for index, i in enumerate(idxs):\n\n            text = q1[i]\n            text_pair = q2[i]\n            '''\n            # text = self.tokenizer(text, text_pair, padding='max_length', truncation=True, max_length=self.max_length)\n            text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n            input_ids.append(text['input_ids'])\n            segment_ids.append(text['token_type_ids'])\n            input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n            yield input_ids, input_masks, segment_ids, labels\n            input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n            '''\n            tkRes = self.tokenizer(text, text_pair, max_length=self.max_length, truncation='longest_first',\n                                   return_attention_mask=False)\n            input_id = tkRes['input_ids']\n            segment_id = tkRes['token_type_ids']\n            assert len(segment_id) == len(input_id)\n            input_ids.append(input_id)\n            segment_ids.append(segment_id)\n            labels.append(y[i])\n\n            if len(input_ids) == self.batch_size or i == idxs[-1]:\n                input_ids = paddingList(input_ids, 0, returnTensor=True)  # 动态padding\n                segment_ids = paddingList(segment_ids, 0, returnTensor=True)\n           
     input_masks = (input_ids != 0)\n                yield input_ids, input_masks, segment_ids, labels\n                input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=0.3, alpha=0.1, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\n\nclass FGM():\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.25, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# 支持多分类和二分类\nclass FocalLoss(nn.Module):\n    \"\"\"\n    This is a implementation of Focal Loss with smooth label cross entropy supported which is proposed in\n    'Focal Loss for Dense Object Detection. 
(https://arxiv.org/abs/1708.02002)'\n    Focal_Loss= -1*alpha*(1-pt)^gamma*log(pt)\n    :param num_class:\n    :param alpha: (tensor) 3D or 4D the scalar factor for this criterion\n    :param gamma: (float,double) gamma > 0 reduces the relative loss\n    for well-classified examples (p>0.5) putting more\n    focus on hard misclassified example\n    :param smooth: (float,double) smooth value when cross entropy\n    :param balance_index: (int) balance class index,\n    should be specific when alpha is float\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n    def __init__(self, num_class, alpha=None, gamma=2,\n                smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Not support alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        # N = input.size(0)\n        # alpha = torch.ones(N, self.num_class)\n        # alpha = alpha * (1 - self.alpha)\n        # alpha = alpha.scatter_(1, target.long(), self.alpha)\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        alpha = alpha[idx]\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true,y_pred):\n    acc = sum(y_pred & y_true) / (sum(y_pred))\n    rec = sum(y_pred & y_true) / (sum(y_true))\n\n    return 2 * acc * rec /(acc + rec)"
  },
  {
    "path": "code/bert-base-count5/finetuning/.ipynb_checkpoints/PyTorch_Bert-Squad_OnnxRuntime_GPU-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright (c) Microsoft Corporation. All rights reserved.  \\n\",\n    \"Licensed under the MIT License.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Inference PyTorch Bert Model with ONNX Runtime on GPU\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime and NVIDIA GPU. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.\\n\",\n    \"\\n\",\n    \"This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Prerequisites ##\\n\",\n    \"It requires your machine to have a GPU, and a python environment with [PyTorch](https://pytorch.org/) installed before running this notebook.\\n\",\n    \"\\n\",\n    \"#### GPU Environment Setup using AnaConda\\n\",\n    \"\\n\",\n    \"First, we install [AnaConda](https://www.anaconda.com/distribution/) in a target machine and open an AnaConda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 1.5.0 and OnnxRuntime 1.3.0.\\n\",\n    \"\\n\",\n    \"```console\\n\",\n    \"conda create -n gpu_env python=3.7\\n\",\n    \"conda activate gpu_env\\n\",\n    \"conda install pytorch torchvision cudatoolkit=10.1 -c pytorch\\n\",\n    \"conda install -c anaconda ipykernel\\n\",\n    \"conda install -c conda-forge ipywidgets\\n\",\n    \"python -m ipykernel install --user --name=gpu_env_py37\\n\",\n    \"jupyter notebook\\n\",\n    \"```\\n\",\n    \"Finally, launch Jupyter Notebook and you can choose gpu_env_py37 as kernel to run this notebook.\\n\",\n    \"\\n\",\n    \"Onnxruntime-gpu need specified version of CUDA and cuDNN. You can find the corresponding version in [requirements](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements). 
If the version is different from above cudatoolkit version, you have to install them separately, and add their bin directories to PATH environment variable (See [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARNING: Skipping onnxruntime-gpu as it is not installed.\\u001b[0m\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import sys\\n\",\n    \"!{sys.executable} -m pip uninstall --quiet --yes onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade transformers\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxconverter_common\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools\\n\",\n    \"!{sys.executable} -m pip install --quiet wget netron pandas\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Load Pretrained Bert model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We begin by downloading the SQuAD data file and store them in the specified location. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"cache_dir = \\\"./squad\\\"\\n\",\n    \"if not os.path.exists(cache_dir):\\n\",\n    \"    os.makedirs(cache_dir)\\n\",\n    \"\\n\",\n    \"predict_file_url = \\\"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\\\"\\n\",\n    \"predict_file = os.path.join(cache_dir, \\\"dev-v1.1.json\\\")\\n\",\n    \"if not os.path.exists(predict_file):\\n\",\n    \"    import wget\\n\",\n    \"    print(\\\"Start downloading predict file.\\\")\\n\",\n    \"    wget.download(predict_file_url, predict_file)\\n\",\n    \"    print(\\\"Predict file downloaded.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's first define some constant variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Whether allow overwriting existing ONNX model and download the latest script from GitHub\\n\",\n    \"enable_overwrite = True\\n\",\n    \"\\n\",\n    \"# Total samples to inference, so that we can get average latency\\n\",\n    \"total_samples = 1000\\n\",\n    \"\\n\",\n    \"# ONNX opset version\\n\",\n    \"opset_version=11\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Specify some model configuration variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# For fine-tuned large model, the model name is \\\"bert-large-uncased-whole-word-masking-finetuned-squad\\\". Here we use bert-base for demo.\\n\",\n    \"model_name_or_path = \\\"bert-base-cased\\\"\\n\",\n    \"max_seq_length = 128\\n\",\n    \"doc_stride = 128\\n\",\n    \"max_query_length = 64\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start to load model from pretrained. This step could take a few minutes. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 48/48 [00:04<00:00, 11.28it/s]\\n\",\n      \"convert squad examples to features: 100%|██████████| 1000/1000 [00:09<00:00, 102.15it/s]\\n\",\n      \"add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 161306.98it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# The following code is adapted from HuggingFace transformers\\n\",\n    \"# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\\n\",\n    \"\\n\",\n    \"from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"\\n\",\n    \"# Load pretrained model and tokenizer\\n\",\n    \"config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\\n\",\n    \"tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\\n\",\n    \"model = model_class.from_pretrained(model_name_or_path,\\n\",\n    \"                                    from_tf=False,\\n\",\n    \"                                    config=config,\\n\",\n    \"                                    cache_dir=cache_dir)\\n\",\n    \"# load some examples\\n\",\n    \"from transformers.data.processors.squad import SquadV1Processor\\n\",\n    \"\\n\",\n    \"processor = SquadV1Processor()\\n\",\n    \"examples = processor.get_dev_examples(None, filename=predict_file)\\n\",\n    \"\\n\",\n    \"from transformers import squad_convert_examples_to_features\\n\",\n    \"features, dataset = squad_convert_examples_to_features( \\n\",\n    \"            examples=examples[:total_samples], # convert enough examples for this notebook\\n\",\n    \"            tokenizer=tokenizer,\\n\",\n    \"            max_seq_length=max_seq_length,\\n\",\n    \"            doc_stride=doc_stride,\\n\",\n    \"            max_query_length=max_query_length,\\n\",\n    \"            is_training=False,\\n\",\n    \"            return_dataset='pt'\\n\",\n    \"        )\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. 
Export the loaded model ##\\n\",\n    \"Once the model is loaded, we can export the loaded PyTorch model to ONNX.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Model exported at  ./onnx/bert-base-cased-squad_opset11.onnx\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"output_dir = \\\"./onnx\\\"\\n\",\n    \"if not os.path.exists(output_dir):\\n\",\n    \"    os.makedirs(output_dir)   \\n\",\n    \"export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))\\n\",\n    \"\\n\",\n    \"import torch\\n\",\n    \"use_gpu = torch.cuda.is_available()\\n\",\n    \"device = torch.device(\\\"cuda\\\" if use_gpu else \\\"cpu\\\")\\n\",\n    \"\\n\",\n    \"# Get the first example data to run the model and export it to ONNX\\n\",\n    \"data = dataset[0]\\n\",\n    \"inputs = {\\n\",\n    \"    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"# Set model to inference mode, which is required before exporting the model because some operators behave differently in \\n\",\n    \"# inference and training mode.\\n\",\n    \"model.eval()\\n\",\n    \"model.to(device)\\n\",\n    \"\\n\",\n    \"if enable_overwrite or not os.path.exists(export_model_path):\\n\",\n    \"    with torch.no_grad():\\n\",\n    \"        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\\n\",\n    \"        torch.onnx.export(model,                                            # model being run\\n\",\n    \"                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\\n\",\n    \"                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\\n\",\n    \"                          opset_version=opset_version,                      # the ONNX version to export the model to\\n\",\n    \"                          do_constant_folding=True,                         # whether to execute constant folding for optimization\\n\",\n    \"                          input_names=['input_ids',                         # the model's input names\\n\",\n    \"                                       'input_mask', \\n\",\n    \"                                       'segment_ids'],\\n\",\n    \"                          output_names=['start', 'end'],                    # the model's output names\\n\",\n    \"                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\\n\",\n    \"                                        'input_mask' : symbolic_names,\\n\",\n    \"                                        'segment_ids' : symbolic_names,\\n\",\n    \"                                        'start' : symbolic_names,\\n\",\n    \"                                        'end' : symbolic_names})\\n\",\n    \"        print(\\\"Model exported at \\\", export_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. 
PyTorch Inference ##\\n\",\n    \"Use PyTorch to evaluate an example input for comparison purpose.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"PyTorch cuda Inference time = 16.57 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\\n\",\n    \"latency = []\\n\",\n    \"with torch.no_grad():\\n\",\n    \"    for i in range(total_samples):\\n\",\n    \"        data = dataset[i]\\n\",\n    \"        inputs = {\\n\",\n    \"            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"        }\\n\",\n    \"        start = time.time()\\n\",\n    \"        outputs = model(**inputs)\\n\",\n    \"        latency.append(time.time() - start)\\n\",\n    \"print(\\\"PyTorch {} Inference time = {} ms\\\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Inference ONNX Model with ONNX Runtime ##\\n\",\n    \"\\n\",\n    \"### CUDA and cuDNN Path\\n\",\n    \"onnxruntime-gpu has dependency on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn):\\n\",\n    \"\\n\",\n    \"* [onnxruntime-gpu v1.3.0](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"* [onnxruntime-gpu v1.2.0](https://github.com/microsoft/onnxruntime/releases/tag/v1.2.0) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"\\n\",\n    \"During installing PyTorch 1.5, we installed cudatoolkit 10.1.243 in this conda environment. That shall be good for onnxruntime-gpu 1.3.0 in Jupyter Notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Change to True when onnxruntime (like onnxruntime-gpu 1.0.0 ~ 1.1.2) cannot be imported.\\n\",\n    \"add_cuda_path = False\\n\",\n    \"\\n\",\n    \"if add_cuda_path:\\n\",\n    \"    # Add path of CUDA 10.0 and CUDNN 7.6 for onnxruntime-gpu 1.0.0 ~ 1.1.2\\n\",\n    \"    cuda_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    cudnn_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):\\n\",\n    \"        raise ValueError(\\\"Please specify correct path for CUDA and cuDNN. Otherwise onnxruntime cannot be imported.\\\")\\n\",\n    \"    else:\\n\",\n    \"        if cuda_dir == cudnn_dir:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + os.environ[\\\"PATH\\\"]\\n\",\n    \"        else:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + cudnn_dir + ';' + os.environ[\\\"PATH\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### OpenMP Environment Variable\\n\",\n    \"\\n\",\n    \"OpenMP environment variables are optional for GPU inference of standard Bert model. It has little performance impact on Bert model since most nodes are executed in GPU. 
\\n\",\n    \"\\n\",\n    \"You can find the best setting based on [Performance Test Tool](#Performance-Test-Tool) result in later part of this notebook.\\n\",\n    \"\\n\",\n    \"**Attention: Setting environment variables shall be done before importing onnxruntime**. Otherwise, they might not take effect.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Optional. You can change them according to Performance Test Tool result.\\n\",\n    \"#os.environ[\\\"OMP_NUM_THREADS\\\"] = '1'\\n\",\n    \"#os.environ[\\\"OMP_WAIT_POLICY\\\"] = 'PASSIVE'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are ready to inference the model with ONNX Runtime.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"OnnxRuntime gpu Inference time = 4.43 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import psutil\\n\",\n    \"import onnxruntime\\n\",\n    \"import numpy\\n\",\n    \"\\n\",\n    \"assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\\n\",\n    \"device_name = 'gpu'\\n\",\n    \"\\n\",\n    \"sess_options = onnxruntime.SessionOptions()\\n\",\n    \"\\n\",\n    \"# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\\n\",\n    \"# Note that this will increase session creation time so enable it for debugging only.\\n\",\n    \"sess_options.optimized_model_filepath = os.path.join(output_dir, \\\"optimized_model_{}.onnx\\\".format(device_name))\\n\",\n    \"\\n\",\n    \"# Please change the value according to best setting in Performance Test Tool result.\\n\",\n    \"sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)\\n\",\n    \"\\n\",\n    \"session = onnxruntime.InferenceSession(export_model_path, sess_options)\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # TODO: use IO Binding (see https://github.com/microsoft/onnxruntime/pull/4206) to improve performance.\\n\",\n    \"    ort_inputs = {\\n\",\n    \"        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    ort_outputs = session.run(None, ort_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"    \\n\",\n    \"print(\\\"OnnxRuntime {} Inference time = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare the output of PyTorch and ONNX Runtime. We can see some results are not close. It is because ONNX Runtime uses some approximation in CUDA optimization. 
Based on our evaluation on SQuAD data set, F1 score is on par for models before and after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Verifying correctness *****\\n\",\n      \"PyTorch and ONNX Runtime output 0 are close: True\\n\",\n      \"maximum_diff=9.499490261077881e-07 average_diff=1.4225952327251434e-07\\n\",\n      \"PyTorch and ONNX Runtime output 1 are close: True\\n\",\n      \"maximum_diff=6.92903995513916e-07 average_diff=1.2441887520253658e-07\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Verifying correctness *****\\\")\\n\",\n    \"for i in range(2):    \\n\",\n    \"    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))\\n\",\n    \"    diff = ort_outputs[i] - outputs[i].cpu().numpy()\\n\",\n    \"    max_diff = numpy.max(numpy.abs(diff))\\n\",\n    \"    avg_diff = numpy.average(numpy.abs(diff))\\n\",\n    \"    print(f'maximum_diff={max_diff} average_diff={avg_diff}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Inference with Actual Sequence Length\\n\",\n    \"Note that ONNX model is exported using dynamic length axis. It is recommended to use actual sequence input without padding instead of fixed length input for best performance. Let's see how it can be applied to this model.\\n\",\n    \"\\n\",\n    \"From an example input below, we can see zero padding at the end of each sequence.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'input_ids': tensor([[  101,  1293,  1242,  2557,  1127,  1226,  1104,  1103,  3613, 16429,\\n\",\n       \"           5235,   136,   102,  3613, 16429,  5988,   170,   107,  1353,  1671,\\n\",\n       \"           1992,  1342,   107,  5235,   117,  1107,  1134,  1473,  3683,  3538,\\n\",\n       \"           1125,   170,  1476,   118,  1248,  2595,  4086,  1714,  1104,  2965,\\n\",\n       \"          15897,  1104,  3613, 16429,   119,  1473,  3683,  3538,  3222,  1149,\\n\",\n       \"           2551,  1168, 23759,  1116,  1121,  1506,  1103, 10280,  2231,  1111,\\n\",\n       \"           1103,  1714, 16355,   119,   102,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0]],\\n\",\n       \"        device='cuda:0'),\\n\",\n       \" 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),\\n\",\n       \" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# An example input (we can see padding). From attention_mask, we can deduce the actual length.\\n\",\n    \"inputs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The original sequence length is 128. After removing paddings, the sequence length is reduced. Input with smaller sequence length need less computation, thus we can see there is improvement on inference latency. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Average length 101\\n\",\n      \"OnnxRuntime gpu Inference time with actual sequence length = 4.23 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import statistics\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"lengths = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.\\n\",\n    \"    actual_sequence_length = sum(data[1].numpy())\\n\",\n    \"    lengths.append(actual_sequence_length)\\n\",\n    \"    opt_inputs = {\\n\",\n    \"        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    opt_outputs = session.run(None, opt_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"print(\\\"Average length\\\", statistics.mean(lengths))\\n\",\n    \"print(\\\"OnnxRuntime {} Inference time with actual sequence length = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's compare the output and see whether the results are close.\\n\",\n    \"\\n\",\n    \"**Note**: Need end-to-end evaluation on performance and accuracy if you use this strategy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Comparing results with/without paddings *****\\n\",\n      \"Output 0 are close: True\\n\",\n      
\"Output 1 are close: True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Comparing results with/without paddings *****\\\")\\n\",\n    \"for i in range(2):\\n\",\n    \"    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Offline Optimization and Test Tools\\n\",\n    \"\\n\",\n    \"It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Transformer Optimizer\\n\",\n    \"\\n\",\n    \"Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\\n\",\n    \"* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \\n\",\n    \"* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\\n\",\n    \"* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\\n\",\n    \"\\n\",\n    \"We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\\n\",\n    \"\\n\",\n    \"In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\\n\",\n    \"\\n\",\n    \"It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\\n\",\n    \"\\n\",\n    \"Example Usage:\\n\",\n    \"```\\n\",\n    \"from onnxruntime_tools import optimizer\\n\",\n    \"optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\\n\",\n    \"optimized_model.save_model_to_file(optimized_model_path)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You can also use optimizer_cli like the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Float32 Model\\n\",\n    \"Let us optimize the ONNX model using the script. The first example will output model with float32 to store weights. 
This is the choice for most GPUs without Tensor Core.\\n\",\n    \"\\n\",\n    \"If your GPU (like V100 or T4) has Tensor Core, jump to [Float16 Model](#6.-Model-Optimization-with-Float16) section since that will give you better performance than Float32 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp32_model_path\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Optimized Graph\\n\",\n    \"We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.\\n\",\n    \"\\n\",\n    \"The graph is like the following:\\n\",\n    \"<img src='images/optimized_bert_gpu.png'>\\n\",\n    \"\\n\",\n    \"Sometime, optimized graph is slightly different. 
For example, FastGelu is replaced by BiasGelu for CPU inference; When the option --input_int32 is used, Cast nodes for inputs are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import netron\\n\",\n    \"\\n\",\n    \"# change it to True if want to view the optimized model in browser\\n\",\n    \"enable_netron = False\\n\",\n    \"if enable_netron:\\n\",\n    \"    # If you encounter error \\\"access a socket in a way forbidden by its access permissions\\\", install Netron as standalone application instead.\\n\",\n    \"    netron.start(optimized_fp32_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Performance Test Tool\\n\",\n    \"\\n\",\n    \"The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.\\n\",\n    \"\\n\",\n    \"Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.\\n\",\n    \"\\n\",\n    \"**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Attional Info](#7.-Additional-Info) for more info.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.92 ms, Throughput = 203.24 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.90 ms, Throughput = 203.88 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 5.07 ms, Throughput = 197.16 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.82 ms, Throughput = 207.33 QPS\\n\",\n      \"skip duplicated test: 
model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.93 ms, Throughput = 202.92 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.91 ms, Throughput = 203.55 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.88 ms, Throughput = 204.90 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's load the summary file and take a look. 
Note that blank value in OMP_NUM_THREADS or OMP_WAIT_POLICY means the environment variable does not exist.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.82</td>\\n\",\n       \"      <td>4.53</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>5.15</td>\\n\",\n       \"      <td>7.25</td>\\n\",\n       \"      <td>8.75</td>\\n\",\n       \"      <td>207.33</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.88</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.58</td>\\n\",\n       \"      <td>6.47</td>\\n\",\n       \"      <td>7.13</td>\\n\",\n       \"      <td>8.68</td>\\n\",\n       \"      <td>204.90</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.90</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>6.16</td>\\n\",\n       \"      <td>7.64</td>\\n\",\n       \"      <td>8.82</td>\\n\",\n       \"      <td>203.88</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>4.91</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.70</td>\\n\",\n       \"      <td>7.43</td>\\n\",\n       \"      <td>8.78</td>\\n\",\n      
 \"      <td>203.55</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4.92</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>4.60</td>\\n\",\n       \"      <td>6.50</td>\\n\",\n       \"      <td>7.82</td>\\n\",\n       \"      <td>8.90</td>\\n\",\n       \"      <td>203.24</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>4.93</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.57</td>\\n\",\n       \"      <td>8.80</td>\\n\",\n       \"      <td>202.92</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>5.07</td>\\n\",\n       \"      <td>4.56</td>\\n\",\n       \"      <td>4.61</td>\\n\",\n       \"      <td>7.19</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>9.01</td>\\n\",\n       \"      <td>197.16</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         4.82         4.53         4.57         5.15         7.25   \\n\",\n       \"1         4.88         4.54         4.58         6.47         7.13   \\n\",\n       \"2         4.90         4.54         4.57         6.16         7.64   \\n\",\n       \"3         4.91         4.55         4.59         6.70         7.43   \\n\",\n       \"4         4.92         4.57         4.60         6.50         7.82   \\n\",\n       \"5         4.93         4.55         4.59         6.66         7.57   \\n\",\n       \"6         5.07         4.56         4.61         7.19         8.11   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         8.75           207.33                     1              12   \\n\",\n       \"1         8.68           204.90                    12              12   \\n\",\n       \"2         8.82           203.88                     1              12   \\n\",\n       \"3         8.78           203.55                    12              12   \\n\",\n       \"4         8.90           203.24                     0                   \\n\",\n       \"5         8.80           202.92                    12               1   \\n\",\n       \"6         9.01           197.16                    12               1   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1         PASSIVE       None    True  \\n\",\n       \"2         PASSIVE       None    True  \\n\",\n       \"3     
     ACTIVE       None    True  \\n\",\n       \"4                       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6          ACTIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"From above result, we can see that latency is very close for different settings. The default setting (intra_op_num_threads=0, OMP_NUM_THREADS and OMP_WAIT_POLICY does not exist) performs the best. \\n\",\n    \"\\n\",\n    \"### Model Results Comparison Tool\\n\",\n    \"\\n\",\n    \"When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare the inference outputs of the original and optimized models. If outputs are all close, it is safe to use the optimized model.\\n\",\n    \"\\n\",\n    \"For GPU inference, the absolute or relative difference is larger than those numbers of CPU inference. Note that slight difference in output will not impact final result. We did end-to-end evaluation using SQuAD data set using a fine-tuned squad model, and F1 score is almost the same before/after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).\\r\\n\",\n      \"maximum absolute difference=1.9222497940063477e-06\\r\\n\",\n      \"maximum relative difference=0.05027933046221733\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!python -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Model Optimization with Float16\\n\",\n    \"\\n\",\n    \"The optimizer.py script have an option **--float16** to convert model to use float16 to store weights. After the conversion, it could be faster to run in GPU with tensor cores like V100 or T4.\\n\",\n    \"\\n\",\n    \"Let's run tools to measure the performance on V100. 
The results show significant performance improvement: latency is about 3.4 ms for float32 model, and 1.8 ms for float16 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp16_model_path --float16\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.90 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.12 ms, Throughput = 320.00 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.02 ms, Throughput = 
331.39 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 332.53 QPS\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 328.67 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.72 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 329.32 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      
<th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>5.08</td>\\n\",\n       \"      <td>7.16</td>\\n\",\n       \"      <td>332.53</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.88</td>\\n\",\n       \"      <td>4.52</td>\\n\",\n       \"      <td>7.05</td>\\n\",\n       \"      <td>331.90</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.78</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>5.01</td>\\n\",\n       \"      <td>7.02</td>\\n\",\n       \"      <td>331.72</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3.02</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.85</td>\\n\",\n       \"      <td>6.34</td>\\n\",\n       \"      <td>7.04</td>\\n\",\n       \"      <td>331.39</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.93</td>\\n\",\n       \"      <td>5.56</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>329.32</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>6.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>328.67</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>6</th>\\n\",\n       \"      <td>3.12</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.96</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.20</td>\\n\",\n       \"      <td>320.00</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.01         2.79         2.81         2.86         5.08   \\n\",\n       \"1         3.01         2.80         2.81         2.88         4.52   \\n\",\n       \"2         3.01         2.78         2.80         2.92         5.01   \\n\",\n       \"3         3.02         2.79         2.80         2.85         6.34   \\n\",\n       \"4         3.04         2.80         2.82         2.93         5.56   \\n\",\n       \"5         3.04         2.79         2.81         2.92         6.37   \\n\",\n       \"6         3.12         2.79         2.82         2.96         6.66   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         7.16           332.53                     1              12   \\n\",\n       \"1         7.05           331.90                     0                   \\n\",\n       \"2         7.02           331.72                    12              12   \\n\",\n       \"3         7.04           331.39                    12               1   \\n\",\n       \"4         7.08           329.32                    12              12   \\n\",\n       \"5         7.08           328.67                    12               1   \\n\",\n       \"6         7.20           320.00                     1              12   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1                       None    True  \\n\",\n       \"2          ACTIVE       None    True  \\n\",\n       \"3          ACTIVE       None    True  \\n\",\n       \"4         PASSIVE       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6         PASSIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Throughput Tuning\\n\",\n    \"\\n\",\n    \"Some application need best throughput under some constraint on latency. 
This can be done by testing performance of different batch sizes. The tool could help on this.\\n\",\n    \"\\n\",\n    \"Here is an example that check the performance of multiple batch sizes (1, 2, 4, 8, 16, 32 and 64) using default settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=32, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=32 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 16.17 ms, Throughput = 1979.41 QPS\\n\",\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.00 ms, Throughput = 333.83 QPS\\n\",\n      \"test setting TestSetting(batch_size=2, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=2 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.59 ms, Throughput = 557.32 QPS\\n\",\n      \"test setting TestSetting(batch_size=64, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=64 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=64,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 29.26 ms, Throughput = 2187.15 QPS\\n\",\n      \"test setting TestSetting(batch_size=4, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      
\"Generating 1000 samples for batch_size=4 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=4,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.32 ms, Throughput = 926.92 QPS\\n\",\n      \"test setting TestSetting(batch_size=8, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=8 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=8,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 6.32 ms, Throughput = 1266.63 QPS\\n\",\n      \"test setting TestSetting(batch_size=16, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=16 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=16,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 9.60 ms, Throughput = 1666.05 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 64 --sequence_length 128 --samples 1000 --test_times 1 --inclusive $THREAD_SETTING $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float16 model summary from ./onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      
<th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>batch_size</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.00</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>4.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>333.83</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.59</td>\\n\",\n       \"      <td>3.33</td>\\n\",\n       \"      <td>3.35</td>\\n\",\n       \"      <td>3.42</td>\\n\",\n       \"      <td>6.60</td>\\n\",\n       \"      <td>7.54</td>\\n\",\n       \"      <td>557.32</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.32</td>\\n\",\n       \"      <td>3.98</td>\\n\",\n       \"      <td>4.01</td>\\n\",\n       \"      <td>4.64</td>\\n\",\n       \"      <td>7.23</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>926.92</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>6.32</td>\\n\",\n       \"      <td>5.94</td>\\n\",\n       \"      <td>5.97</td>\\n\",\n       \"      <td>7.61</td>\\n\",\n       \"      <td>8.96</td>\\n\",\n       \"      <td>10.12</td>\\n\",\n       \"      <td>1266.63</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>9.60</td>\\n\",\n       \"      <td>9.22</td>\\n\",\n       \"      <td>9.25</td>\\n\",\n       \"      <td>11.32</td>\\n\",\n       \"      <td>12.33</td>\\n\",\n       \"      <td>13.34</td>\\n\",\n       \"      <td>1666.05</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>16.17</td>\\n\",\n       \"      <td>15.80</td>\\n\",\n       \"      <td>15.90</td>\\n\",\n       \"      <td>17.38</td>\\n\",\n       \"      <td>18.80</td>\\n\",\n       \"      <td>19.93</td>\\n\",\n       \"      <td>1979.41</td>\\n\",\n       \"      <td>32</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>29.26</td>\\n\",\n       \"      <td>28.89</td>\\n\",\n       \"      <td>29.01</td>\\n\",\n       \"      <td>30.63</td>\\n\",\n       \"      <td>32.53</td>\\n\",\n       \"      <td>33.28</td>\\n\",\n       \"      <td>2187.15</td>\\n\",\n       \"      <td>64</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.00         2.79         2.81         2.86         4.37   \\n\",\n       \"1         3.59         3.33         3.35         3.42         6.60   \\n\",\n       \"2         4.32         3.98         4.01         4.64         7.23   \\n\",\n       \"3         6.32         5.94         5.97         7.61         8.96   \\n\",\n       \"4         9.60         9.22         9.25        11.32        
12.33   \\n\",\n       \"5        16.17        15.80        15.90        17.38        18.80   \\n\",\n       \"6        29.26        28.89        29.01        30.63        32.53   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  batch_size  \\n\",\n       \"0         7.08           333.83           1  \\n\",\n       \"1         7.54           557.32           2  \\n\",\n       \"2         8.11           926.92           4  \\n\",\n       \"3        10.12          1266.63           8  \\n\",\n       \"4        13.34          1666.05          16  \\n\",\n       \"5        19.93          1979.41          32  \\n\",\n       \"6        33.28          2187.15          64  \"\n      ]\n     },\n     \"execution_count\": 26,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float16 model summary from\\\", latest_result_file)\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'test_cases', 'test_times', 'use_gpu', 'warmup', 'sequence_length']\\n\",\n    \"columns_to_remove.extend(['intra_op_num_threads', 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', 'contiguous'])\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Additional Info\\n\",\n    \"\\n\",\n    \"Note that running Jupyter Notebook has significant impact on performance result. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate performance numbers.\\n\",\n    \"\\n\",\n    \"We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it measure inference speed of OnnxRuntime.\\n\",\n    \"\\n\",\n    \"[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\\n\",\n    \"\\n\",\n    \"Here is the machine configuration that generated the above results. 
You might get slower or faster result according to your hardware.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\r\\n\",\n      \"  \\\"gpu\\\": {\\r\\n\",\n      \"    \\\"driver_version\\\": \\\"440.64.00\\\",\\r\\n\",\n      \"    \\\"devices\\\": [\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 14110883840,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      },\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 16932601856,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      }\\r\\n\",\n      \"    ]\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"cpu\\\": {\\r\\n\",\n      \"    \\\"brand\\\": \\\"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\\\",\\r\\n\",\n      \"    \\\"cores\\\": 12,\\r\\n\",\n      \"    \\\"logical_cores\\\": 12,\\r\\n\",\n      \"    \\\"hz\\\": \\\"2.5940 GHz\\\",\\r\\n\",\n      \"    \\\"l2_cache\\\": \\\"256 KB\\\",\\r\\n\",\n      \"    \\\"l3_cache\\\": \\\"35840 KB\\\",\\r\\n\",\n      \"    \\\"processor\\\": \\\"x86_64\\\"\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"memory\\\": {\\r\\n\",\n      \"    \\\"total\\\": 236645588992,\\r\\n\",\n      \"    \\\"available\\\": 222567559168\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"python\\\": \\\"3.7.7.final.0 (64 bit)\\\",\\r\\n\",\n      \"  \\\"os\\\": \\\"Linux-4.15.0-1089-azure-x86_64-with-debian-stretch-sid\\\",\\r\\n\",\n      \"  \\\"onnxruntime\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.3.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"pytorch\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.5.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"tensorflow\\\": null\\r\\n\",\n      \"}\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"PyCharm (ccks_ner-master)\",\n   \"language\": \"python\",\n   \"name\": \"pycharm-de4c0941\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
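The notebook above exercises the optimized FP16 model only through the `bert_perf_test` / `compare_bert_results` command-line tools. As a complement, here is a minimal smoke-test sketch (not part of the original notebook) that loads the same FP16 model with onnxruntime's Python API and runs it once; the all-zero int64 inputs and the assumption that `onnxruntime-gpu` is installed are mine, and the real input names/shapes are read from `session.get_inputs()` rather than hard-coded.

```python
# Minimal sketch (not from the original notebook): run the optimized FP16 ONNX model once.
# All-zero int64 inputs are only a smoke test, not an accuracy check.
import numpy as np
import onnxruntime

model_path = "./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx"
session = onnxruntime.InferenceSession(model_path)  # onnxruntime-gpu selects the CUDA provider by default

batch_size, sequence_length = 1, 128
feeds = {
    inp.name: np.zeros((batch_size, sequence_length), dtype=np.int64)
    for inp in session.get_inputs()
}
outputs = session.run(None, feeds)  # for a SQuAD-style model: start/end logits
print([o.shape for o in outputs])
```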
  {
    "path": "code/bert-base-count5/finetuning/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
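`Config.py` only defines name→class lookup tables; the fine-tuning code is expected to pick the backbone, tokenizer, and config through them. A minimal usage sketch follows; the model name and pretrained directory are placeholders, not values taken from this repository's training scripts.

```python
# Illustrative lookup only; "BertLastTwoCls" and the path are placeholders.
from Config import CONFIGS, MODELS, TOKENIZERS

model_name = "BertLastTwoCls"          # any key defined in the dictionaries above
pretrained_dir = "./pretrain_output"   # placeholder: directory with config/vocab/weights

config = CONFIGS[model_name].from_pretrained(pretrained_dir)
tokenizer = TOKENIZERS[model_name].from_pretrained(pretrained_dir)
backbone = MODELS[model_name].from_pretrained(pretrained_dir, config=config)
```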
  {
    "path": "code/bert-base-count5/finetuning/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
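The defaults in `NeZhaConfig` above were inherited from `AlbertConfig` (e.g. `hidden_size=4096`, `num_attention_heads=64`), so a usable configuration overrides most of them. Below is a hedged sketch with BERT-base-sized values; the vocabulary size and the other concrete numbers are illustrative assumptions, not the ones used in this project.

```python
# Illustrative sketch: a BERT-base-sized NeZha configuration. The concrete values
# (especially vocab_size) are assumptions, not this project's settings.
from NEZHA.configuration_nezha import NeZhaConfig

config = NeZhaConfig(
    vocab_size=21128,            # placeholder; depends on the (desensitized) vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    max_relative_position=64,    # NeZha-specific: clipping range of the relative positions
    use_relative_position=True,  # main difference from vanilla BERT's absolute positions
)
```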
  {
    "path": "code/bert-base-count5/finetuning/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport logging\nimport torch\n\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_nezha import NeZhaConfig\nfrom transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward\nfrom transformers.modeling_utils import PreTrainedModel, prune_linear_layer\nfrom transformers.models.bert.modeling_bert import (\n    BertOutput,\n    BertPooler,\n    BertSelfOutput,\n    BertIntermediate,\n    BertOnlyMLMHead,\n    BertOnlyNSPHead,\n    BertPreTrainingHeads,\n    BERT_START_DOCSTRING,\n    BERT_INPUTS_DOCSTRING,\n)\n\nlogger = logging.getLogger(__name__)\n\n_CONFIG_FOR_DOC = \"NeZhaConfig\"\n_TOKENIZER_FOR_DOC = \"NeZhaTokenizer\"\n\nNEZHA_PRETRAINED_MODEL_ARCHIVE_LIST = []\nNEZHA_PRETRAINED_MODEL_ARCHIVE_MAP = {}\n\n\ndef load_tf_weights_in_nezha(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        # logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n                n in [\"adam_v\", \"adam_m\", \"lamb_m\", \"lamb_v\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\",\n                      \"global_step\", \"good_steps\", \"loss_scale\", 'bad_steps']\n                for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer 
= getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                    pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass NeZhaEmbeddings(nn.Module):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.use_relative_position = config.use_relative_position\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=127):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\n\nclass 
NeZhaSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=self.attention_head_size,\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n        relations_keys = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_keys.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = 
key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        relations_values = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_values)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass NeZhaAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = NeZhaSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = 
self.pruned_heads.union(heads)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass NeZhaLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = NeZhaAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = NeZhaAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([NeZhaLayer(config) for _ in range(config.num_hidden_layers)])\n\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass NeZhaPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n    config_class = NeZhaConfig\n    pretrained_model_archive_map = NEZHA_PRETRAINED_MODEL_ARCHIVE_MAP\n    load_tf_weights = load_tf_weights_in_nezha\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(NeZhaPreTrainedModel):\n    \"\"\"\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        self.embeddings = NeZhaEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n        self.pooler = BertPooler(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(\n            attention_mask, input_shape, self.device\n        )\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            
encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n                                                      1:\n                                                      ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForPreTraining(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n        # add hidden states and attention if they are here\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[2:]\n\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, seq_relationship_score, (hidden_states), 
(attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyMLMHead(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n            labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers import BertTokenizer, 
BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        masked_lm_labels = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass NeZhaForNextSentencePrediction(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        seq_relationship_scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.cls(pooled_output)\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n        
    loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForSequenceClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            position_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForMultipleChoice(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForTokenClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep 
active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForQuestionAnswering(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            position_ids=None,\n            start_positions=None,\n            end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the 
output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n"
  },
  {
    "path": "code/bert-base-count5/finetuning/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        # classify from the pooled [CLS] vector; apply dropout only when it is enabled\n        if self.isDropout:\n            pooler_output = self.dropout(pooler_output)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        # learn a scalar gate for each layer's [CLS] vector and combine the layers with these dynamic weights\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][:, 0]\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        # learn a scalar gate for each layer's mean-pooled output and combine the layers with these dynamic weights\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        #self.bert_model = MODELS[config.model](config=self.bert_config)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = 
self.classifier(concat_out)\n        return logit\n\n\n"
  },
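The classification heads defined in `model.py` above differ mainly in which BERT outputs they pool before the final linear layer: the last-layer [CLS], the [CLS] vectors of the last two or four layers, mean-pooled layer outputs, with or without the pooler output. As a minimal sketch, with random tensors standing in for a real `output_hidden_states=True` forward pass, this is the feature construction that `BertLastFourClsPooler` feeds to its classifier:

```python
import torch
import torch.nn as nn

# Toy stand-ins for a BERT forward pass (batch=2, seq_len=8, hidden=768, 12 layers + embeddings).
batch, seq_len, hidden, n_layers = 2, 8, 768, 13
hidden_states = [torch.randn(batch, seq_len, hidden) for _ in range(n_layers)]
pooler_output = torch.randn(batch, hidden)

# Concatenate the [CLS] vector (position 0) of the last four layers plus the pooler output,
# mirroring what BertLastFourClsPooler passes to its linear classifier.
features = torch.cat(
    (pooler_output,
     hidden_states[-1][:, 0], hidden_states[-2][:, 0],
     hidden_states[-3][:, 0], hidden_states[-4][:, 0]),
    dim=1)                                  # shape: (batch, hidden * 5)

classifier = nn.Linear(hidden * 5, 2)       # num_class = 2 for this task
logits = classifier(features)
print(logits.shape)                         # torch.Size([2, 2])
```

The other heads only change which slices of `hidden_states` enter the `torch.cat`, and the classifier's input width changes accordingly.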
  {
    "path": "code/bert-base-count5/finetuning/models/gitkeep",
    "content": ""
  },
  {
    "path": "code/bert-base-count5/finetuning/multi_gpu_QA.py",
    "content": "from tqdm import tqdm, trange\nimport numpy as np\nimport pandas as pd\nimport logging\nimport torch\nimport random\nimport os\nfrom torch import nn, optim\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nfrom transformers.optimization import get_linear_schedule_with_warmup\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.metrics import mean_absolute_error, accuracy_score, f1_score, roc_auc_score\nfrom model import *\nfrom utils import *\nimport time\nimport logging\nlogging.basicConfig(level=logging.DEBUG, filename=\"train.log\",filemode='a')\n\n\nfrom NEZHA.modeling_nezha import *\n\nMODEL_CLASSES = {\n    'BertForClass': BertForClass,\n    'BertLastCls': BertLastCls,\n    'BertLastTwoCls': BertLastTwoCls,\n    'BertLastTwoClsPooler': BertLastTwoClsPooler,\n    'BertLastTwoEmbeddings': BertLastTwoEmbeddings,\n    'BertLastTwoEmbeddingsPooler': BertLastTwoEmbeddingsPooler,\n    'BertLastFourCls': BertLastFourCls,\n    'BertLastFourClsPooler': BertLastFourClsPooler,\n    'BertLastFourEmbeddings': BertLastFourEmbeddings,\n    'BertLastFourEmbeddingsPooler': BertLastFourEmbeddingsPooler,\n    'BertDynCls': BertDynCls,\n    'BertDynEmbeddings': BertDynEmbeddings,\n    'BertRNN': BertRNN,\n    'BertCNN': BertCNN,\n    'BertRCNN': BertRCNN,\n    'XLNet': XLNet,\n    'Electra': Electra,\n    'NEZHA': NEZHA,\n\n}\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"BertForClass\"\n        self.Stratification = False\n        self.model_path = '../pretrain/bert_model/'\n\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 100\n        self.epoch = 3\n        self.learn_rate = 2e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 32\n        self.k_fold = 10\n        self.seed = 42\n\n        self.device = torch.device('cuda')\n        # self.device = torch.device('cpu')\n\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n\nconfig = Config()\nos.environ['PYTHONHASHSEED']='0'#消除hash算法的随机性\nrandom.seed(config.seed)\nnp.random.seed(config.seed)\ntorch.manual_seed(config.seed)\ntorch.cuda.manual_seed_all(config.seed)\n\n\nfile_path = './log/'\n# 创建一个logger\nlogger = logging.getLogger('mylogger')\nlogger.setLevel(logging.DEBUG)\n\n\ntrain = pd.read_csv('/tcdata/gaiic_track3_round1_train_20210228.tsv',sep='\\t',header=None)\nsemi = pd.read_csv('/tcdata/gaiic_track3_round2_train_20210407.tsv',sep='\\t',header=None)\ntrain = pd.concat([train, semi], sort=False)\ntrain.columns=['q1','q2','label']\n\n\ntrain_query1 = train['q1'].values.astype(str)\ntrain_query2 = train['q2'].values.astype(str)\ntrain_label = train['label'].values.astype(int)\n\n\noof_train = np.zeros((len(train), config.num_class), dtype=np.float32)\n\n\n#kf = StratifiedKFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\nkf = KFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\n\nfor fold, (train_index, valid_index) in enumerate(kf.split(train_query1, train_label)):\n\n    print('\\n\\n------------fold:{}------------\\n'.format(fold))\n\n    '''\n    q1 = train_query1[train_index]\n    q2 = train_query2[train_index]\n    y = train_label[train_index]\n    '''\n    q1 = train_query1\n    q2 = train_query2\n    y = train_label\n\n\n    val_q1 = train_query1[valid_index]\n    val_q2 = train_query2[valid_index]\n    
val_y = train_label[valid_index]\n\n    train_D = data_generator([q1, q2, y], config, shuffle=True)\n    val_D = data_generator([val_q1, val_q2, val_y], config)\n\n    model = MODEL_CLASSES[config.model](config).to(config.device)\n\n    if torch.cuda.device_count() > 1:\n        print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n        model = torch.nn.DataParallel(model)\n\n\n    if config.pgd:\n        pgd = PGD(model)\n        K = 3\n\n    elif config.fgm:\n        fgm = FGM(model)\n\n    if config.focalloss:\n        loss_fn = FocalLoss(config.num_class)\n    else:\n        loss_fn = nn.CrossEntropyLoss()  # BCEWithLogitsLoss就是把Sigmoid-BCELoss合成一步\n\n\n    num_train_steps = int(len(train) / config.batch_size * config.epoch)\n    param_optimizer = list(model.named_parameters())\n\n    no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n\n    if config.Stratification:\n        bert_params = [x for x in param_optimizer if 'bert' in x[0]]\n        normal_params = [p for n, p in param_optimizer if 'bert' not in n]\n        optimizer_parameters = [\n            {'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n            {'params': normal_params, 'lr': config.normal_lr},\n        ]\n    else:\n        optimizer_parameters = [\n            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n        ]\n\n    optimizer = AdamW(optimizer_parameters, lr=config.learn_rate) # lr为全局学习率\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer,\n        num_warmup_steps=int(len(train) / config.batch_size / 2),\n        num_training_steps=num_train_steps\n    )\n\n    best_auc = 0\n    PATH = './models/bert_{}.pth'.format(fold)\n    save_model_path = './models/'\n    if not os.path.exists(save_model_path):\n        os.makedirs(save_model_path)\n\n    for e in range(config.epoch):\n        print('\\n------------epoch:{}------------'.format(e))\n        model.train()\n        acc = 0\n        train_len = 0\n        loss_num = 0\n        tq = tqdm(train_D,ncols=70,disable=True)\n        last=time.time()\n        for input_ids, input_masks, segment_ids, labels in tq:\n            label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n            y_pred = model(input_ids, input_masks, segment_ids)\n\n            loss = loss_fn(y_pred, label_t)\n            loss = loss.mean()\n            loss.backward()\n\n            if config.pgd:\n                pgd.backup_grad()\n                # 对抗训练\n                for t in range(K):\n                    pgd.attack(is_first_attack=(t == 0))  # 在embedding上添加对抗扰动, first attack时备份param.data\n                    if t != K - 1:\n                        model.zero_grad()\n                    else:\n                        pgd.restore_grad()\n                    y_pred = model(input_ids, input_masks, segment_ids)\n\n                    loss_adv = loss_fn(y_pred, label_t)\n                    loss_adv = loss_adv.mean()\n                    loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                pgd.restore()  # 恢复embedding参数\n\n            elif config.fgm:\n                # 对抗训练\n                fgm.attack()  # 在embedding上添加对抗扰动\n                y_pred = model(input_ids, input_masks, 
segment_ids)\n                loss_adv = loss_fn(y_pred, label_t)\n                loss_adv = loss_adv.mean()\n                loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                fgm.restore()  # 恢复embedding参数\n\n\n            # 梯度下降，更新参数\n            optimizer.step()\n            scheduler.step()  # Update learning rate schedule\n            model.zero_grad()\n\n            y_pred = np.argmax(y_pred.detach().to(\"cpu\").numpy(), axis=1)\n            acc += sum(y_pred == labels)\n            loss_num += loss.item()\n            train_len += len(labels)\n            tq.set_postfix(fold=fold, epoch=e, loss=loss_num / train_len, acc=acc / train_len)\n        print(f\"微调第{e}轮耗时：{time.time()-last}\")\n        model.eval()\n        with torch.no_grad():\n            y_p = []\n            y_l = []\n            train_logit = None\n            for input_ids, input_masks, segment_ids, labels in tqdm(val_D,disable=True):\n                label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n                y_pred = model(input_ids, input_masks, segment_ids)\n                y_pred = F.softmax(y_pred)\n                y_pred = y_pred.detach().to(\"cpu\").numpy()\n                if train_logit is None:\n                    train_logit = y_pred\n                else:\n                    train_logit = np.vstack((train_logit, y_pred))\n\n                y_p += list(y_pred[:,1])\n\n                y_pred = np.argmax(y_pred, axis=1)\n                y_l += list(y_pred)\n\n\n            f1 = f1_score(val_y, y_l, average=\"macro\")\n            auc_score = roc_auc_score(val_y, y_p)\n            print(\"best_auc:{}  auc_score:{}  f1:{}\\n\".format(best_auc, auc_score, f1))\n            if auc_score >= best_auc:\n                best_auc = auc_score\n                oof_train[valid_index] = np.array(train_logit)\n                #torch.save(model.module.state_dict() if hasattr(model, \"module\") else model.state_dict(), PATH)\n                torch.save(model.module if hasattr(model, \"module\") else model, PATH)\n\n    optimizer.zero_grad()\n\n    del model\n    torch.cuda.empty_cache()\n\n    break\n\n"
  },
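For reference, a minimal sketch of the optimizer and scheduler step order used in the fine-tuning loop above. A toy `nn.Linear` and an assumed sample count stand in for the BERT classifier and `len(train)`; the warmup rule and hyperparameters follow the script's `Config`:

```python
import torch
from torch import nn
from transformers import AdamW, get_linear_schedule_with_warmup

# Toy stand-ins: a linear layer instead of the BERT classifier, and an assumed
# sample count; the real script derives these from config and len(train).
model = nn.Linear(10, 2)
n_samples, batch_size, epochs = 100_000, 32, 3
num_train_steps = int(n_samples / batch_size * epochs)

# Weight decay is passed directly here rather than via the script's no-decay parameter groups.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(n_samples / batch_size / 2),   # half an epoch of warmup, as in the script
    num_training_steps=num_train_steps)

loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(batch_size, 10), torch.randint(0, 2, (batch_size,))

# One training step, in the same order as the fine-tuning loop:
loss = loss_fn(model(x), y)
loss.backward()                      # normal gradients
# (the FGM/PGD adversarial pass would accumulate extra gradients here)
optimizer.step()
scheduler.step()                     # advance the warmup / linear-decay schedule
model.zero_grad()
```

When `config.fgm` is enabled, the extra `fgm.attack()` forward/backward pass runs between the first `loss.backward()` and `optimizer.step()`, so the adversarial gradients accumulate on top of the normal ones before the parameter update.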
  {
    "path": "code/bert-base-count5/finetuning/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\nclass data_generator:\n    def __init__(self, data, config, shuffle=False):\n        self.data = data\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n        self.steps = len(self.data[0]) // self.batch_size\n        if len(self.data[0]) % self.batch_size != 0:\n            self.steps += 1\n\n    def __len__(self):\n        return self.steps\n\n    def __iter__(self):\n        q1, q2, y = self.data\n        idxs = list(range(len(self.data[0])))\n        if self.shuffle:\n            np.random.shuffle(idxs)\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        for index, i in enumerate(idxs):\n\n            text = q1[i]\n            text_pair = q2[i]\n            '''\n            # text = self.tokenizer(text, text_pair, padding='max_length', truncation=True, max_length=self.max_length)\n            text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n            input_ids.append(text['input_ids'])\n            segment_ids.append(text['token_type_ids'])\n            input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n            yield input_ids, input_masks, segment_ids, labels\n            input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n            '''\n            tkRes = self.tokenizer(text, text_pair, max_length=self.max_length, truncation='longest_first',\n                                   return_attention_mask=False)\n            input_id = tkRes['input_ids']\n            segment_id = tkRes['token_type_ids']\n            assert len(segment_id) == len(input_id)\n            input_ids.append(input_id)\n            segment_ids.append(segment_id)\n            labels.append(y[i])\n\n            if len(input_ids) == self.batch_size or i == idxs[-1]:\n                input_ids = paddingList(input_ids, 0, returnTensor=True)  # 动态padding\n                segment_ids = paddingList(segment_ids, 0, returnTensor=True)\n           
     input_masks = (input_ids != 0)\n                yield input_ids, input_masks, segment_ids, labels\n                input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=0.3, alpha=0.1, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\n\nclass FGM():\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.25, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# 支持多分类和二分类\nclass FocalLoss(nn.Module):\n    \"\"\"\n    This is a implementation of Focal Loss with smooth label cross entropy supported which is proposed in\n    'Focal Loss for Dense Object Detection. 
(https://arxiv.org/abs/1708.02002)'\n    Focal_Loss= -1*alpha*(1-pt)^gamma*log(pt)\n    :param num_class:\n    :param alpha: (tensor) 3D or 4D the scalar factor for this criterion\n    :param gamma: (float,double) gamma > 0 reduces the relative loss\n    for well-classified examples (p>0.5) putting more\n    focus on hard misclassified example\n    :param smooth: (float,double) smooth value when cross entropy\n    :param balance_index: (int) balance class index,\n    should be specific when alpha is float\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n    def __init__(self, num_class, alpha=None, gamma=2,\n                smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Not support alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        # N = input.size(0)\n        # alpha = torch.ones(N, self.num_class)\n        # alpha = alpha * (1 - self.alpha)\n        # alpha = alpha.scatter_(1, target.long(), self.alpha)\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        alpha = alpha[idx]\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true,y_pred):\n    acc = sum(y_pred & y_true) / (sum(y_pred))\n    rec = sum(y_pred & y_true) / (sum(y_true))\n\n    return 2 * acc * rec /(acc + rec)"
  },
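A minimal sketch of the dynamic-padding scheme that `data_generator` relies on, re-implemented here with toy token ids and CPU tensors (the repo's `paddingList` additionally places the tensor on `cuda`): each batch is padded only to its own longest sequence, and the attention mask is simply `input_ids != 0`.

```python
import torch

def padding_list(ls, val, return_tensor=False):
    # Same behaviour as paddingList in utils.py: pad every sequence in the
    # batch to the length of the longest one (dynamic padding).
    ls = [seq[:] for seq in ls]
    max_len = max(len(seq) for seq in ls)
    ls = [seq + [val] * (max_len - len(seq)) for seq in ls]
    return torch.tensor(ls) if return_tensor else ls

# Toy batch of already-tokenised id sequences of different lengths.
batch_ids = [[101, 7, 8, 102, 9, 102], [101, 3, 102, 4, 102]]
input_ids = padding_list(batch_ids, 0, return_tensor=True)
attention_mask = (input_ids != 0)      # padding id 0 doubles as the mask criterion
print(input_ids.shape, attention_mask.tolist())
```

This works because id 0 is reserved for [PAD] in standard BERT vocabularies, so the zero check recovers the attention mask without storing it per sample.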
  {
    "path": "code/bert-base-count5/pretrain/NLP_Utils.py",
    "content": "import random\nimport json\nimport transformers as _\nfrom transformers1 import BertTokenizer\nimport torch\nfrom torch.utils.data import Dataset,DataLoader\nimport numpy as np\nfrom itertools import chain\n\ndef writeToJsonFile(path: str, obj):\n    with open(path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(obj, ensure_ascii=False,indent=0))\ndef readFromJsonFile(path: str):\n    with open(path, \"r\", encoding=\"utf-8\") as f:\n        return json.loads(f.read())\n\ndef loadData(path):\n    allData=[]\n    with open(path,\"r\") as f:\n        for i in f:\n            i=i.strip().split('\\t')\n            if len(i)==0:#防止空行\n                break\n            if len(i)==3:#训练集\n                a,b,label=i\n                a=a.split(' ')\n                b=b.split(' ')\n            else:#测试集，直接转为id形式\n                a,b,label=i[0],i[1],-1\n                a=a.split(' ')\n                b=b.split(' ')\n            allData.append([a,b,label])\n    return allData\n\ndef calNegPos(ls):#计算正负比例\n    posNum,negNum=0,0\n    for i in ls:\n        if i[2]==0:\n            negNum+=1\n        elif i[2]==1:\n            posNum+=1\n    posNum=1 if posNum==0 else posNum\n    return negNum,posNum,round(negNum/posNum,4)\n\nallData=loadData('/tcdata/gaiic_track3_round1_train_20210228.tsv')+loadData('/tcdata/gaiic_track3_round2_train_20210407.tsv')\ntestA_data = loadData('/tcdata/gaiic_track3_round1_testA_20210228.tsv')\ntestB_data = loadData('/tcdata/gaiic_track3_round1_testB_20210317.tsv')\nrandom.shuffle(allData)\n\ntrain_data=allData+testA_data+testB_data#全量\nvalid_data=allData[-20000:]\nprint(\"训练集样本数量：\", len(train_data))\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef truncate(a:list,b:list,maxLen):\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    return a,b\n\nclass MLM_Data(Dataset):\n    #传入句子对列表\n    def __init__(self,textLs:list,maxLen:int,tk:BertTokenizer):\n        super().__init__()\n        self.data=textLs\n        self.maxLen=maxLen\n        self.tk=tk\n        self.spNum=len(tk.all_special_tokens)\n        self.tkNum=tk.vocab_size\n\n    def __len__(self):\n        return len(self.data)\n\n    def random_mask(self,text_ids):\n        input_ids, output_ids = [], []\n        rands = np.random.random(len(text_ids))\n        idx=0\n        while idx<len(rands):\n            if rands[idx]<0.15:#需要mask\n                ngram=np.random.choice([1,2,3], p=[0.7,0.2,0.1])#若要mask，进行x_gram mask的概率\n                if ngram==3 and len(rands)<7:#太大的gram不要应用于过短文本\n                    ngram=2\n                if ngram==2 and len(rands)<4:\n                    ngram=1\n                L=idx+1\n                R=idx+ngram#最终需要mask的右边界（开）\n                while L<R and L<len(rands):\n                    rands[L]=np.random.random()*0.15#强制mask\n                    L+=1\n                idx=R\n                if idx<len(rands):\n                    rands[idx]=1#禁止mask片段的下一个token被mask，防止一大片连续mask\n 
           idx+=1\n\n        for r, i in zip(rands, text_ids):\n            if r < 0.15 * 0.8:\n                input_ids.append(self.tk.mask_token_id)\n                output_ids.append(i)#mask预测自己\n            elif r < 0.15 * 0.9:\n                input_ids.append(i)\n                output_ids.append(i)#自己预测自己\n            elif r < 0.15:\n                input_ids.append(np.random.randint(self.spNum,self.tkNum))\n                output_ids.append(i)#随机的一个词预测自己，随机词不会从特殊符号中选取，有小概率抽到自己\n            else:\n                input_ids.append(i)\n                output_ids.append(-100)#保持原样不预测\n\n        return input_ids, output_ids\n\n    #耗时操作在此进行，可用上多进程\n    def __getitem__(self, item):\n        text1,text2,_=self.data[item]#预处理，mask等操作\n        if random.random()>0.5:\n            text1,text2=text2,text1#交换位置\n        text1,text2=truncate(text1,text2,self.maxLen)\n        text1_ids,text2_ids = self.tk.convert_tokens_to_ids(text1),self.tk.convert_tokens_to_ids(text2)\n        text1_ids, out1_ids = self.random_mask(text1_ids)#添加mask预测\n        text2_ids, out2_ids = self.random_mask(text2_ids)\n        input_ids = [self.tk.cls_token_id] + text1_ids + [self.tk.sep_token_id] + text2_ids + [self.tk.sep_token_id]#拼接\n        token_type_ids=[0]*(len(text1_ids)+2)+[1]*(len(text2_ids)+1)\n        labels = [-100] + out1_ids + [-100] + out2_ids + [-100]\n        assert len(input_ids)==len(token_type_ids)==len(labels)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids,'labels':labels}\n\n    @classmethod\n    def collate(cls,batch):\n        input_ids=[i['input_ids'] for i in batch]\n        token_type_ids=[i['token_type_ids'] for i in batch]\n        labels=[i['labels'] for i in batch]\n        input_ids=paddingList(input_ids,0,returnTensor=True)\n        token_type_ids=paddingList(token_type_ids,0,returnTensor=True)\n        labels=paddingList(labels,-100,returnTensor=True)\n        attention_mask=(input_ids!=0)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids\n                ,'attention_mask':attention_mask,'labels':labels}\n\n\n\n\nunionList=lambda ls:list(chain(*ls))#按元素拼接\nsplitList=lambda x,bs:[x[i:i+bs] for i in range(0,len(x),bs)]#按bs切分\n\n\n#sortBsNum：原序列按多少个bs块为单位排序，可用来增强随机性\n#比如如果每次打乱后都全体一起排序，那每次都是一样的\ndef blockShuffle(data:list,bs:int,sortBsNum,key):\n    random.shuffle(data)#先打乱\n    tail=len(data)%bs#计算碎片长度\n    tail=[] if tail==0 else data[-tail:]\n    data=data[:len(data)-len(tail)]\n    assert len(data)%bs==0#剩下的一定能被bs整除\n    sortBsNum=len(data)//bs if sortBsNum is None else sortBsNum#为None就是整体排序\n    data=splitList(data,sortBsNum*bs)\n    data=[sorted(i,key=key,reverse=True) for i in data]#每个大块进行降排序\n    data=unionList(data)\n    data=splitList(data,bs)#最后，按bs分块\n    random.shuffle(data)#块间打乱\n    data=unionList(data)+tail\n    return data\nfrom torch.utils.data.dataloader import _SingleProcessDataLoaderIter,_MultiProcessingDataLoaderIter\n#每轮迭代重新分块shuffle数据的DataLoader\nclass blockShuffleDataLoader(DataLoader):\n    def __init__(self, dataset: Dataset,sortBsNum,key,**kwargs):\n        assert isinstance(dataset.data,list)#需要有list类型的data属性\n        super().__init__(dataset,**kwargs)#父类的参数传过去\n        self.sortBsNum=sortBsNum\n        self.key=key\n\n    def __iter__(self):\n        #分块shuffle\n        self.dataset.data=blockShuffle(self.dataset.data,self.batch_size,self.sortBsNum,self.key)\n        if self.num_workers == 0:\n            return _SingleProcessDataLoaderIter(self)\n        else:\n            return 
_MultiProcessingDataLoaderIter(self)\n"
  },
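For intuition, a simplified re-implementation of the block-shuffle idea from `NLP_Utils.py` (the `sortBsNum=None` case: one global length sort, then batch-sized blocks are shuffled), run on toy variable-length samples:

```python
import random
from itertools import chain

def block_shuffle(data, bs, key):
    # Sketch of blockShuffle: shuffle, sort by length so that similarly sized
    # samples share a batch, split into bs-sized blocks, then shuffle the
    # blocks (the tail fragment is appended unsorted, as in the original).
    random.shuffle(data)
    tail = len(data) % bs
    tail, data = (data[-tail:], data[:-tail]) if tail else ([], data)
    data = sorted(data, key=key, reverse=True)
    blocks = [data[i:i + bs] for i in range(0, len(data), bs)]
    random.shuffle(blocks)
    return list(chain(*blocks)) + tail

# Toy "sentence pairs": lists whose length stands in for token count.
samples = [list(range(random.randint(3, 20))) for _ in range(10)]
batched = block_shuffle(samples, bs=4, key=len)
print([len(s) for s in batched])   # lengths are similar within each block of 4
```

Since each block becomes one batch in `blockShuffleDataLoader`, samples inside a batch have similar lengths and the dynamic padding in `MLM_Data.collate` wastes little computation.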
  {
    "path": "code/bert-base-count5/pretrain/__init__.py",
    "content": ""
  },
  {
    "path": "code/bert-base-count5/pretrain/bert_model/gitkeep",
    "content": ""
  },
  {
    "path": "code/bert-base-count5/pretrain/train_bert.py",
    "content": "# coding:utf-8\nimport numpy as np\nimport random\nimport os\nrandom.seed(0)\nnp.random.seed(0)#seed应该在main里尽早设置，以防万一\nos.environ['PYTHONHASHSEED'] =str(0)#消除hash算法的随机性\nfrom transformers import BertForMaskedLM#除nezha外模型用新版加载\nfrom transformers1 import Trainer, TrainingArguments,BertTokenizer,BertConfig\nfrom NLP_Utils import MLM_Data,train_data,blockShuffleDataLoader\n\nmaxlen=100\nbatch_size=128\nvocab_file_dir = './bert_model/vocab.txt'\ntokenizer = BertTokenizer.from_pretrained(vocab_file_dir)\n\nconfig = BertConfig(\n    vocab_size=len(tokenizer),\n    hidden_size=768,\n    num_hidden_layers=12,\n    num_attention_heads=12,\n    max_position_embeddings=512,\n)\n\n# 把层数改为8层\nmodel = BertForMaskedLM.from_pretrained('../../bert-base-chinese')\n\nmodel.resize_token_embeddings(len(tokenizer))\nprint(model)\ntrain_MLM_data=MLM_Data(train_data,maxlen,tokenizer)\n#自己定义dataloader，不要用huggingface的\ndl=blockShuffleDataLoader(train_MLM_data,None,key=lambda x:len(x[0])+len(x[1]),shuffle=False\n                          ,batch_size=batch_size,collate_fn=train_MLM_data.collate)\n\ntraining_args = TrainingArguments(\n    output_dir='./bert_output',\n    overwrite_output_dir=True,\n    num_train_epochs=400,\n    per_device_train_batch_size=batch_size,\n    save_steps=len(dl)*10000,#每10个epoch save一次\n    save_total_limit=3,\n    logging_steps=len(dl),#每个epoch log一次\n    seed=2021,\n    learning_rate=5e-5,\n    lr_end=1e-5,#学习率衰减的终点\n    weight_decay=0.01,\n    warmup_steps=int(450000*150/batch_size*0.03)\n)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataLoader=dl,\n    prediction_loss_only=True,\n)\n\nif __name__ == '__main__':\n    trainer.train()\n    trainer.save_model('./bert_model')\n"
  },
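Once `trainer.save_model('./bert_model')` has written the MLM backbone next to `vocab.txt`, the fine-tuning side points `config.model_path` at that directory and reloads it the same way `model.py` does for its BERT branch. A minimal sketch, assuming the default repo paths and that pretraining has already produced the files:

```python
from transformers import BertConfig, BertModel, BertTokenizer

model_dir = './bert_model/'   # directory written by trainer.save_model above

# model.py loads the config with output_hidden_states=True so the heads can
# read the per-layer [CLS] vectors.
config = BertConfig.from_pretrained(model_dir + 'config.json', output_hidden_states=True)
backbone = BertModel.from_pretrained(model_dir, config=config)
tokenizer = BertTokenizer.from_pretrained(model_dir + 'vocab.txt')
```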
  {
    "path": "code/bert-base-count5/pretrain/transformers1/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\n__version__ = \"2.11.0\"\n\n# Work around to update TensorFlow's absl.logging threshold which alters the\n# default Python logging output behavior when present.\n# see: https://github.com/abseil/abseil-py/issues/99\n# and: https://github.com/tensorflow/tensorflow/issues/26691#issuecomment-500369493\ntry:\n    import absl.logging\nexcept ImportError:\n    pass\nelse:\n    absl.logging.set_verbosity(\"info\")\n    absl.logging.set_stderrthreshold(\"info\")\n    absl.logging._warn_preinit_stderr = False\n\nimport logging\n\n# Configurations\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig\nfrom .configuration_bart import BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_mmbt import MMBTConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\nfrom .data import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    is_sklearn_available,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n# Files and general utilities\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    PYTORCH_PRETRAINED_BERT_CACHE,\n    PYTORCH_TRANSFORMERS_CACHE,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    TRANSFORMERS_CACHE,\n    WEIGHTS_NAME,\n    add_end_docstrings,\n    add_start_docstrings,\n    cached_path,\n    is_tf_available,\n    is_torch_available,\n)\nfrom .hf_argparser import HfArgumentParser\n\n# Model Cards\nfrom .modelcard import ModelCard\n\n# TF 2.0 <=> PyTorch 
conversion utilities\nfrom .modeling_tf_pytorch_utils import (\n    convert_tf_weight_name_to_pt_weight_name,\n    load_pytorch_checkpoint_in_tf2_model,\n    load_pytorch_model_in_tf2_model,\n    load_pytorch_weights_in_tf2_model,\n    load_tf2_checkpoint_in_pytorch_model,\n    load_tf2_model_in_pytorch_model,\n    load_tf2_weights_in_pytorch_model,\n)\n\n# Pipelines\nfrom .pipelines import (\n    CsvPipelineDataFormat,\n    FeatureExtractionPipeline,\n    FillMaskPipeline,\n    JsonPipelineDataFormat,\n    NerPipeline,\n    PipedPipelineDataFormat,\n    Pipeline,\n    PipelineDataFormat,\n    QuestionAnsweringPipeline,\n    SummarizationPipeline,\n    TextClassificationPipeline,\n    TextGenerationPipeline,\n    TokenClassificationPipeline,\n    TranslationPipeline,\n    pipeline,\n)\n\n# Tokenizers\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer\nfrom .tokenization_bart import BartTokenizer, MBartTokenizer\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer\nfrom .trainer_utils import EvalPrediction\nfrom .training_args import TrainingArguments\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\nif is_sklearn_available():\n    from .data import glue_compute_metrics, xnli_compute_metrics\n\n\n# Modeling\nif is_torch_available():\n    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering, apply_chunking_to_forward\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForPreTraining,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelWithLMHead,\n        AutoModelForTokenClassification,\n        AutoModelForMultipleChoice,\n        MODEL_MAPPING,\n        MODEL_FOR_PRETRAINING_MAPPING,\n        MODEL_WITH_LM_HEAD_MAPPING,\n        MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n        MODEL_FOR_MULTIPLE_CHOICE_MAPPING,\n    )\n\n    from .modeling_bert import (\n        BertPreTrainedModel,\n        BertModel,\n        BertForPreTraining,\n        BertForMaskedLM,\n        BertForNextSentencePrediction,\n        BertForSequenceClassification,\n        BertForMultipleChoice,\n        
BertForTokenClassification,\n        BertForQuestionAnswering,\n        load_tf_weights_in_bert,\n        BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n        BertLayer,\n    )\n    from .modeling_openai import (\n        OpenAIGPTPreTrainedModel,\n        OpenAIGPTModel,\n        OpenAIGPTLMHeadModel,\n        OpenAIGPTDoubleHeadsModel,\n        load_tf_weights_in_openai_gpt,\n        OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_transfo_xl import (\n        TransfoXLPreTrainedModel,\n        TransfoXLModel,\n        TransfoXLLMHeadModel,\n        AdaptiveEmbedding,\n        load_tf_weights_in_transfo_xl,\n        TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_gpt2 import (\n        GPT2PreTrainedModel,\n        GPT2Model,\n        GPT2LMHeadModel,\n        GPT2DoubleHeadsModel,\n        load_tf_weights_in_gpt2,\n        GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_ctrl import CTRLPreTrainedModel, CTRLModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_LIST\n    from .modeling_xlnet import (\n        XLNetPreTrainedModel,\n        XLNetModel,\n        XLNetLMHeadModel,\n        XLNetForSequenceClassification,\n        XLNetForTokenClassification,\n        XLNetForMultipleChoice,\n        XLNetForQuestionAnsweringSimple,\n        XLNetForQuestionAnswering,\n        load_tf_weights_in_xlnet,\n        XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm import (\n        XLMPreTrainedModel,\n        XLMModel,\n        XLMWithLMHeadModel,\n        XLMForSequenceClassification,\n        XLMForTokenClassification,\n        XLMForQuestionAnswering,\n        XLMForQuestionAnsweringSimple,\n        XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_bart import (\n        BartForSequenceClassification,\n        BartModel,\n        BartForConditionalGeneration,\n        BART_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_marian import MarianMTModel\n    from .tokenization_marian import MarianTokenizer\n    from .modeling_roberta import (\n        RobertaForMaskedLM,\n        RobertaModel,\n        RobertaForSequenceClassification,\n        RobertaForMultipleChoice,\n        RobertaForTokenClassification,\n        RobertaForQuestionAnswering,\n        ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_distilbert import (\n        DistilBertPreTrainedModel,\n        DistilBertForMaskedLM,\n        DistilBertModel,\n        DistilBertForSequenceClassification,\n        DistilBertForQuestionAnswering,\n        DistilBertForTokenClassification,\n        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_camembert import (\n        CamembertForMaskedLM,\n        CamembertModel,\n        CamembertForSequenceClassification,\n        CamembertForMultipleChoice,\n        CamembertForTokenClassification,\n        CamembertForQuestionAnswering,\n        CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_encoder_decoder import EncoderDecoderModel\n    from .modeling_t5 import (\n        T5PreTrainedModel,\n        T5Model,\n        T5ForConditionalGeneration,\n        load_tf_weights_in_t5,\n        T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_albert import (\n        AlbertPreTrainedModel,\n        AlbertModel,\n        AlbertForPreTraining,\n        AlbertForMaskedLM,\n        AlbertForSequenceClassification,\n        AlbertForQuestionAnswering,\n        AlbertForTokenClassification,\n        load_tf_weights_in_albert,\n        
ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm_roberta import (\n        XLMRobertaForMaskedLM,\n        XLMRobertaModel,\n        XLMRobertaForMultipleChoice,\n        XLMRobertaForSequenceClassification,\n        XLMRobertaForTokenClassification,\n        XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_mmbt import ModalEmbeddings, MMBTModel, MMBTForClassification\n\n    from .modeling_flaubert import (\n        FlaubertModel,\n        FlaubertWithLMHeadModel,\n        FlaubertForSequenceClassification,\n        FlaubertForQuestionAnswering,\n        FlaubertForQuestionAnsweringSimple,\n        FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_electra import (\n        ElectraForPreTraining,\n        ElectraForMaskedLM,\n        ElectraForTokenClassification,\n        ElectraPreTrainedModel,\n        ElectraForSequenceClassification,\n        ElectraModel,\n        load_tf_weights_in_electra,\n        ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_reformer import (\n        ReformerAttention,\n        ReformerLayer,\n        ReformerModel,\n        ReformerModelWithLMHead,\n        REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_longformer import (\n        LongformerModel,\n        LongformerForMaskedLM,\n        LongformerForSequenceClassification,\n        LongformerForMultipleChoice,\n        LongformerForTokenClassification,\n        LongformerForQuestionAnswering,\n        LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization import (\n        AdamW,\n        get_constant_schedule,\n        get_constant_schedule_with_warmup,\n        get_cosine_schedule_with_warmup,\n        get_cosine_with_hard_restarts_schedule_with_warmup,\n        get_linear_schedule_with_warmup,\n    )\n\n    # Trainer\n    from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction\n    from .data.data_collator import DefaultDataCollator, DataCollator, DataCollatorForLanguageModeling\n    from .data.datasets import GlueDataset, TextDataset, LineByLineTextDataset, GlueDataTrainingArguments\n\n    # Benchmarks\n    from .benchmark import PyTorchBenchmark, PyTorchBenchmarkArguments\n\n# TensorFlow\nif is_tf_available():\n    from .modeling_tf_utils import (\n        TFPreTrainedModel,\n        TFSharedEmbeddings,\n        TFSequenceSummary,\n        shape_list,\n        tf_top_k_top_p_filtering,\n    )\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForPreTraining,\n        TFAutoModelForMultipleChoice,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelWithLMHead,\n        TFAutoModelForTokenClassification,\n        TF_MODEL_MAPPING,\n        TF_MODEL_FOR_PRETRAINING_MAPPING,\n        TF_MODEL_WITH_LM_HEAD_MAPPING,\n        TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n    )\n\n    from .modeling_tf_bert import (\n        TFBertPreTrainedModel,\n        TFBertMainLayer,\n        TFBertEmbeddings,\n        TFBertModel,\n        TFBertForPreTraining,\n        TFBertForMaskedLM,\n        TFBertForNextSentencePrediction,\n        TFBertForSequenceClassification,\n        TFBertForMultipleChoice,\n        TFBertForTokenClassification,\n        TFBertForQuestionAnswering,\n        TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_gpt2 import 
(\n        TFGPT2PreTrainedModel,\n        TFGPT2MainLayer,\n        TFGPT2Model,\n        TFGPT2LMHeadModel,\n        TFGPT2DoubleHeadsModel,\n        TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_openai import (\n        TFOpenAIGPTPreTrainedModel,\n        TFOpenAIGPTMainLayer,\n        TFOpenAIGPTModel,\n        TFOpenAIGPTLMHeadModel,\n        TFOpenAIGPTDoubleHeadsModel,\n        TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_transfo_xl import (\n        TFTransfoXLPreTrainedModel,\n        TFTransfoXLMainLayer,\n        TFTransfoXLModel,\n        TFTransfoXLLMHeadModel,\n        TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n        TFAdaptiveEmbedding,\n    )\n\n    from .modeling_tf_xlnet import (\n        TFXLNetPreTrainedModel,\n        TFXLNetMainLayer,\n        TFXLNetModel,\n        TFXLNetLMHeadModel,\n        TFXLNetForSequenceClassification,\n        TFXLNetForTokenClassification,\n        TFXLNetForQuestionAnsweringSimple,\n        TF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm import (\n        TFXLMPreTrainedModel,\n        TFXLMMainLayer,\n        TFXLMModel,\n        TFXLMWithLMHeadModel,\n        TFXLMForSequenceClassification,\n        TFXLMForQuestionAnsweringSimple,\n        TF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm_roberta import (\n        TFXLMRobertaForMaskedLM,\n        TFXLMRobertaModel,\n        TFXLMRobertaForSequenceClassification,\n        TFXLMRobertaForTokenClassification,\n        TF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_roberta import (\n        TFRobertaPreTrainedModel,\n        TFRobertaMainLayer,\n        TFRobertaModel,\n        TFRobertaForMaskedLM,\n        TFRobertaForSequenceClassification,\n        TFRobertaForTokenClassification,\n        TFRobertaForQuestionAnswering,\n        TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_camembert import (\n        TFCamembertModel,\n        TFCamembertForMaskedLM,\n        TFCamembertForSequenceClassification,\n        TFCamembertForTokenClassification,\n        TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_flaubert import (\n        TFFlaubertModel,\n        TFFlaubertWithLMHeadModel,\n        TFFlaubertForSequenceClassification,\n        TF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_distilbert import (\n        TFDistilBertPreTrainedModel,\n        TFDistilBertMainLayer,\n        TFDistilBertModel,\n        TFDistilBertForMaskedLM,\n        TFDistilBertForSequenceClassification,\n        TFDistilBertForTokenClassification,\n        TFDistilBertForQuestionAnswering,\n        TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_ctrl import (\n        TFCTRLPreTrainedModel,\n        TFCTRLModel,\n        TFCTRLLMHeadModel,\n        TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_albert import (\n        TFAlbertPreTrainedModel,\n        TFAlbertMainLayer,\n        TFAlbertModel,\n        TFAlbertForPreTraining,\n        TFAlbertForMaskedLM,\n        TFAlbertForMultipleChoice,\n        TFAlbertForSequenceClassification,\n        TFAlbertForQuestionAnswering,\n        TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_t5 import (\n        TFT5PreTrainedModel,\n        TFT5Model,\n        TFT5ForConditionalGeneration,\n        TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_electra 
import (\n        TFElectraPreTrainedModel,\n        TFElectraModel,\n        TFElectraForPreTraining,\n        TFElectraForMaskedLM,\n        TFElectraForTokenClassification,\n        TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization_tf import WarmUp, create_optimizer, AdamWeightDecay, GradientAccumulator\n\n    # Trainer\n    from .trainer_tf import TFTrainer\n\n\nif not is_tf_available() and not is_torch_available():\n    logger.warning(\n        \"Neither PyTorch nor TensorFlow >= 2.0 has been found. \"\n        \"Models won't be available and only tokenizers, configuration \"\n        \"and file/data utilities can be used.\"\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/__main__.py",
    "content": "# coding: utf8\ndef main():\n    import sys\n    if (len(sys.argv) < 4 or len(sys.argv) > 6) or sys.argv[1] not in [\"bert\", \"gpt\", \"transfo_xl\", \"gpt2\", \"xlnet\", \"xlm\"]:\n        print(\n        \"This command line utility let you convert original (author released) model checkpoint to pytorch.\\n\"\n        \"It should be used as one of: \\n\"\n        \">> transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \\n\"\n        \">> transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG], \\n\"\n        \">> transformers1 transfo_xl TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG] or \\n\"\n        \">> transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [GPT2_CONFIG] or \\n\"\n        \">> transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME] or \\n\"\n        \">> transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT\")\n    else:\n        if sys.argv[1] == \"bert\":\n            try:\n                from .convert_bert_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) != 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`\")\n            else:\n                PYTORCH_DUMP_OUTPUT = sys.argv.pop()\n                TF_CONFIG = sys.argv.pop()\n                TF_CHECKPOINT = sys.argv.pop()\n                convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"gpt\":\n            from .convert_openai_original_tf_checkpoint_to_pytorch import convert_openai_checkpoint_to_pytorch\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG]`\")\n            else:\n                OPENAI_GPT_CHECKPOINT_FOLDER_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    OPENAI_GPT_CONFIG = sys.argv[4]\n                else:\n                    OPENAI_GPT_CONFIG = \"\"\n                convert_openai_checkpoint_to_pytorch(OPENAI_GPT_CHECKPOINT_FOLDER_PATH,\n                                                    OPENAI_GPT_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"transfo_xl\":\n            try:\n                from .convert_transfo_xl_original_tf_checkpoint_to_pytorch import convert_transfo_xl_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 transfo_xl TF_CHECKPOINT/TF_DATASET_FILE PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                if 'ckpt' in sys.argv[2].lower():\n                    TF_CHECKPOINT = sys.argv[2]\n                    TF_DATASET_FILE = \"\"\n                else:\n                    TF_DATASET_FILE = sys.argv[2]\n                    TF_CHECKPOINT = \"\"\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_transfo_xl_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT, TF_DATASET_FILE)\n        elif sys.argv[1] == \"gpt2\":\n            try:\n                from .convert_gpt2_original_tf_checkpoint_to_pytorch import convert_gpt2_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_gpt2_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"xlnet\":\n            try:\n                from .convert_xlnet_original_tf_checkpoint_to_pytorch import convert_xlnet_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 5 or len(sys.argv) > 6:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                TF_CONFIG = sys.argv[3]\n                PYTORCH_DUMP_OUTPUT = sys.argv[4]\n                if len(sys.argv) == 6:\n                    FINETUNING_TASK = sys.argv[5]\n                else:\n                    FINETUNING_TASK = None\n\n                convert_xlnet_checkpoint_to_pytorch(TF_CHECKPOINT,\n                                                    TF_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT,\n                                                    FINETUNING_TASK)\n        elif sys.argv[1] == \"xlm\":\n            from .convert_xlm_original_pytorch_checkpoint_to_pytorch import convert_xlm_checkpoint_to_pytorch\n\n            if len(sys.argv) != 4:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT`\")\n            else:\n                XLM_CHECKPOINT_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n\n                convert_xlm_checkpoint_to_pytorch(XLM_CHECKPOINT_PATH, PYTORCH_DUMP_OUTPUT)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/activations.py",
    "content": "import logging\nimport math\n\nimport torch\nimport torch.nn.functional as F\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef swish(x):\n    return x * torch.sigmoid(x)\n\n\ndef _gelu_python(x):\n    \"\"\" Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        This is now written in C in torch.nn.functional\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))\n\n\ndef gelu_new(x):\n    \"\"\" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))\n\n\nif torch.__version__ < \"1.4.0\":\n    gelu = _gelu_python\nelse:\n    gelu = F.gelu\n\n\ndef gelu_fast(x):\n    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))\n\n\nACT2FN = {\n    \"relu\": F.relu,\n    \"swish\": swish,\n    \"gelu\": gelu,\n    \"tanh\": torch.tanh,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n}\n\n\ndef get_activation(activation_string):\n    if activation_string in ACT2FN:\n        return ACT2FN[activation_string]\n    else:\n        raise KeyError(\"function {} not found in ACT2FN mapping {}\".format(activation_string, list(ACT2FN.keys())))\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/another_try.py",
    "content": "from transformers import TFBertModel, BertTokenizer, BertConfig\nimport tensorflow as tf\n\nconfig = BertConfig.from_pretrained(\"bert-base-cased\", output_hidden_states=True)\nmodel = TFBertModel.from_pretrained(\"bert-base-cased\", config=config)\n\ntok = BertTokenizer.from_pretrained(\"bert-base-cased\")\ntext = tok.encode(\"Ain't this [MASK] best thing you've ever seen?\")\n\ninputs = tf.constant(text)\noutputs = model.predict(inputs)\n\nprint(outputs)"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/benchmark/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom ..file_utils import is_torch_available\n\n\nif is_torch_available():\n    from .benchmark_args import PyTorchBenchmarkArguments\n    from .benchmark import PyTorchBenchmark\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/benchmark/benchmark.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\n    Benchmarking the library on inference and training in PyTorch.\n\"\"\"\n\n\nimport inspect\nimport logging\nimport timeit\n\nfrom transformers import MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, PretrainedConfig, is_torch_available\n\nfrom .benchmark_utils import Benchmark, Memory, start_memory_tracing, stop_memory_tracing\n\n\nif is_torch_available():\n    import torch\n    from .benchmark_args import PyTorchBenchmarkArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PyTorchBenchmark(Benchmark):\n\n    args: PyTorchBenchmarkArguments\n    configs: PretrainedConfig\n    framework: str = \"PyTorch\"\n\n    @property\n    def framework_version(self):\n        return torch.__version__\n\n    def train(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_WITH_LM_HEAD_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.train()\n\n            input_ids = torch.randint(\n                model.config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n\n            def compute_loss_and_backprob():\n                # TODO: Not all models call labels argument labels => this hack using the function signature should be corrected once all models have a common name for labels\n                function_argument_names = inspect.getfullargspec(model.forward).args\n                if \"labels\" in function_argument_names:\n                    loss = model(input_ids, labels=input_ids)[0]\n                elif \"lm_labels\" in function_argument_names:\n                    loss = model(input_ids, lm_labels=input_ids)[0]\n                elif \"masked_lm_labels\" in function_argument_names:\n                    loss = model(input_ids, masked_lm_labels=input_ids)[0]\n                else:\n                    NotImplementedError(f\"{model_name} does not seem to allow training with labels\")\n\n                loss.backward()\n                model.zero_grad()\n\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    torch.cuda.reset_peak_memory_stats()\n\n                # calculate loss and do backpropagation\n                compute_loss_and_backprob()\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    memory = Memory(torch.cuda.max_memory_reserved())\n\n                
return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: compute_loss_and_backprob(), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n\n    def inference(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.eval()\n\n            input_ids = torch.randint(\n                config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        torch.cuda.reset_peak_memory_stats()\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        torch.cuda.reset_max_memory_cached()\n\n                model(input_ids)\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        memory = Memory(torch.cuda.max_memory_reserved())\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        memory = Memory(torch.cuda.max_memory_cached())\n\n                return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: model(input_ids), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/benchmark/benchmark_args.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom ..file_utils import cached_property, is_torch_available, torch_required\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    import torch\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass PyTorchBenchmarkArguments(BenchmarkArguments):\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Whether to run on available cuda devices\"})\n    torchscript: bool = field(default=False, metadata={\"help\": \"Trace the models using torchscript\"})\n    fp16: bool = field(default=False, metadata={\"help\": \"Use FP16 to accelerate inference.\"})\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        else:\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device_idx(self) -> int:\n        return torch.cuda.current_device()\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/benchmark/benchmark_args_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport dataclasses\nimport json\nfrom dataclasses import dataclass, field\nfrom time import time\nfrom typing import List\n\n\ndef list_field(default=None, metadata=None):\n    return field(default_factory=lambda: default, metadata=metadata)\n\n\n@dataclass\nclass BenchmarkArguments:\n    \"\"\"\n    BenchMarkArguments are arguments we use in our benchmark scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    models: List[str] = list_field(\n        default=[],\n        metadata={\n            \"help\": \"Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models\"\n        },\n    )\n\n    batch_sizes: List[int] = list_field(\n        default=[8], metadata={\"help\": \"List of batch sizes for which memory and time performance will be evaluated\"}\n    )\n\n    sequence_lengths: List[int] = list_field(\n        default=[8, 32, 128, 512],\n        metadata={\"help\": \"List of sequence lengths for which memory and time performance will be evaluated\"},\n    )\n\n    no_inference: bool = field(default=False, metadata={\"help\": \"Don't benchmark inference of model\"})\n    training: bool = field(default=False, metadata={\"help\": \"Benchmark training of model\"})\n    verbose: bool = field(default=False, metadata={\"help\": \"Verbose memory tracing\"})\n    no_speed: bool = field(default=False, metadata={\"help\": \"Don't perform speed measurments\"})\n    no_memory: bool = field(default=False, metadata={\"help\": \"Don't perform memory measurments\"})\n    trace_memory_line_by_line: bool = field(default=False, metadata={\"help\": \"Trace memory line by line\"})\n    save_to_csv: bool = field(default=False, metadata={\"help\": \"Save result to a CSV file\"})\n    log_print: bool = field(default=False, metadata={\"help\": \"Save all print statements in a log file\"})\n    no_env_print: bool = field(default=False, metadata={\"help\": \"Don't print environment information\"})\n    inference_time_csv_file: str = field(\n        default=f\"inference_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv.\"},\n    )\n    inference_memory_csv_file: str = field(\n        default=f\"inference_memory_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving memory results to csv.\"},\n    )\n    train_time_csv_file: str = field(\n        default=f\"train_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv for training.\"},\n    )\n    train_memory_csv_file: str = field(\n        default=f\"train_memory_{round(time())}.csv\",\n        metadata={\"help\": 
\"CSV filename used if saving memory results to csv for training.\"},\n    )\n    env_info_csv_file: str = field(\n        default=f\"env_info_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving environment information.\"},\n    )\n    log_filename: str = field(\n        default=f\"log_{round(time())}.csv\",\n        metadata={\"help\": \"Log filename used if print statements are saved in log.\"},\n    )\n    repeat: int = field(default=3, metadata={\"help\": \"Times an experiment will be run.\"})\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    @property\n    def model_names(self):\n        return self.models\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/benchmark/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport copy\nimport csv\nimport linecache\nimport logging\nimport os\nimport platform\nimport sys\nfrom abc import ABC, abstractmethod\nfrom collections import defaultdict, namedtuple\nfrom datetime import datetime\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom transformers import AutoConfig, PretrainedConfig\nfrom transformers import __version__ as version\n\nfrom ..file_utils import is_tf_available, is_torch_available\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\n\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\nBenchmarkOutput = namedtuple(\n    \"BenchmarkOutput\", [\"time_inference_result\", \"memory_inference_result\", \"time_train_result\", \"memory_train_result\"]\n)\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable str of the number of mega bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return str(bytes_to_mega_bytes(self.bytes))\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` 
namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    current: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. 
Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. \"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. 
\" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n\n            for i in devices:\n                handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - 
`memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        memory_curr_trace = []\n\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n\n        for ((frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem),) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n\n            memory_curr_trace.append(\n                MemoryState(\n                    frame=frame,\n                    cpu=Memory(next_cpu_mem),\n                    gpu=Memory(next_gpu_mem),\n                    cpu_gpu=Memory(next_gpu_mem + next_cpu_mem),\n                )\n            )\n\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n   
         cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        memory_curr_trace = sorted(memory_curr_trace, key=lambda x: x.cpu_gpu.bytes, reverse=True)\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n\n        total_memory = Memory(total_memory)\n\n        return MemorySummary(\n            sequential=memory_diff_trace, cumulative=cumulative_memory, current=memory_curr_trace, total=total_memory,\n        )\n\n    return None\n\n\ndef bytes_to_mega_bytes(memory_amount: int) -> int:\n    \"\"\" Utility to convert a number of bytes (int) into a number of mega bytes (int)\n    \"\"\"\n    return memory_amount >> 20\n\n\nclass Benchmark(ABC):\n    \"\"\"\n    Benchmarks is a simple but feature-complete benchmarking script\n    to compare memory and time performance of models in Transformers.\n    \"\"\"\n\n    args: BenchmarkArguments\n    configs: PretrainedConfig\n    framework: str\n\n    def __init__(self, args: BenchmarkArguments = None, configs: PretrainedConfig = None):\n        self.args = args\n\n        if configs is None:\n            self.config_dict = {\n                model_name: AutoConfig.from_pretrained(model_name) for model_name in self.args.model_names\n            }\n        else:\n            self.config_dict = {model_name: config for model_name, config in zip(self.args.model_names, configs)}\n\n        self._print_fn = None\n        self._framework_version = None\n        self._environment_info = None\n\n    @property\n    def print_fn(self):\n        if self._print_fn is None:\n            if self.args.log_print:\n                logging.basicConfig(\n                    level=logging.DEBUG,\n                    filename=self.args.log_filename,\n                    filemode=\"a+\",\n                    format=\"%(asctime)-15s %(levelname)-8s %(message)s\",\n                )\n\n                def print_and_log(*args):\n                    logging.info(*args)\n                    print(*args)\n\n                self._print_fn = print_and_log\n            else:\n                self._print_fn = print\n        return self._print_fn\n\n    @property\n    def is_gpu(self):\n        return self.args.n_gpu > 0\n\n    @property\n    @abstractmethod\n    def framework_version(self):\n        pass\n\n    @abstractmethod\n    def train(self, model_name, batch_size, sequence_length):\n        pass\n\n    @abstractmethod\n    def inference(self, model_name, batch_size, sequence_length):\n        pass\n\n    def run(self):\n        result_dict = {model_name: {} for model_name in self.args.model_names}\n        inference_result_time = copy.deepcopy(result_dict)\n        inference_result_memory = copy.deepcopy(result_dict)\n        train_result_time = copy.deepcopy(result_dict)\n        train_result_memory = copy.deepcopy(result_dict)\n\n        for c, 
model_name in enumerate(self.args.model_names):\n            self.print_fn(f\"{c + 1} / {len(self.args.model_names)}\")\n\n            model_dict = {\n                \"bs\": self.args.batch_sizes,\n                \"ss\": self.args.sequence_lengths,\n                \"result\": {i: {} for i in self.args.batch_sizes},\n            }\n            inference_result_time[model_name] = copy.deepcopy(model_dict)\n            inference_result_memory[model_name] = copy.deepcopy(model_dict)\n            train_result_time[model_name] = copy.deepcopy(model_dict)\n            train_result_memory[model_name] = copy.deepcopy(model_dict)\n\n            for batch_size in self.args.batch_sizes:\n                for sequence_length in self.args.sequence_lengths:\n                    if not self.args.no_inference:\n                        if not self.args.no_memory:\n                            memory = self.inference(model_name, batch_size, sequence_length, trace_memory=True)\n                            inference_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            inference_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n                    if self.args.training:\n                        if not self.args.no_memory:\n                            memory = self.train(model_name, batch_size, sequence_length, trace_memory=True)\n                            train_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            train_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n        if not self.args.no_inference:\n            if not self.args.no_speed:\n                self.print_fn(\"======= INFERENCE - SPEED - RESULT =======\")\n                self.print_results(inference_result_time)\n                self.save_to_csv(inference_result_time, self.args.inference_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= INFERENCE - MEMORY - RESULT =======\")\n                self.print_results(inference_result_memory)\n                self.save_to_csv(inference_result_memory, self.args.inference_memory_csv_file)\n\n        if self.args.training:\n            if not self.args.no_speed:\n                self.print_fn(\"======= TRAIN - SPEED - RESULT =======\")\n                self.print_results(train_result_time)\n                self.save_to_csv(train_result_time, self.args.train_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= TRAIN - MEMORY - RESULT =======\")\n                self.print_results(train_result_memory)\n                self.save_to_csv(train_result_memory, self.args.train_memory_csv_file)\n\n        if not self.args.no_env_print:\n            self.print_fn(\"\\n======== ENVIRONMENT - INFORMATION ========\")\n            self.print_fn(\n                \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in self.environment_info.items()]) + \"\\n\"\n            )\n\n        if self.args.save_to_csv:\n            with open(self.args.env_info_csv_file, mode=\"w\", newline=\"\") as csv_file:\n                writer = 
csv.writer(csv_file)\n                for key, value in self.environment_info.items():\n                    writer.writerow([key, value])\n\n        return BenchmarkOutput(inference_result_time, inference_result_memory, train_result_time, train_result_memory)\n\n    @property\n    def environment_info(self):\n        if self._environment_info is None:\n            info = {}\n            info[\"transformers_version\"] = version\n            info[\"framework\"] = self.framework\n            info[\"framework_version\"] = self.framework_version\n            info[\"python_version\"] = platform.python_version()\n            info[\"system\"] = platform.system()\n            info[\"cpu\"] = platform.processor()\n            info[\"architecture\"] = platform.architecture()[0]\n            info[\"date\"] = datetime.date(datetime.now())\n            info[\"time\"] = datetime.time(datetime.now())\n\n            try:\n                import psutil\n            except (ImportError):\n                logger.warning(\n                    \"Psutil not installed, we won't log available CPU memory.\"\n                    \"Install psutil (pip install psutil) to log available CPU memory.\"\n                )\n                info[\"cpu_ram_mb\"] = \"N/A\"\n            else:\n                info[\"cpu_ram_mb\"] = bytes_to_mega_bytes(psutil.virtual_memory().total)\n\n            info[\"use_gpu\"] = self.is_gpu\n            if self.is_gpu:\n                info[\"num_gpus\"] = self.args.n_gpu\n                try:\n                    from py3nvml import py3nvml\n\n                    py3nvml.nvmlInit()\n                    handle = py3nvml.nvmlDeviceGetHandleByIndex(self.args.device_idx)\n                except ImportError:\n                    logger.warning(\n                        \"py3nvml not installed, we won't log GPU memory usage. \"\n                        \"Install py3nvml (pip install py3nvml) to log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                except (OSError, py3nvml.NVMLError):\n                    logger.warning(\n                        \"Error while initializing comunication with GPU. 
\" \"We won't log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                    py3nvml.nvmlShutdown()\n                else:\n                    info[\"gpu\"] = py3nvml.nvmlDeviceGetName(handle)\n                    info[\"gpu_ram_mb\"] = bytes_to_mega_bytes(py3nvml.nvmlDeviceGetMemoryInfo(handle).total)\n                    info[\"gpu_power_watts\"] = py3nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000\n                    info[\"gpu_performance_state\"] = py3nvml.nvmlDeviceGetPerformanceState(handle)\n                    py3nvml.nvmlShutdown()\n\n            self._environment_info = info\n        return self._environment_info\n\n    def print_results(self, result_dict):\n        for model_name in self.args.model_names:\n            self.print_fn(\"\\t\" + f\"======= MODEL CHECKPOINT: {model_name} =======\")\n            for batch_size in result_dict[model_name][\"bs\"]:\n                for sequence_length in result_dict[model_name][\"ss\"]:\n                    result = result_dict[model_name][\"result\"][batch_size][sequence_length]\n                    if isinstance(result, float):\n                        self.print_fn(\n                            f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{(round(1000 * result) / 1000)}s\"\n                        )\n                    else:\n                        self.print_fn(f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{result} MB\")\n\n    def print_memory_trace_statistics(self, summary: MemorySummary):\n        self.print_fn(\n            \"\\nLine by line memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.sequential\n            )\n        )\n        self.print_fn(\n            \"\\nLines with top memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[:6]\n            )\n        )\n        self.print_fn(\n            \"\\nLines with lowest memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[-6:]\n            )\n        )\n        self.print_fn(f\"\\nTotal memory increase: {summary.total}\")\n\n    def save_to_csv(self, result_dict, filename):\n        if not self.args.save_to_csv:\n            return\n        self.print_fn(\"Saving results to csv.\")\n        with open(filename, mode=\"w\") as csv_file:\n\n            assert len(self.args.model_names) > 0, \"At least 1 model should be defined, but got {}\".format(\n                self.model_names\n            )\n\n            fieldnames = [\"model\", \"batch_size\", \"sequence_length\"]\n            writer = csv.DictWriter(csv_file, fieldnames=fieldnames + [\"result\"])\n            writer.writeheader()\n\n            for model_name in self.args.model_names:\n                result_dict_model = result_dict[model_name][\"result\"]\n                for bs in result_dict_model:\n                    for ss in result_dict_model[bs]:\n               
         result_model = result_dict_model[bs][ss]\n                        writer.writerow(\n                            {\n                                \"model\": model_name,\n                                \"batch_size\": bs,\n                                \"sequence_length\": ss,\n                                \"result\": (\"{}\" if not isinstance(result_model, float) else \"{:.4f}\").format(\n                                    result_model\n                                ),\n                            }\n                        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport linecache\nimport logging\nimport os\nimport sys\nfrom collections import defaultdict\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom .file_utils import is_tf_available, is_torch_available\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable string of the number of bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return bytes_to_human_readable(self.bytes)\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest 
memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. 
\"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. \" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n            for i in devices:\n                
handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - `memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n        for (frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = 
next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n            cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n        total_memory = Memory(total_memory)\n        return MemorySummary(sequential=memory_diff_trace, cumulative=cumulative_memory, total=total_memory)\n\n    return None\n\n\ndef bytes_to_human_readable(memory_amount: int) -> str:\n    \"\"\" Utility to convert a number of bytes (int) in a human readable string (with units)\n    \"\"\"\n    for unit in [\"B\", \"KB\", \"MB\", \"GB\"]:\n        if memory_amount > -1024.0 and memory_amount < 1024.0:\n            return \"{:.3f}{}\".format(memory_amount, unit)\n        memory_amount /= 1024.0\n    return \"{:.3f}TB\".format(memory_amount)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/__init__.py",
    "content": "from abc import ABC, abstractmethod\nfrom argparse import ArgumentParser\n\n\nclass BaseTransformersCLICommand(ABC):\n    @staticmethod\n    @abstractmethod\n    def register_subcommand(parser: ArgumentParser):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def run(self):\n        raise NotImplementedError()\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/convert.py",
    "content": "from argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef convert_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to convert a model TF 1.0 checkpoint in a PyTorch checkpoint.\n    :return: ServeCommand\n    \"\"\"\n    return ConvertCommand(\n        args.model_type, args.tf_checkpoint, args.pytorch_dump_output, args.config, args.finetuning_task_name\n    )\n\n\nclass ConvertCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\n            \"convert\",\n            help=\"CLI tool to run convert model from original \"\n            \"author checkpoints to Transformers PyTorch checkpoints.\",\n        )\n        train_parser.add_argument(\"--model_type\", type=str, required=True, help=\"Model's type.\")\n        train_parser.add_argument(\n            \"--tf_checkpoint\", type=str, required=True, help=\"TensorFlow checkpoint path or folder.\"\n        )\n        train_parser.add_argument(\n            \"--pytorch_dump_output\", type=str, required=True, help=\"Path to the PyTorch savd model output.\"\n        )\n        train_parser.add_argument(\"--config\", type=str, default=\"\", help=\"Configuration file path or folder.\")\n        train_parser.add_argument(\n            \"--finetuning_task_name\",\n            type=str,\n            default=None,\n            help=\"Optional fine-tuning task name if the TF model was a finetuned model.\",\n        )\n        train_parser.set_defaults(func=convert_command_factory)\n\n    def __init__(\n        self,\n        model_type: str,\n        tf_checkpoint: str,\n        pytorch_dump_output: str,\n        config: str,\n        finetuning_task_name: str,\n        *args\n    ):\n        self._logger = getLogger(\"transformers1-cli/converting\")\n\n        self._logger.info(\"Loading model {}\".format(model_type))\n        self._model_type = model_type\n        self._tf_checkpoint = tf_checkpoint\n        self._pytorch_dump_output = pytorch_dump_output\n        self._config = config\n        self._finetuning_task_name = finetuning_task_name\n\n    def run(self):\n        if self._model_type == \"albert\":\n            try:\n                from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"bert\":\n            try:\n                from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"gpt\":\n            from transformers.convert_openai_original_tf_checkpoint_to_pytorch import (\n                convert_openai_checkpoint_to_pytorch,\n            )\n\n            convert_openai_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"transfo_xl\":\n            try:\n                from transformers.convert_transfo_xl_original_tf_checkpoint_to_pytorch import (\n                    convert_transfo_xl_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            if \"ckpt\" in self._tf_checkpoint.lower():\n                TF_CHECKPOINT = self._tf_checkpoint\n                TF_DATASET_FILE = \"\"\n            else:\n                TF_DATASET_FILE = self._tf_checkpoint\n                TF_CHECKPOINT = \"\"\n            convert_transfo_xl_checkpoint_to_pytorch(\n                TF_CHECKPOINT, self._config, self._pytorch_dump_output, TF_DATASET_FILE\n            )\n        elif self._model_type == \"gpt2\":\n            try:\n                from transformers.convert_gpt2_original_tf_checkpoint_to_pytorch import (\n                    convert_gpt2_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_gpt2_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"xlnet\":\n            try:\n                from transformers.convert_xlnet_original_tf_checkpoint_to_pytorch import (\n                    convert_xlnet_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_xlnet_checkpoint_to_pytorch(\n                self._tf_checkpoint, self._config, self._pytorch_dump_output, self._finetuning_task_name\n            )\n        elif self._model_type == \"xlm\":\n            from transformers.convert_xlm_original_pytorch_checkpoint_to_pytorch import (\n                convert_xlm_checkpoint_to_pytorch,\n            )\n\n            convert_xlm_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)\n        else:\n            raise ValueError(\"--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm]\")\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/download.py",
    "content": "from argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef download_command_factory(args):\n    return DownloadCommand(args.model, args.cache_dir, args.force)\n\n\nclass DownloadCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"download\")\n        download_parser.add_argument(\n            \"--cache-dir\", type=str, default=None, help=\"Path to location to store the models\"\n        )\n        download_parser.add_argument(\n            \"--force\", action=\"store_true\", help=\"Force the model to be download even if already in cache-dir\"\n        )\n        download_parser.add_argument(\"model\", type=str, help=\"Name of the model to download\")\n        download_parser.set_defaults(func=download_command_factory)\n\n    def __init__(self, model: str, cache: str, force: bool):\n        self._model = model\n        self._cache = cache\n        self._force = force\n\n    def run(self):\n        from transformers import AutoModel, AutoTokenizer\n\n        AutoModel.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n        AutoTokenizer.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/env.py",
    "content": "import platform\nfrom argparse import ArgumentParser\n\nfrom transformers import __version__ as version\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef info_command_factory(_):\n    return EnvironmentCommand()\n\n\nclass EnvironmentCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"env\")\n        download_parser.set_defaults(func=info_command_factory)\n\n    def run(self):\n        pt_version = \"not installed\"\n        pt_cuda_available = \"NA\"\n        if is_torch_available():\n            import torch\n\n            pt_version = torch.__version__\n            pt_cuda_available = torch.cuda.is_available()\n\n        tf_version = \"not installed\"\n        tf_cuda_available = \"NA\"\n        if is_tf_available():\n            import tensorflow as tf\n\n            tf_version = tf.__version__\n            try:\n                # deprecated in v2.1\n                tf_cuda_available = tf.test.is_gpu_available()\n            except AttributeError:\n                # returns list of devices, convert to bool\n                tf_cuda_available = bool(tf.config.list_physical_devices(\"GPU\"))\n\n        info = {\n            \"`transformers1` version\": version,\n            \"Platform\": platform.platform(),\n            \"Python version\": platform.python_version(),\n            \"PyTorch version (GPU?)\": \"{} ({})\".format(pt_version, pt_cuda_available),\n            \"Tensorflow version (GPU?)\": \"{} ({})\".format(tf_version, tf_cuda_available),\n            \"Using GPU in script?\": \"<fill in>\",\n            \"Using distributed or parallel set-up in script?\": \"<fill in>\",\n        }\n\n        print(\"\\nCopy-and-paste the text below in your GitHub issue and FILL OUT the two last points.\\n\")\n        print(self.format_dict(info))\n\n        return info\n\n    @staticmethod\n    def format_dict(d):\n        return \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in d.items()]) + \"\\n\"\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/run.py",
    "content": "import logging\nfrom argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, Pipeline, PipelineDataFormat, pipeline\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\ndef try_infer_format_from_ext(path: str):\n    if not path:\n        return \"pipe\"\n\n    for ext in PipelineDataFormat.SUPPORTED_FORMATS:\n        if path.endswith(ext):\n            return ext\n\n    raise Exception(\n        \"Unable to determine file format from file extension {}. \"\n        \"Please provide the format through --format {}\".format(path, PipelineDataFormat.SUPPORTED_FORMATS)\n    )\n\n\ndef run_command_factory(args):\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    format = try_infer_format_from_ext(args.input) if args.format == \"infer\" else args.format\n    reader = PipelineDataFormat.from_str(\n        format=format,\n        output_path=args.output,\n        input_path=args.input,\n        column=args.column if args.column else nlp.default_input_names,\n        overwrite=args.overwrite,\n    )\n    return RunCommand(nlp, reader)\n\n\nclass RunCommand(BaseTransformersCLICommand):\n    def __init__(self, nlp: Pipeline, reader: PipelineDataFormat):\n        self._nlp = nlp\n        self._reader = reader\n\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        run_parser = parser.add_parser(\"run\", help=\"Run a pipeline through the CLI\")\n        run_parser.add_argument(\"--task\", choices=SUPPORTED_TASKS.keys(), help=\"Task to run\")\n        run_parser.add_argument(\"--input\", type=str, help=\"Path to the file to use for inference\")\n        run_parser.add_argument(\"--output\", type=str, help=\"Path to the file that will be used post to write results.\")\n        run_parser.add_argument(\"--model\", type=str, help=\"Name or path to the model to instantiate.\")\n        run_parser.add_argument(\"--config\", type=str, help=\"Name or path to the model's config to instantiate.\")\n        run_parser.add_argument(\n            \"--tokenizer\", type=str, help=\"Name of the tokenizer to use. (default: same as the model name)\"\n        )\n        run_parser.add_argument(\n            \"--column\",\n            type=str,\n            help=\"Name of the column to use as input. 
(For multi columns input as QA use column1,columns2)\",\n        )\n        run_parser.add_argument(\n            \"--format\",\n            type=str,\n            default=\"infer\",\n            choices=PipelineDataFormat.SUPPORTED_FORMATS,\n            help=\"Input format to read from\",\n        )\n        run_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        run_parser.add_argument(\"--overwrite\", action=\"store_true\", help=\"Allow overwriting the output file.\")\n        run_parser.set_defaults(func=run_command_factory)\n\n    def run(self):\n        nlp, outputs = self._nlp, []\n\n        for entry in self._reader:\n            output = nlp(**entry) if self._reader.is_multi_columns else nlp(entry)\n            if isinstance(output, dict):\n                outputs.append(output)\n            else:\n                outputs += output\n\n        # Saving data\n        if self._nlp.binary_output:\n            binary_path = self._reader.save_binary(outputs)\n            logger.warning(\"Current pipeline requires output to be in binary format, saving at {}\".format(binary_path))\n        else:\n            self._reader.save(outputs)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/serving.py",
    "content": "import logging\nfrom argparse import ArgumentParser, Namespace\nfrom typing import Any, List, Optional\n\nfrom transformers import Pipeline\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, pipeline\n\n\ntry:\n    from uvicorn import run\n    from fastapi import FastAPI, HTTPException, Body\n    from fastapi.routing import APIRoute\n    from pydantic import BaseModel\n    from starlette.responses import JSONResponse\n\n    _serve_dependencies_installed = True\nexcept (ImportError, AttributeError):\n    BaseModel = object\n\n    def Body(*x, **y):\n        pass\n\n    _serve_dependencies_installed = False\n\n\nlogger = logging.getLogger(\"transformers1-cli/serving\")\n\n\ndef serve_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    return ServeCommand(nlp, args.host, args.port, args.workers)\n\n\nclass ServeModelInfoResult(BaseModel):\n    \"\"\"\n    Expose model information\n    \"\"\"\n\n    infos: dict\n\n\nclass ServeTokenizeResult(BaseModel):\n    \"\"\"\n    Tokenize result model\n    \"\"\"\n\n    tokens: List[str]\n    tokens_ids: Optional[List[int]]\n\n\nclass ServeDeTokenizeResult(BaseModel):\n    \"\"\"\n    DeTokenize result model\n    \"\"\"\n\n    text: str\n\n\nclass ServeForwardResult(BaseModel):\n    \"\"\"\n    Forward result model\n    \"\"\"\n\n    output: Any\n\n\nclass ServeCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        serve_parser = parser.add_parser(\n            \"serve\", help=\"CLI tool to run inference requests through REST and GraphQL endpoints.\"\n        )\n        serve_parser.add_argument(\n            \"--task\", type=str, choices=SUPPORTED_TASKS.keys(), help=\"The task to run the pipeline on\"\n        )\n        serve_parser.add_argument(\"--host\", type=str, default=\"localhost\", help=\"Interface the server will listen on.\")\n        serve_parser.add_argument(\"--port\", type=int, default=8888, help=\"Port the serving will listen to.\")\n        serve_parser.add_argument(\"--workers\", type=int, default=1, help=\"Number of http workers\")\n        serve_parser.add_argument(\"--model\", type=str, help=\"Model's name or path to stored model.\")\n        serve_parser.add_argument(\"--config\", type=str, help=\"Model's config name or path to stored model.\")\n        serve_parser.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer name to use.\")\n        serve_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        serve_parser.set_defaults(func=serve_command_factory)\n\n    def __init__(self, pipeline: Pipeline, host: str, port: int, workers: int):\n\n        self._pipeline = pipeline\n\n        self.host = host\n        self.port = port\n        self.workers = workers\n\n        if not 
_serve_dependencies_installed:\n            raise RuntimeError(\n                \"Using serve command requires FastAPI and unicorn. \"\n                'Please install transformers1 with [serving]: pip install \"transformers1[serving]\".'\n                \"Or install FastAPI and unicorn separately.\"\n            )\n        else:\n            logger.info(\"Serving model over {}:{}\".format(host, port))\n            self._app = FastAPI(\n                routes=[\n                    APIRoute(\n                        \"/\",\n                        self.model_info,\n                        response_model=ServeModelInfoResult,\n                        response_class=JSONResponse,\n                        methods=[\"GET\"],\n                    ),\n                    APIRoute(\n                        \"/tokenize\",\n                        self.tokenize,\n                        response_model=ServeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/detokenize\",\n                        self.detokenize,\n                        response_model=ServeDeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/forward\",\n                        self.forward,\n                        response_model=ServeForwardResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                ],\n                timeout=600,\n            )\n\n    def run(self):\n        run(self._app, host=self.host, port=self.port, workers=self.workers)\n\n    def model_info(self):\n        return ServeModelInfoResult(infos=vars(self._pipeline.model.config))\n\n    def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):\n        \"\"\"\n        Tokenize the provided input and eventually returns corresponding tokens id:\n        - **text_input**: String to tokenize\n        - **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer mapping.\n        \"\"\"\n        try:\n            tokens_txt = self._pipeline.tokenizer.tokenize(text_input)\n\n            if return_ids:\n                tokens_ids = self._pipeline.tokenizer.convert_tokens_to_ids(tokens_txt)\n                return ServeTokenizeResult(tokens=tokens_txt, tokens_ids=tokens_ids)\n            else:\n                return ServeTokenizeResult(tokens=tokens_txt)\n\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    def detokenize(\n        self,\n        tokens_ids: List[int] = Body(None, embed=True),\n        skip_special_tokens: bool = Body(False, embed=True),\n        cleanup_tokenization_spaces: bool = Body(True, embed=True),\n    ):\n        \"\"\"\n        Detokenize the provided tokens ids to readable text:\n        - **tokens_ids**: List of tokens ids\n        - **skip_special_tokens**: Flag indicating to not try to decode special tokens\n        - **cleanup_tokenization_spaces**: Flag indicating to remove all leading/trailing spaces and intermediate ones.\n        \"\"\"\n        try:\n            decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)\n            return 
ServeDeTokenizeResult(model=\"\", text=decoded_str)\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    async def forward(self, inputs=Body(None, embed=True)):\n        \"\"\"\n        **inputs**:\n        **attention_mask**:\n        **tokens_type_ids**:\n        \"\"\"\n\n        # Check we don't have empty string\n        if len(inputs) == 0:\n            return ServeForwardResult(output=[], attention=[])\n\n        try:\n            # Forward through the model\n            output = self._pipeline(inputs)\n            return ServeForwardResult(output=output)\n        except Exception as e:\n            raise HTTPException(500, {\"error\": str(e)})\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/train.py",
    "content": "import os\nfrom argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers import SingleSentenceClassificationProcessor as Processor\nfrom transformers import TextClassificationPipeline, is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\nif not is_tf_available() and not is_torch_available():\n    raise RuntimeError(\"At least one of PyTorch or TensorFlow 2.0+ should be installed to use CLI training\")\n\n# TF training parameters\nUSE_XLA = False\nUSE_AMP = False\n\n\ndef train_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    return TrainCommand(args)\n\n\nclass TrainCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\"train\", help=\"CLI tool to train a model on a task.\")\n\n        train_parser.add_argument(\n            \"--train_data\",\n            type=str,\n            required=True,\n            help=\"path to train (and optionally evaluation) dataset as a csv with \"\n            \"tab separated labels and sentences.\",\n        )\n        train_parser.add_argument(\n            \"--column_label\", type=int, default=0, help=\"Column of the dataset csv file with example labels.\"\n        )\n        train_parser.add_argument(\n            \"--column_text\", type=int, default=1, help=\"Column of the dataset csv file with example texts.\"\n        )\n        train_parser.add_argument(\n            \"--column_id\", type=int, default=2, help=\"Column of the dataset csv file with example ids.\"\n        )\n        train_parser.add_argument(\n            \"--skip_first_row\", action=\"store_true\", help=\"Skip the first row of the csv file (headers).\"\n        )\n\n        train_parser.add_argument(\"--validation_data\", type=str, default=\"\", help=\"path to validation dataset.\")\n        train_parser.add_argument(\n            \"--validation_split\",\n            type=float,\n            default=0.1,\n            help=\"if validation dataset is not provided, fraction of train dataset \" \"to use as validation dataset.\",\n        )\n\n        train_parser.add_argument(\"--output\", type=str, default=\"./\", help=\"path to saved the trained model.\")\n\n        train_parser.add_argument(\n            \"--task\", type=str, default=\"text_classification\", help=\"Task to train the model on.\"\n        )\n        train_parser.add_argument(\n            \"--model\", type=str, default=\"bert-base-uncased\", help=\"Model's name or path to stored model.\"\n        )\n        train_parser.add_argument(\"--train_batch_size\", type=int, default=32, help=\"Batch size for training.\")\n        train_parser.add_argument(\"--valid_batch_size\", type=int, default=64, help=\"Batch size for validation.\")\n        train_parser.add_argument(\"--learning_rate\", type=float, default=3e-5, help=\"Learning rate.\")\n        train_parser.add_argument(\"--adam_epsilon\", type=float, default=1e-08, help=\"Epsilon for Adam optimizer.\")\n        train_parser.set_defaults(func=train_command_factory)\n\n    def __init__(self, args: Namespace):\n        self.logger = 
getLogger(\"transformers1-cli/training\")\n\n        self.framework = \"tf\" if is_tf_available() else \"torch\"\n\n        os.makedirs(args.output, exist_ok=True)\n        assert os.path.isdir(args.output)\n        self.output = args.output\n\n        self.column_label = args.column_label\n        self.column_text = args.column_text\n        self.column_id = args.column_id\n\n        self.logger.info(\"Loading {} pipeline for {}\".format(args.task, args.model))\n        if args.task == \"text_classification\":\n            self.pipeline = TextClassificationPipeline.from_pretrained(args.model)\n        elif args.task == \"token_classification\":\n            raise NotImplementedError\n        elif args.task == \"question_answering\":\n            raise NotImplementedError\n\n        self.logger.info(\"Loading dataset from {}\".format(args.train_data))\n        self.train_dataset = Processor.create_from_csv(\n            args.train_data,\n            column_label=args.column_label,\n            column_text=args.column_text,\n            column_id=args.column_id,\n            skip_first_row=args.skip_first_row,\n        )\n        self.valid_dataset = None\n        if args.validation_data:\n            self.logger.info(\"Loading validation dataset from {}\".format(args.validation_data))\n            self.valid_dataset = Processor.create_from_csv(\n                args.validation_data,\n                column_label=args.column_label,\n                column_text=args.column_text,\n                column_id=args.column_id,\n                skip_first_row=args.skip_first_row,\n            )\n\n        self.validation_split = args.validation_split\n        self.train_batch_size = args.train_batch_size\n        self.valid_batch_size = args.valid_batch_size\n        self.learning_rate = args.learning_rate\n        self.adam_epsilon = args.adam_epsilon\n\n    def run(self):\n        if self.framework == \"tf\":\n            return self.run_tf()\n        return self.run_torch()\n\n    def run_torch(self):\n        raise NotImplementedError\n\n    def run_tf(self):\n        self.pipeline.fit(\n            self.train_dataset,\n            validation_data=self.valid_dataset,\n            validation_split=self.validation_split,\n            learning_rate=self.learning_rate,\n            adam_epsilon=self.adam_epsilon,\n            train_batch_size=self.train_batch_size,\n            valid_batch_size=self.valid_batch_size,\n        )\n\n        # Save trained pipeline\n        self.pipeline.save_pretrained(self.output)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/transformers_cli.py",
    "content": "#!/usr/bin/env python\nfrom argparse import ArgumentParser\n\nfrom transformers.commands.convert import ConvertCommand\nfrom transformers.commands.download import DownloadCommand\nfrom transformers.commands.env import EnvironmentCommand\nfrom transformers.commands.run import RunCommand\nfrom transformers.commands.serving import ServeCommand\nfrom transformers.commands.user import UserCommands\n\n\ndef main():\n    parser = ArgumentParser(\"Transformers CLI tool\", usage=\"transformers1-cli <command> [<args>]\")\n    commands_parser = parser.add_subparsers(help=\"transformers1-cli command helpers\")\n\n    # Register commands\n    ConvertCommand.register_subcommand(commands_parser)\n    DownloadCommand.register_subcommand(commands_parser)\n    EnvironmentCommand.register_subcommand(commands_parser)\n    RunCommand.register_subcommand(commands_parser)\n    ServeCommand.register_subcommand(commands_parser)\n    UserCommands.register_subcommand(commands_parser)\n\n    # Let's go\n    args = parser.parse_args()\n\n    if not hasattr(args, \"func\"):\n        parser.print_help()\n        exit(1)\n\n    # Run\n    service = args.func(args)\n    service.run()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/commands/user.py",
    "content": "import os\nimport sys\nfrom argparse import ArgumentParser\nfrom getpass import getpass\nfrom typing import List, Union\n\nfrom requests.exceptions import HTTPError\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.hf_api import HfApi, HfFolder\n\n\nUPLOAD_MAX_FILES = 15\n\n\nclass UserCommands(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        login_parser = parser.add_parser(\"login\", help=\"Log in using the same credentials as on huggingface.co\")\n        login_parser.set_defaults(func=lambda args: LoginCommand(args))\n        whoami_parser = parser.add_parser(\"whoami\", help=\"Find out which huggingface.co account you are logged in as.\")\n        whoami_parser.set_defaults(func=lambda args: WhoamiCommand(args))\n        logout_parser = parser.add_parser(\"logout\", help=\"Log out\")\n        logout_parser.set_defaults(func=lambda args: LogoutCommand(args))\n        # s3\n        s3_parser = parser.add_parser(\"s3\", help=\"{ls, rm} Commands to interact with the files you upload on S3.\")\n        s3_subparsers = s3_parser.add_subparsers(help=\"s3 related commands\")\n        ls_parser = s3_subparsers.add_parser(\"ls\")\n        ls_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        ls_parser.set_defaults(func=lambda args: ListObjsCommand(args))\n        rm_parser = s3_subparsers.add_parser(\"rm\")\n        rm_parser.add_argument(\"filename\", type=str, help=\"individual object filename to delete from S3.\")\n        rm_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        rm_parser.set_defaults(func=lambda args: DeleteObjCommand(args))\n        # upload\n        upload_parser = parser.add_parser(\"upload\", help=\"Upload a model to S3.\")\n        upload_parser.add_argument(\n            \"path\", type=str, help=\"Local path of the model folder or individual file to upload.\"\n        )\n        upload_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        upload_parser.add_argument(\n            \"--filename\", type=str, default=None, help=\"Optional: override individual object filename on S3.\"\n        )\n        upload_parser.set_defaults(func=lambda args: UploadCommand(args))\n\n\nclass ANSI:\n    \"\"\"\n    Helper for en.wikipedia.org/wiki/ANSI_escape_code\n    \"\"\"\n\n    _bold = \"\\u001b[1m\"\n    _red = \"\\u001b[31m\"\n    _reset = \"\\u001b[0m\"\n\n    @classmethod\n    def bold(cls, s):\n        return \"{}{}{}\".format(cls._bold, s, cls._reset)\n\n    @classmethod\n    def red(cls, s):\n        return \"{}{}{}\".format(cls._bold + cls._red, s, cls._reset)\n\n\nclass BaseUserCommand:\n    def __init__(self, args):\n        self.args = args\n        self._api = HfApi()\n\n\nclass LoginCommand(BaseUserCommand):\n    def run(self):\n        print(\n            \"\"\"\n        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|\n        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|\n        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|\n        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|\n        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|  
      _|    _|    _|_|_|  _|_|_|_|\n\n        \"\"\"\n        )\n        username = input(\"Username: \")\n        password = getpass()\n        try:\n            token = self._api.login(username, password)\n        except HTTPError as e:\n            # probably invalid credentials, display error message.\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        HfFolder.save_token(token)\n        print(\"Login successful\")\n        print(\"Your token:\", token, \"\\n\")\n        print(\"Your token has been saved to\", HfFolder.path_token)\n\n\nclass WhoamiCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        try:\n            user, orgs = self._api.whoami(token)\n            print(user)\n            if orgs:\n                print(ANSI.bold(\"orgs: \"), \",\".join(orgs))\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n\n\nclass LogoutCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        HfFolder.delete_token()\n        self._api.logout(token)\n        print(\"Successfully logged out.\")\n\n\nclass ListObjsCommand(BaseUserCommand):\n    def tabulate(self, rows: List[List[Union[str, int]]], headers: List[str]) -> str:\n        \"\"\"\n        Inspired by:\n        stackoverflow.com/a/8356620/593036\n        stackoverflow.com/questions/9535954/printing-lists-as-tabular-data\n        \"\"\"\n        col_widths = [max(len(str(x)) for x in col) for col in zip(*rows, headers)]\n        row_format = (\"{{:{}}} \" * len(headers)).format(*col_widths)\n        lines = []\n        lines.append(row_format.format(*headers))\n        lines.append(row_format.format(*[\"-\" * w for w in col_widths]))\n        for row in rows:\n            lines.append(row_format.format(*row))\n        return \"\\n\".join(lines)\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            objs = self._api.list_objs(token, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        if len(objs) == 0:\n            print(\"No shared file yet\")\n            exit()\n        rows = [[obj.filename, obj.LastModified, obj.ETag, obj.Size] for obj in objs]\n        print(self.tabulate(rows, headers=[\"Filename\", \"LastModified\", \"ETag\", \"Size\"]))\n\n\nclass DeleteObjCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            self._api.delete_obj(token, filename=self.args.filename, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        print(\"Done\")\n\n\nclass UploadCommand(BaseUserCommand):\n    def walk_dir(self, rel_path):\n        \"\"\"\n        Recursively list all files in a folder.\n        \"\"\"\n        entries: List[os.DirEntry] = list(os.scandir(rel_path))\n        files = [(os.path.join(os.getcwd(), f.path), f.path) for f in entries if f.is_file()]  # (filepath, filename)\n        for f in 
entries:\n            if f.is_dir():\n                files += self.walk_dir(f.path)\n        return files\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        local_path = os.path.abspath(self.args.path)\n        if os.path.isdir(local_path):\n            if self.args.filename is not None:\n                raise ValueError(\"Cannot specify a filename override when uploading a folder.\")\n            rel_path = os.path.basename(local_path)\n            files = self.walk_dir(rel_path)\n        elif os.path.isfile(local_path):\n            filename = self.args.filename if self.args.filename is not None else os.path.basename(local_path)\n            files = [(local_path, filename)]\n        else:\n            raise ValueError(\"Not a valid file or directory: {}\".format(local_path))\n\n        if sys.platform == \"win32\":\n            files = [(filepath, filename.replace(os.sep, \"/\")) for filepath, filename in files]\n\n        if len(files) > UPLOAD_MAX_FILES:\n            print(\n                \"About to upload {} files to S3. This is probably wrong. Please filter files before uploading.\".format(\n                    ANSI.bold(len(files))\n                )\n            )\n            exit(1)\n\n        user, _ = self._api.whoami(token)\n        namespace = self.args.organization if self.args.organization is not None else user\n\n        for filepath, filename in files:\n            print(\n                \"About to upload file {} to S3 under filename {} and namespace {}\".format(\n                    ANSI.bold(filepath), ANSI.bold(filename), ANSI.bold(namespace)\n                )\n            )\n\n        choice = input(\"Proceed? [Y/n] \").lower()\n        if not (choice == \"\" or choice == \"y\" or choice == \"yes\"):\n            print(\"Abort\")\n            exit()\n        print(ANSI.bold(\"Uploading... This might take a while if files are large\"))\n        for filepath, filename in files:\n            try:\n                access_url = self._api.presign_and_upload(\n                    token=token, filename=filename, filepath=filepath, organization=self.args.organization\n                )\n            except HTTPError as e:\n                print(e)\n                print(ANSI.red(e.response.text))\n                exit(1)\n            print(\"Your file now lives at:\")\n            print(access_url)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ALBERT model configuration \"\"\"\n\nfrom .configuration_utils import PretrainedConfig\n\n\nALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-config.json\",\n    \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-config.json\",\n    \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-config.json\",\n    \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-config.json\",\n    \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-config.json\",\n    \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-config.json\",\n    \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-config.json\",\n    \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-config.json\",\n}\n\n\nclass AlbertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers1 import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"albert\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Config class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(\n    (key, value)\n    for pretrained_map in [\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        BART_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ]\n    for key, value, in pretrained_map.items()\n)\n\n\nCONFIG_MAPPING = OrderedDict(\n    [\n        (\"t5\", T5Config,),\n        (\"distilbert\", 
DistilBertConfig,),\n        (\"albert\", AlbertConfig,),\n        (\"camembert\", CamembertConfig,),\n        (\"xlm-roberta\", XLMRobertaConfig,),\n        (\"marian\", MarianConfig,),\n        (\"bart\", BartConfig,),\n        (\"reformer\", ReformerConfig,),\n        (\"longformer\", LongformerConfig,),\n        (\"roberta\", RobertaConfig,),\n        (\"flaubert\", FlaubertConfig,),\n        (\"bert\", BertConfig,),\n        (\"openai-gpt\", OpenAIGPTConfig,),\n        (\"gpt2\", GPT2Config,),\n        (\"transfo-xl\", TransfoXLConfig,),\n        (\"xlnet\", XLNetConfig,),\n        (\"xlm\", XLMConfig,),\n        (\"ctrl\", CTRLConfig,),\n        (\"electra\", ElectraConfig,),\n        (\"encoder-decoder\", EncoderDecoderConfig,),\n    ]\n)\n\n\nclass AutoConfig:\n    r\"\"\"\n        :class:`~transformers1.AutoConfig` is a generic configuration class\n        that will be instantiated as one of the configuration classes of the library\n        when created with the :func:`~transformers1.AutoConfig.from_pretrained` class method.\n\n        The :func:`~transformers1.AutoConfig.from_pretrained` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string.\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoConfig is designed to be instantiated \"\n            \"using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def for_model(cls, model_type: str, *args, **kwargs):\n        if model_type in CONFIG_MAPPING:\n            config_class = CONFIG_MAPPING[model_type]\n            return config_class(*args, **kwargs)\n        raise ValueError(\n            \"Unrecognized model identifier: {}. 
Should contain one of {}\".format(\n                model_type, \", \".join(CONFIG_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiates one of the configuration classes of the library\n        from a pre-trained model configuration.\n\n        The configuration class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Config` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertConfig` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertConfig` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertConfig` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaConfig` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerConfig` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaConfig` (RoBERTa model)\n            - `reformer`: :class:`~transformers1.ReformerConfig` (Reformer model)\n            - `bert`: :class:`~transformers1.BertConfig` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTConfig` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Config` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLConfig` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetConfig` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMConfig` (XLM model)\n            - `ctrl` : :class:`~transformers1.CTRLConfig` (CTRL model)\n            - `flaubert` : :class:`~transformers1.FlaubertConfig` (Flaubert model)\n            - `electra` : :class:`~transformers1.ElectraConfig` (ELECTRA model)\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                Is either: \\\n                    - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.\n                    - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                    - a path to a `directory` containing a configuration file saved using the :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                    - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.\n\n            cache_dir (:obj:`string`, optional, defaults to `None`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download (:obj:`boolean`, optional, defaults to `False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download (:obj:`boolean`, optional, defaults to `False`):\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n\n            proxies (:obj:`Dict[str, str]`, optional, defaults to `None`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`.\n                The proxies are used on each request. See `the requests documentation <https://requests.readthedocs.io/en/master/user/advanced/#proxies>`__ for usage.\n\n            return_unused_kwargs (:obj:`boolean`, optional, defaults to `False`):\n                - If False, then this function returns just the final configuration object.\n                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.\n\n            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): key/value pairs with which to update the configuration object after loading.\n                - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n\n        Examples::\n\n            config = AutoConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')\n            config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)\n\n        if \"model_type\" in config_dict:\n            config_class = CONFIG_MAPPING[config_dict[\"model_type\"]]\n            return config_class.from_dict(config_dict, **kwargs)\n        else:\n            # Fallback: use pattern matching on the string.\n            for pattern, config_class in CONFIG_MAPPING.items():\n                if pattern in pretrained_model_name_or_path:\n                    return config_class.from_dict(config_dict, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized model in {}. \"\n            \"Should have a `model_type` key in its config.json, or contain one of the following strings \"\n            \"in its name: {}\".format(pretrained_model_name_or_path, \", \".join(CONFIG_MAPPING.keys()))\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BART configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"facebook/bart-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json\",\n    \"facebook/bart-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json\",\n    \"facebook/bart-large-cnn\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json\",\n    \"facebook/bart-large-xsum\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/config.json\",\n    \"facebook/mbart-large-en-ro\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/config.json\",\n}\n\n\nclass BartConfig(PretrainedConfig):\n    r\"\"\"\n        Configuration class for Bart. Parameters are renamed from the fairseq implementation\n    \"\"\"\n    model_type = \"bart\"\n\n    def __init__(\n        self,\n        activation_dropout=0.0,\n        activation_function=\"gelu\",\n        vocab_size=50265,\n        d_model=1024,\n        encoder_ffn_dim=4096,\n        encoder_layers=12,\n        encoder_attention_heads=16,\n        decoder_ffn_dim=4096,\n        decoder_layers=12,\n        decoder_attention_heads=16,\n        encoder_layerdrop=0.0,\n        decoder_layerdrop=0.0,\n        attention_dropout=0.0,\n        dropout=0.1,\n        max_position_embeddings=1024,\n        init_std=0.02,\n        classifier_dropout=0.0,\n        num_labels=3,\n        is_encoder_decoder=True,\n        pad_token_id=1,\n        bos_token_id=0,\n        eos_token_id=2,\n        normalize_before=False,\n        add_final_layer_norm=False,\n        scale_embedding=False,\n        normalize_embedding=True,\n        static_position_embeddings=False,\n        add_bias_logits=False,\n        **common_kwargs\n    ):\n        r\"\"\"\n            :class:`~transformers1.BartConfig` is the configuration class for `BartModel`.\n            Examples:\n                config = BartConfig.from_pretrained('bart-large')\n                model = BartModel(config)\n        \"\"\"\n        if \"hidden_size\" in common_kwargs:\n            raise ValueError(\"hidden size is called d_model\")\n        super().__init__(\n            num_labels=num_labels,\n            pad_token_id=pad_token_id,\n            bos_token_id=bos_token_id,\n            eos_token_id=eos_token_id,\n            is_encoder_decoder=is_encoder_decoder,\n            **common_kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim\n        self.encoder_ffn_dim = encoder_ffn_dim\n        self.encoder_layers = self.num_hidden_layers = encoder_layers\n        self.encoder_attention_heads = 
encoder_attention_heads\n        self.encoder_layerdrop = encoder_layerdrop\n        self.decoder_layerdrop = decoder_layerdrop\n        self.decoder_ffn_dim = decoder_ffn_dim\n        self.decoder_layers = decoder_layers\n        self.decoder_attention_heads = decoder_attention_heads\n        self.max_position_embeddings = max_position_embeddings\n        self.init_std = init_std  # Normal(0, this parameter)\n        self.activation_function = activation_function\n\n        # Params introduced for Mbart\n        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True\n        self.normalize_embedding = normalize_embedding  # True for mbart, False otherwise\n        self.normalize_before = normalize_before  # combo of fairseq's encoder_ and decoder_normalize_before\n        self.add_final_layer_norm = add_final_layer_norm\n\n        # Params introduced for Marian\n        self.add_bias_logits = add_bias_logits\n        self.static_position_embeddings = static_position_embeddings\n\n        # 3 Types of Dropout\n        self.attention_dropout = attention_dropout\n        self.activation_dropout = activation_dropout\n        self.dropout = dropout\n\n        # Classifier stuff\n        self.classif_dropout = classifier_dropout\n\n    @property\n    def num_attention_heads(self) -> int:\n        return self.encoder_attention_heads\n\n    @property\n    def hidden_size(self) -> int:\n        return self.d_model\n\n    def is_valid_mbart(self) -> bool:\n        \"\"\"Is the configuration aligned with the MBART paper.\"\"\"\n        if self.normalize_before and self.add_final_layer_norm and self.scale_embedding:\n            return True\n        if self.normalize_before or self.add_final_layer_norm or self.scale_embedding:\n            logger.info(\"This configuration is a mixture of MBART and BART settings\")\n        return False\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json\",\n    \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json\",\n    \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json\",\n    \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json\",\n    \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json\",\n    \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json\",\n    \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json\",\n    \"bert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json\",\n    \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json\",\n    \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json\",\n    \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json\",\n    \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json\",\n    \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/config.json\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json\",\n    \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/config.json\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 
\"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/config.json\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json\",\n    \"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n}\n\n\nclass BertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.BertModel`.\n        It is used to instantiate an BERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the BERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            hidden_size (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 3072):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.BertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n       
     layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import BertModel, BertConfig\n\n            # Initializing a BERT bert-base-uncased style configuration\n            configuration = BertConfig()\n\n            # Initializing a model from the bert-base-uncased style configuration\n            model = BertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"bert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        hidden_size=768,\n        num_hidden_layers=12,\n        num_attention_heads=12,\n        intermediate_size=3072,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" CamemBERT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-config.json\",\n    \"umberto-commoncrawl-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-commoncrawl-cased-v1/config.json\",\n    \"umberto-wikipedia-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-wikipedia-uncased-v1/config.json\",\n}\n\n\nclass CamembertConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"camembert\"\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Salesforce CTRL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\"ctrl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-config.json\"}\n\n\nclass CTRLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.CTRLModel`.\n        It is used to instantiate an CTRL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 246534):\n                Vocabulary size of the CTRL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 256):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 1280):\n                Dimensionality of the embeddings and hidden states.\n            dff (:obj:`int`, optional, defaults to 8192):\n                Dimensionality of the inner dimension of the FFN.\n            n_layer (:obj:`int`, optional, defaults to 48):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n\n        Example::\n\n            from transformers1 import CTRLModel, CTRLConfig\n\n            # Initializing a CTRL configuration\n            configuration = CTRLConfig()\n\n            # Initializing a model from the configuration\n            model = CTRLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"ctrl\"\n\n    def __init__(\n        self,\n        vocab_size=246534,\n        n_positions=256,\n        n_ctx=256,\n        n_embd=1280,\n        dff=8192,\n        n_layer=48,\n        n_head=16,\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.dff = dff\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = 
summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" DistilBERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nDISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json\",\n    \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json\",\n    \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json\",\n    \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-config.json\",\n    \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json\",\n    \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-config.json\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-finetuned-sst-2-english-config.json\",\n}\n\n\nclass DistilBertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.DistilBertModel`.\n        It is used to instantiate a DistilBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the DistilBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings.\n            n_layers (:obj:`int`, optional, defaults to 6):\n                Number of hidden layers in the Transformer encoder.\n            n_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dim (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_dim (:obj:`int`, optional, defaults to 3072):\n                The size of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            activation (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            qa_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilities used in the question answering model\n                :class:`~transformers1.DistilBertForQuestionAnswering`.\n            seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):\n                The dropout probabilities used in the sequence classification model\n                :class:`~transformers1.DistilBertForSequenceClassification`.\n\n        Example::\n\n            from transformers1 import DistilBertModel, DistilBertConfig\n\n            # Initializing a DistilBERT configuration\n            configuration = DistilBertConfig()\n\n            # Initializing a model from the configuration\n            model = DistilBertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"distilbert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        max_position_embeddings=512,\n        sinusoidal_pos_embds=False,\n        n_layers=6,\n        n_heads=12,\n        dim=768,\n        hidden_dim=4 * 768,\n        dropout=0.1,\n        attention_dropout=0.1,\n        activation=\"gelu\",\n        initializer_range=0.02,\n        qa_dropout=0.1,\n        seq_classif_dropout=0.2,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(**kwargs, pad_token_id=pad_token_id)\n        self.vocab_size = vocab_size\n        self.max_position_embeddings = max_position_embeddings\n        self.sinusoidal_pos_embds = sinusoidal_pos_embds\n        
self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dim = dim\n        self.hidden_dim = hidden_dim\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.activation = activation\n        self.initializer_range = initializer_range\n        self.qa_dropout = qa_dropout\n        self.seq_classif_dropout = seq_classif_dropout\n\n    @property\n    def hidden_size(self):\n        return self.dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_electra.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ELECTRA model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/config.json\",\n    \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/config.json\",\n    \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/config.json\",\n    \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/config.json\",\n    \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/config.json\",\n    \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/config.json\",\n}\n\n\nclass ElectraConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ElectraModel`.\n        It is used to instantiate an ELECTRA model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the ELECTRA model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ElectraModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 4):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.ElectraModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import ElectraModel, ElectraConfig\n\n            # Initializing a ELECTRA electra-base-uncased style configuration\n            configuration = ElectraConfig()\n\n            # Initializing a model from the electra-base-uncased style configuration\n            model = ElectraModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"electra\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        embedding_size=128,\n        hidden_size=256,\n        num_hidden_layers=12,\n        num_attention_heads=4,\n        intermediate_size=1024,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = 
num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport copy\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderConfig(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.\n\n        It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig`\n        and can be used to control the model outputs.\n        See the documentation for :class:`~transformers1.PretrainedConfig` for more information.\n\n        Args:\n            kwargs (`optional`):\n                Remaining dictionary of keyword arguments. Notably:\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the encoder config.\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the decoder config.\n\n        Example::\n\n            from transformers1 import BertConfig, EncoderDecoderConfig, EncoderDecoderModel\n\n            # Initializing a BERT bert-base-uncased style configuration\n            config_encoder = BertConfig()\n            config_decoder = BertConfig()\n\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)\n\n            # Initializing a Bert2Bert model from the bert-base-uncased style configurations\n            model = EncoderDecoderModel(config=config)\n\n            # Accessing the model configuration\n            config_encoder = model.config.encoder\n            config_decoder  = model.config.decoder\n    \"\"\"\n    model_type = \"encoder_decoder\"\n\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        assert (\n            \"encoder\" in kwargs and \"decoder\" in kwargs\n        ), \"Config has to be initialized with encoder and decoder config\"\n        encoder_config = kwargs.pop(\"encoder\")\n        encoder_model_type = encoder_config.pop(\"model_type\")\n        decoder_config = kwargs.pop(\"decoder\")\n        decoder_model_type = decoder_config.pop(\"model_type\")\n\n        from transformers import AutoConfig\n\n        self.encoder = AutoConfig.for_model(encoder_model_type, **encoder_config)\n        self.decoder = AutoConfig.for_model(decoder_model_type, **decoder_config)\n        self.is_encoder_decoder = True\n\n    @classmethod\n    def from_encoder_decoder_configs(\n        cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig\n    ) -> PretrainedConfig:\n        r\"\"\"\n        
Instantiate a :class:`~transformers1.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.\n\n        Returns:\n            :class:`EncoderDecoderConfig`: An instance of a configuration object\n        \"\"\"\n        return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict())\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary. Override the default `to_dict()` from `PretrainedConfig`.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        output[\"encoder\"] = self.encoder.to_dict()\n        output[\"decoder\"] = self.decoder.to_dict()\n        output[\"model_type\"] = self.__class__.model_type\n        return output\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Flaubert configuration, based on XLM. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm import XLMConfig\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/config.json\",\n    \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/config.json\",\n    \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/config.json\",\n    \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/config.json\",\n}\n\n\nclass FlaubertConfig(XLMConfig):\n    \"\"\"\n        Configuration class to store the configuration of a `FlaubertModel`.\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Whether to apply the layer normalization before or after the feed forward layer following the\n                attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)\n            layerdrop (:obj:`float`, `optional`, defaults to 0.0):\n                Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand\n                with Structured Dropout. ICLR 2020)\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the Flaubert model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.FlaubertModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`int`, optional, defaults to 50257):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n    \"\"\"\n\n    model_type = \"flaubert\"\n\n    def __init__(self, layerdrop=0.0, pre_norm=False, pad_token_id=2, bos_token_id=0, **kwargs):\n        \"\"\"Constructs FlaubertConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.layerdrop = layerdrop\n        self.pre_norm = pre_norm\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT-2 configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json\",\n    \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json\",\n    \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json\",\n    \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-config.json\",\n    \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json\",\n}\n\n\nclass GPT2Config(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.GPT2Model`.\n        It is used to instantiate an GPT-2 model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 50257):\n                Vocabulary size of the GPT-2 model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.GPT2Model`.\n            n_positions (:obj:`int`, optional, defaults to 1024):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            activation_function (:obj:`str`, optional, defaults to 'gelu'):\n                Activation function selected in the list [\"relu\", \"swish\", \"gelu\", \"tanh\", \"gelu_new\"].\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 16):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import GPT2Model, GPT2Config\n\n            # Initializing a GPT2 configuration\n            configuration = GPT2Config()\n\n            # Initializing a model from the configuration\n            model = GPT2Model(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"gpt2\"\n\n    def __init__(\n        self,\n        vocab_size=50257,\n        n_positions=1024,\n        n_ctx=1024,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        activation_function=\"gelu_new\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        bos_token_id=50256,\n        eos_token_id=50256,\n        **kwargs\n    ):\n        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.activation_function = activation_function\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n        self.bos_token_id = bos_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Longformer configuration \"\"\"\n\nimport logging\nfrom typing import List, Union\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"allenai/longformer-base-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/config.json\",\n    \"allenai/longformer-large-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/config.json\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-finetuned-triviaqa/config.json\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096-extra.pos.embd.only/config.json\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-extra.pos.embd.only/config.json\",\n}\n\n\nclass LongformerConfig(RobertaConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.LongformerModel`.\n        It is used to instantiate an Longformer model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length 4,096.\n\n        The :class:`~transformers1.LongformerConfig` class directly inherits :class:`~transformers1.RobertaConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Args:\n            attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):\n                Size of an attention window around each token. If :obj:`int`, use the same size for all layers.\n                To specify a different window size for each layer, use a :obj:`List[int]` where\n                ``len(attention_window) == num_hidden_layers``.\n\n        Example::\n\n            from transformers1 import LongformerConfig, LongformerModel\n\n            # Initializing a Longformer configuration\n            configuration = LongformerConfig()\n\n            # Initializing a model from the configuration\n            model = LongformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"longformer\"\n\n    def __init__(self, attention_window: Union[List[int], int] = 512, sep_token_id: int = 2, **kwargs):\n        super().__init__(**kwargs)\n        self.attention_window = attention_window\n        self.sep_token_id = sep_token_id\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 The OPUS-NMT Team, Marian team, and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Marian model configuration \"\"\"\n\nfrom .configuration_bart import BartConfig\n\n\nPRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"Helsinki-NLP/opus-mt-en-de\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/config.json\",\n}\n\n\nclass MarianConfig(BartConfig):\n    model_type = \"marian\"\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" MMBT configuration \"\"\"\n\n\nimport logging\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass MMBTConfig(object):\n    \"\"\"Configuration class to store the configuration of a `MMBT Model`.\n\n    Args:\n        config (:obj:`~transformers1.PreTrainedConfig`):\n            Config of the underlying Transformer models. Its values are\n            copied over to use a single config.\n        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):\n            Size of final Linear layer for classification.\n        modal_hidden_size (:obj:`int`, optional, defautls to 2048):\n            Embedding dimension of the non-text modality encoder.\n    \"\"\"\n\n    def __init__(self, config, num_labels=None, modal_hidden_size=2048):\n        self.__dict__ = config.__dict__\n        self.modal_hidden_size = modal_hidden_size\n        if num_labels:\n            self.num_labels = num_labels\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json\"\n}\n\n\nclass OpenAIGPTConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.OpenAIGPTModel`.\n        It is used to instantiate an GPT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 40478):\n                Vocabulary size of the GPT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            afn (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether special tokens should be predicted when the model is has a language modeling head.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import OpenAIGPTConfig, OpenAIGPTModel\n\n            # Initializing a GPT configuration\n            configuration = OpenAIGPTConfig()\n\n            # Initializing a model from the configuration\n            model = OpenAIGPTModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"openai-gpt\"\n\n    def __init__(\n        self,\n        vocab_size=40478,\n        n_positions=512,\n        n_ctx=512,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        afn=\"gelu\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        predict_special_tokens=True,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.afn = afn\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.predict_special_tokens = predict_special_tokens\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Reformer model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json\",\n    \"google/reformer-enwik8\": \"https://cdn.huggingface.co/google/reformer-enwik8/config.json\",\n}\n\n\nclass ReformerConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ReformerModel`.\n        It is used to instantiate an Reformer model according to the specified arguments, defining the model\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            attention_head_size (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the projected key, query and value vectors\n            attn_layers (:obj:`list(str)`, optional, defaults to [\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"]):\n                List of attention layer types in ascending order. It can be chosen between a\n                LSHSelfAttention layer (\"lsh\") and a LocalSelfAttention layer (\"local\").\n                For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .\n                For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .\n            axial_pos_embds (:obj:`bool`, optional, defaults to True):\n                If `True` use axial position embeddings. 
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__\n            axial_norm_std (:obj:`float`, optional, defaluts to 1.0):\n                The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.\n            axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):\n                The position dims of the axial position encodings.\n                During training the product of the position dims has to equal the sequence length.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):\n                The embedding dims of the axial position encodings.\n                The sum of the embedding dims has to equal the hidden size.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            chunk_size_lm_head (:obj:`int`, optional, defaults to 0):\n                The chunk size of the final language model feed forward head layer.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            chunk_size_feed_forward (:obj:`int`, optional, defaults to 0):\n                The chunk size of all feed forward layers in the residual attention blocks.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            eos_token_id (:obj:`int`, optional, defaults to 2):\n                The token id for the <EOS> token.\n            feed_forward_size (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the \"feed_forward\" (i.e., feed-forward) layer in the residual attention block.\n            hash_seed (:obj:`int`, optional, defaults to `None`):\n                Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposed. 
For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in local sensitive hashing scheme.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"relu\"):\n                The non-linear activation function (function or string) in the feed forward layer in the residual attention block.\n                If string, \"gelu\", \"relu\", \"swish\", \"gelu_new\" and \"gelu_fast\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the output hidden states of the residual attention blocks.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            is_decoder (:obj:`bool`, optional, defaults to False):\n                If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.\n                When using the Reformer for causal language modeling, `is_decoder` is set to `True`.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            local_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            local_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LocalSelfAttention layer to itself.\n            local_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.\n            local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LocalSelfAttention.\n            lsh_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LSHSelfAttention. 
Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LSHSelfAttention.\n            max_position_embeddings (:obj:`int`, optional, defaults to 4096):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):\n                Number of buckets, the key query vectors can be \"hashed into\" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.\n                The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.\n                The number of buckets (or the product the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.\n            num_hashes (:obj:`int`, optional, defaults to 1):\n                Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.\n                The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.\n            pad_token_id (:obj:`int`, optional, defaults to 0):\n                The token id for the <PAD> token.\n            vocab_size (:obj:`int`, optional, defaults to 320):\n                Vocabulary size of the Reformer model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ReformerModel`.\n\n        Example::\n\n            from transformers1 import ReformerModel, ReformerConfig\n\n            # Initializing a Reformer configuration\n            configuration = ReformerConfig()\n\n            # Initializing a Reformer model\n            model = ReformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"reformer\"\n\n    def __init__(\n        self,\n        attention_head_size=64,\n        attn_layers=[\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"],\n        axial_norm_std=1.0,\n        axial_pos_embds=True,\n        axial_pos_shape=[64, 64],\n        axial_pos_embds_dim=[64, 192],\n        chunk_size_lm_head=0,\n        chunk_size_feed_forward=0,\n        eos_token_id=2,\n        feed_forward_size=512,\n        hash_seed=None,\n        hidden_act=\"relu\",\n        hidden_dropout_prob=0.05,\n        hidden_size=256,\n        initializer_range=0.02,\n        is_decoder=False,\n        layer_norm_eps=1e-12,\n        local_num_chunks_before=1,\n        local_num_chunks_after=0,\n        local_attention_probs_dropout_prob=0.05,\n        local_attn_chunk_length=64,\n        lsh_attn_chunk_length=64,\n        lsh_attention_probs_dropout_prob=0.0,\n        lsh_num_chunks_before=1,\n        lsh_num_chunks_after=0,\n        max_position_embeddings=4096,\n        num_attention_heads=2,\n        num_buckets=None,\n        num_hashes=1,\n        pad_token_id=0,\n        vocab_size=320,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_decoder=is_decoder, **kwargs)\n\n        self.hash_seed = hash_seed\n        self.vocab_size = vocab_size\n        self.attention_head_size = attention_head_size\n        self.hidden_size = hidden_size\n        self.num_attention_heads = num_attention_heads\n        self.num_hashes = num_hashes\n        self.num_hidden_layers = len(attn_layers)\n        self.num_buckets = tuple(num_buckets) if isinstance(num_buckets, list) else num_buckets\n        self.lsh_attn_chunk_length = lsh_attn_chunk_length\n        self.local_attn_chunk_length = local_attn_chunk_length\n        self.lsh_num_chunks_after = lsh_num_chunks_after\n        self.lsh_num_chunks_before = lsh_num_chunks_before\n        self.local_num_chunks_after = local_num_chunks_after\n        self.local_num_chunks_before = local_num_chunks_before\n        self.hidden_act = hidden_act\n        self.feed_forward_size = feed_forward_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.lsh_attention_probs_dropout_prob = lsh_attention_probs_dropout_prob\n        self.local_attention_probs_dropout_prob = local_attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.axial_pos_embds = axial_pos_embds\n        self.axial_pos_shape = tuple(axial_pos_shape)\n        self.axial_pos_embds_dim = tuple(axial_pos_embds_dim)\n        self.axial_norm_std = axial_norm_std\n        self.chunk_size_lm_head = chunk_size_lm_head\n        self.chunk_size_feed_forward = chunk_size_feed_forward\n        self.attn_layers = attn_layers\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_bert import BertConfig\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json\",\n    \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json\",\n    \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json\",\n    \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json\",\n    \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json\",\n    \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json\",\n}\n\n\nclass RobertaConfig(BertConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.RobertaModel`.\n        It is used to instantiate an RoBERTa model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        The :class:`~transformers1.RobertaConfig` class directly inherits :class:`~transformers1.BertConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Example::\n\n            from transformers1 import RobertaConfig, RobertaModel\n\n            # Initializing a RoBERTa configuration\n            configuration = RobertaConfig()\n\n            # Initializing a model from the configuration\n            model = RobertaModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"roberta\"\n\n    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):\n        \"\"\"Constructs RobertaConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_t5.py",
    "content": "# coding=utf-8\n# Copyright 2010, The T5 Authors and HuggingFace Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" T5 model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nT5_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-config.json\",\n    \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-config.json\",\n    \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-config.json\",\n    \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-config.json\",\n    \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-config.json\",\n}\n\n\nclass T5Config(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.T5Config` is the configuration class to store the configuration of a\n        `T5Model`.\n\n\n        Arguments:\n            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.\n            d_model: Size of the encoder layers and the pooler layer. `d_model` can also accesed via the property `hidden_size`.\n            num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.\n            num_heads: Number of attention heads for each attention layer in\n                the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.\n            intermediate_size: The size of the \"intermediate\" (i.e., feed-forward)\n                layer in the Transformer encoder.\n            hidden_act: The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob: The dropout probabilitiy for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob: The dropout ratio for the attention\n                probabilities.\n            n_positions: The maximum sequence length that this model might\n                ever be used with. Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048). 
`n_positions` can also be accessed via the property `max_position_embeddings'.\n            type_vocab_size: The vocabulary size of the `token_type_ids` passed into\n                `T5Model`.\n            initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).\n            layer_norm_eps: The epsilon used by LayerNorm.\n    \"\"\"\n    model_type = \"t5\"\n\n    def __init__(\n        self,\n        vocab_size=32128,\n        n_positions=512,\n        d_model=512,\n        d_kv=64,\n        d_ff=2048,\n        num_layers=6,\n        num_heads=8,\n        relative_attention_num_buckets=32,\n        dropout_rate=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_factor=1.0,\n        is_encoder_decoder=True,\n        pad_token_id=0,\n        eos_token_id=1,\n        **kwargs\n    ):\n        super().__init__(\n            pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_encoder_decoder=is_encoder_decoder, **kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.n_positions = n_positions\n        self.d_model = d_model\n        self.d_kv = d_kv\n        self.d_ff = d_ff\n        self.num_layers = num_layers\n        self.num_heads = num_heads\n        self.relative_attention_num_buckets = relative_attention_num_buckets\n        self.dropout_rate = dropout_rate\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_factor = initializer_factor\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.num_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.num_layers\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Transformer XL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json\",\n}\n\n\nclass TransfoXLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.TransfoXLModel`.\n        It is used to instantiate a Transformer XL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 267735):\n                Vocabulary size of the Transformer XL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.TransfoXLModel`.\n            cutoffs (:obj:`List[int]`, optional, defaults to :obj:`[20000, 40000, 200000]`):\n                Cutoffs for the adaptive softmax\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the model's hidden states.\n            d_embed (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the embeddings\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_head (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the model's heads.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Inner dimension in FF\n            div_val (:obj:`int`, optional, defaults to 4):\n                Divident value for adapative input and softmax\n            pre_lnorm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Apply LayerNorm to the input instead of the output\n            n_layer (:obj:`int`, optional, defaults to 18):\n                Number of hidden layers in the Transformer encoder.\n            tgt_len (:obj:`int`, optional, defaults to 128):\n                Number of tokens to predict\n            ext_len (:obj:`int`, optional, defaults to 0):\n                Length of the extended context\n            mem_len (:obj:`int`, optional, defaults to 1600):\n                Length of the retained previous heads\n            clamp_len (:obj:`int`, optional, defaults to 1000):\n                use the same pos embeddings after clamp_len\n            same_length (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Use the same attn length for all tokens\n            proj_share_all_but_first (:obj:`boolean`, optional, defaults to :obj:`True`):\n                True to share all but first projs, False not to share.\n            attn_type (:obj:`int`, optional, defaults to 0):\n                Attention type. 
0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.\n            sample_softmax (:obj:`int`, optional, defaults to -1):\n                number of samples in sampled softmax\n            adaptive (:obj:`boolean`, optional, defaults to :obj:`True`):\n                use adaptive softmax\n            tie_weight (:obj:`boolean`, optional, defaults to :obj:`True`):\n                tie the word embedding and softmax weights\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            dropatt (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            init (:obj:`string`, optional, defaults to `normal`):\n                Parameter initializer to use\n            init_range (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by U(-init_range, init_range).\n            proj_init_std (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by N(0, init_std)\n            init_std (:obj:`float`, optional, defaults to 0.02):\n                Parameters initialized by N(0, init_std)\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n\n        Example::\n\n            from transformers1 import TransfoXLConfig, TransfoXLModel\n\n            # Initializing a Transformer XL configuration\n            configuration = TransfoXLConfig()\n\n            # Initializing a model from the configuration\n            model = TransfoXLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"transfo-xl\"\n\n    def __init__(\n        self,\n        vocab_size=267735,\n        cutoffs=[20000, 40000, 200000],\n        d_model=1024,\n        d_embed=1024,\n        n_head=16,\n        d_head=64,\n        d_inner=4096,\n        div_val=4,\n        pre_lnorm=False,\n        n_layer=18,\n        tgt_len=128,\n        ext_len=0,\n        mem_len=1600,\n        clamp_len=1000,\n        same_length=True,\n        proj_share_all_but_first=True,\n        attn_type=0,\n        sample_softmax=-1,\n        adaptive=True,\n        tie_weight=True,\n        dropout=0.1,\n        dropatt=0.0,\n        untie_r=True,\n        init=\"normal\",\n        init_range=0.01,\n        proj_init_std=0.01,\n        init_std=0.02,\n        layer_norm_epsilon=1e-5,\n        eos_token_id=0,\n        **kwargs\n    ):\n        super().__init__(eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.cutoffs = []\n        self.cutoffs.extend(cutoffs)\n        self.tie_weight = tie_weight\n        if proj_share_all_but_first:\n            self.tie_projs = [False] + [True] * len(self.cutoffs)\n        else:\n            self.tie_projs = [False] + [False] * len(self.cutoffs)\n        self.d_model = d_model\n        self.d_embed = d_embed\n        self.d_head = d_head\n        self.d_inner = d_inner\n        self.div_val = div_val\n        self.pre_lnorm = pre_lnorm\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.tgt_len = tgt_len\n        self.ext_len = ext_len\n        self.mem_len = mem_len\n        self.same_length 
= same_length\n        self.attn_type = attn_type\n        self.clamp_len = clamp_len\n        self.sample_softmax = sample_softmax\n        self.adaptive = adaptive\n        self.dropout = dropout\n        self.dropatt = dropatt\n        self.untie_r = untie_r\n        self.init = init\n        self.init_range = init_range\n        self.proj_init_std = proj_init_std\n        self.init_std = init_std\n        self.layer_norm_epsilon = layer_norm_epsilon\n\n    @property\n    def max_position_embeddings(self):\n        return self.tgt_len + self.ext_len + self.mem_len\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\nfrom typing import Dict, Tuple\n\nfrom .file_utils import CONFIG_NAME, cached_path, hf_bucket_url, is_remote_url\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PretrainedConfig(object):\n    r\"\"\" Base class for all configuration classes.\n        Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.\n\n        Note:\n            A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does **not** load the model weights.\n            It only affects the model's configuration.\n\n        Class attributes (overridden by derived classes):\n            - ``model_type``: a string that identifies the model type, that we serialize into the JSON file, and that we use to recreate the correct object in :class:`~transformers1.AutoConfig`.\n\n        Args:\n            finetuning_task (:obj:`string` or :obj:`None`, `optional`, defaults to :obj:`None`):\n                Name of the task used to fine-tune the model. 
This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.\n            num_labels (:obj:`int`, `optional`, defaults to `2`):\n                Number of classes to use when the model is a classification model (sequences/tokens)\n            output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Should the model returns attentions weights.\n            output_hidden_states (:obj:`string`, `optional`, defaults to :obj:`False`):\n                Should the model returns all hidden-states.\n            torchscript (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Is the model used with Torchscript (for PyTorch models).\n    \"\"\"\n    model_type: str = \"\"\n\n    def __init__(self, **kwargs):\n        # Attributes with defaults\n        self.output_attentions = kwargs.pop(\"output_attentions\", False)\n        self.output_hidden_states = kwargs.pop(\"output_hidden_states\", False)\n        self.use_cache = kwargs.pop(\"use_cache\", True)  # Not used by all models\n        self.torchscript = kwargs.pop(\"torchscript\", False)  # Only used by PyTorch models\n        self.use_bfloat16 = kwargs.pop(\"use_bfloat16\", False)\n        self.pruned_heads = kwargs.pop(\"pruned_heads\", {})\n\n        # Is decoder is used in encoder-decoder models to differentiate encoder from decoder\n        self.is_encoder_decoder = kwargs.pop(\"is_encoder_decoder\", False)\n        self.is_decoder = kwargs.pop(\"is_decoder\", False)\n\n        # Parameters for sequence generation\n        self.max_length = kwargs.pop(\"max_length\", 20)\n        self.min_length = kwargs.pop(\"min_length\", 0)\n        self.do_sample = kwargs.pop(\"do_sample\", False)\n        self.early_stopping = kwargs.pop(\"early_stopping\", False)\n        self.num_beams = kwargs.pop(\"num_beams\", 1)\n        self.temperature = kwargs.pop(\"temperature\", 1.0)\n        self.top_k = kwargs.pop(\"top_k\", 50)\n        self.top_p = kwargs.pop(\"top_p\", 1.0)\n        self.repetition_penalty = kwargs.pop(\"repetition_penalty\", 1.0)\n        self.length_penalty = kwargs.pop(\"length_penalty\", 1.0)\n        self.no_repeat_ngram_size = kwargs.pop(\"no_repeat_ngram_size\", 0)\n        self.bad_words_ids = kwargs.pop(\"bad_words_ids\", None)\n        self.num_return_sequences = kwargs.pop(\"num_return_sequences\", 1)\n\n        # Fine-tuning task arguments\n        self.architectures = kwargs.pop(\"architectures\", None)\n        self.finetuning_task = kwargs.pop(\"finetuning_task\", None)\n        self.id2label = kwargs.pop(\"id2label\", None)\n        self.label2id = kwargs.pop(\"label2id\", None)\n        if self.id2label is not None:\n            kwargs.pop(\"num_labels\", None)\n            self.id2label = dict((int(key), value) for key, value in self.id2label.items())\n            # Keys are always strings in JSON so convert ids to int here.\n        else:\n            self.num_labels = kwargs.pop(\"num_labels\", 2)\n\n        # Tokenizer arguments TODO: eventually tokenizer and models should share the same config\n        self.prefix = kwargs.pop(\"prefix\", None)\n        self.bos_token_id = kwargs.pop(\"bos_token_id\", None)\n        self.pad_token_id = kwargs.pop(\"pad_token_id\", None)\n        self.eos_token_id = kwargs.pop(\"eos_token_id\", None)\n        self.decoder_start_token_id = kwargs.pop(\"decoder_start_token_id\", None)\n\n        # task specific arguments\n        self.task_specific_params = kwargs.pop(\"task_specific_params\", None)\n\n        # 
TPU arguments\n        self.xla_device = kwargs.pop(\"xla_device\", None)\n\n        # Additional attributes without default values\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    @property\n    def num_labels(self):\n        return len(self.id2label)\n\n    @num_labels.setter\n    def num_labels(self, num_labels):\n        self.id2label = {i: \"LABEL_{}\".format(i) for i in range(num_labels)}\n        self.label2id = dict(zip(self.id2label.values(), self.id2label.keys()))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save a configuration object to the directory `save_directory`, so that it\n        can be re-loaded using the :func:`~transformers1.PretrainedConfig.from_pretrained` class method.\n\n        Args:\n            save_directory (:obj:`string`):\n                Directory where the configuration JSON file will be saved.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_config_file = os.path.join(save_directory, CONFIG_NAME)\n\n        self.to_json_file(output_config_file, use_diff=True)\n        logger.info(\"Configuration saved in {}\".format(output_config_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs) -> \"PretrainedConfig\":\n        r\"\"\"\n\n        Instantiate a :class:`~transformers1.PretrainedConfig` (or a derived class) from a pre-trained model configuration.\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                either:\n                  - a string with the `shortcut name` of a pre-trained model configuration to load from cache or\n                    download, e.g.: ``bert-base-uncased``.\n                  - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to\n                    our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                  - a path to a `directory` containing a configuration file saved using the\n                    :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                  - a path or url to a saved configuration JSON `file`, e.g.:\n                    ``./my_model_directory/configuration.json``.\n            cache_dir (:obj:`string`, `optional`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            kwargs (:obj:`Dict[str, any]`, `optional`):\n                The values in kwargs of any keys which are configuration attributes will be used to override the loaded\n                values. 
Behavior concerning key/value pairs whose keys are *not* configuration attributes is\n                controlled by the `return_unused_kwargs` keyword parameter.\n            force_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n            proxies (:obj:`Dict`, `optional`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.:\n                :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.`\n                The proxies are used on each request.\n            return_unused_kwargs: (`optional`) bool:\n                If False, then this function returns just the final configuration object.\n                If True, then this functions returns a :obj:`Tuple(config, unused_kwargs)` where `unused_kwargs` is a\n                dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part\n                of kwargs which has not been used to update `config` and is otherwise ignored.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        Examples::\n\n            # We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a\n            # derived class: BertConfig\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. 
config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')\n            config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)\n        return cls.from_dict(config_dict, **kwargs)\n\n    @classmethod\n    def get_config_dict(cls, pretrained_model_name_or_path: str, **kwargs) -> Tuple[Dict, Dict]:\n        \"\"\"\n        From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used\n        for instantiating a Config using `from_dict`.\n\n        Parameters:\n            pretrained_model_name_or_path (:obj:`string`):\n                The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.\n\n        Returns:\n            :obj:`Tuple[Dict, Dict]`: The dictionary that will be used to instantiate the configuration object.\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        if os.path.isdir(pretrained_model_name_or_path):\n            config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            config_file = pretrained_model_name_or_path\n        else:\n            config_file = hf_bucket_url(pretrained_model_name_or_path, filename=CONFIG_NAME, use_cdn=False)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_config_file = cached_path(\n                config_file,\n                cache_dir=cache_dir,\n                force_download=force_download,\n                proxies=proxies,\n                resume_download=resume_download,\n                local_files_only=local_files_only,\n            )\n            # Load config dict\n            if resolved_config_file is None:\n                raise EnvironmentError\n            config_dict = cls._dict_from_json_file(resolved_config_file)\n\n        except EnvironmentError:\n            msg = (\n                f\"Can't load config for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\\n\\n\"\n            )\n            raise EnvironmentError(msg)\n\n        except json.JSONDecodeError:\n            msg = (\n                \"Couldn't reach server at '{}' to download configuration file or \"\n                \"configuration file is not a valid JSON file. 
\"\n                \"Please check network or file content here: {}.\".format(config_file, resolved_config_file)\n            )\n            raise EnvironmentError(msg)\n\n        if resolved_config_file == config_file:\n            logger.info(\"loading configuration file {}\".format(config_file))\n        else:\n            logger.info(\"loading configuration file {} from cache at {}\".format(config_file, resolved_config_file))\n\n        return config_dict, kwargs\n\n    @classmethod\n    def from_dict(cls, config_dict: Dict, **kwargs) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from a Python dictionary of parameters.\n\n        Args:\n            config_dict (:obj:`Dict[str, any]`):\n                Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved\n                from a pre-trained checkpoint by leveraging the :func:`~transformers1.PretrainedConfig.get_config_dict`\n                method.\n            kwargs (:obj:`Dict[str, any]`):\n                Additional parameters from which to initialize the configuration object.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n        \"\"\"\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        config = cls(**config_dict)\n\n        if hasattr(config, \"pruned_heads\"):\n            config.pruned_heads = dict((int(key), value) for key, value in config.pruned_heads.items())\n\n        # Update config with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(config, key):\n                setattr(config, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model config %s\", str(config))\n        if return_unused_kwargs:\n            return config, kwargs\n        else:\n            return config\n\n    @classmethod\n    def from_json_file(cls, json_file: str) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from the path to a json file of parameters.\n\n        Args:\n            json_file (:obj:`string`):\n                Path to the JSON file containing the parameters.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        \"\"\"\n        config_dict = cls._dict_from_json_file(json_file)\n        return cls(**config_dict)\n\n    @classmethod\n    def _dict_from_json_file(cls, json_file: str):\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        return json.loads(text)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return \"{} {}\".format(self.__class__.__name__, self.to_json_string())\n\n    def to_diff_dict(self):\n        \"\"\"\n        Removes all attributes from config which correspond to the default\n        config attributes for better readability and serializes to a Python\n        dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        config_dict = self.to_dict()\n\n        # get the default config dict\n        default_config_dict = PretrainedConfig().to_dict()\n\n        serializable_config_dict = {}\n\n        # only serialize values that differ from the default config\n        for key, value in 
config_dict.items():\n            if key not in default_config_dict or value != default_config_dict[key]:\n                serializable_config_dict[key] = value\n\n        return serializable_config_dict\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        if hasattr(self.__class__, \"model_type\"):\n            output[\"model_type\"] = self.__class__.model_type\n        return output\n\n    def to_json_string(self, use_diff=True):\n        \"\"\"\n        Serializes this instance to a JSON string.\n\n        Args:\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON string.\n\n        Returns:\n            :obj:`string`: String containing all the attributes that make up this configuration instance in JSON format.\n        \"\"\"\n        if use_diff is True:\n            config_dict = self.to_diff_dict()\n        else:\n            config_dict = self.to_dict()\n        return json.dumps(config_dict, indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path, use_diff=True):\n        \"\"\"\n        Save this instance to a json file.\n\n        Args:\n            json_file_path (:obj:`string`):\n                Path to the JSON file in which this configuration instance's parameters will be saved.\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON file.\n        \"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(self.to_json_string(use_diff=use_diff))\n\n    def update(self, config_dict: Dict):\n        \"\"\"\n        Updates attributes of this class\n        with attributes from `config_dict`.\n\n        Args:\n            :obj:`Dict[str, any]`: Dictionary of attributes that shall be updated for this class.\n        \"\"\"\n        for key, value in config_dict.items():\n            setattr(self, key, value)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-config.json\",\n    \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-config.json\",\n    \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-config.json\",\n    \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-config.json\",\n    \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-config.json\",\n    \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-config.json\",\n    \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-config.json\",\n    \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-config.json\",\n    \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-config.json\",\n    \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-config.json\",\n}\n\n\nclass XLMConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the XLM model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLMModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`int`, optional, defaults to 50257):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n\n        Example::\n\n            from transformers1 import XLMConfig, XLMModel\n\n            # Initializing a XLM configuration\n            configuration = XLMConfig()\n\n            # Initializing a model from the configuration\n            model = XLMModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlm\"\n\n    def __init__(\n        self,\n        vocab_size=30145,\n        emb_dim=2048,\n        n_layers=12,\n        n_heads=16,\n        dropout=0.1,\n        attention_dropout=0.1,\n        gelu_activation=True,\n        sinusoidal_embeddings=False,\n        causal=False,\n        asm=False,\n        n_langs=1,\n        use_lang_emb=True,\n        max_position_embeddings=512,\n        embed_init_std=2048 ** -0.5,\n        layer_norm_eps=1e-12,\n        init_std=0.02,\n        bos_index=0,\n        eos_index=1,\n        pad_index=2,\n        unk_index=3,\n        mask_index=5,\n        is_encoder=True,\n        summary_type=\"first\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        mask_token_id=0,\n        lang_id=0,\n        pad_token_id=2,\n        bos_token_id=0,\n        **kwargs\n    ):\n        \"\"\"Constructs XLMConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.emb_dim = emb_dim\n        self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.gelu_activation = gelu_activation\n        self.sinusoidal_embeddings = sinusoidal_embeddings\n        self.causal = causal\n        self.asm = asm\n        self.n_langs = n_langs\n        self.use_lang_emb = use_lang_emb\n        self.layer_norm_eps = layer_norm_eps\n        self.bos_index = bos_index\n        self.eos_index = eos_index\n        self.pad_index = pad_index\n        self.unk_index = unk_index\n        self.mask_index = mask_index\n        self.is_encoder = is_encoder\n        self.max_position_embeddings = max_position_embeddings\n        self.embed_init_std = embed_init_std\n        self.init_std = init_std\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_proj_to_labels = summary_proj_to_labels\n        self.summary_first_dropout = summary_first_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n        
self.mask_token_id = mask_token_id\n        self.lang_id = lang_id\n\n        if \"n_words\" in kwargs:\n            self.n_words = kwargs[\"n_words\"]\n\n    @property\n    def n_words(self):  # For backward compatibility\n        return self.vocab_size\n\n    @n_words.setter\n    def n_words(self, value):  # For backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.emb_dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM-RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-config.json\",\n    \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-config.json\",\n}\n\n\nclass XLMRobertaConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"xlm-roberta\"\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/configuration_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLNet configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json\",\n    \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-config.json\",\n}\n\n\nclass XLNetConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLNetModel`.\n        It is used to instantiate an XLNet model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 32000):\n                Vocabulary size of the XLNet model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLNetModel`.\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 24):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            ff_activation (:obj:`string`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\" and \"swish\" are supported.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            attn_type (:obj:`string`, optional, defaults to \"bi\"):\n                The attention type used by the model. 
Set 'bi' for XLNet, 'uni' for Transformer-XL.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            mem_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens to cache. The key/value pairs that have already been pre-computed\n                in a previous forward pass won't be re-computed. See the\n                `quickstart <https://huggingface.co/transformers/quickstart.html#using-the-past>`__\n                for more information.\n            reuse_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens in the current batch to be cached and reused in the future.\n            bi_data (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use bidirectional input pipeline. Usually set to `True` during\n                pretraining and `False` during finetuning.\n            clamp_len (:obj:`int`, optional, defaults to -1):\n                Clamp all relative distances larger than clamp_len.\n                Setting this attribute to -1 means no clamping.\n            same_length (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use the same attention length for each token.\n            summary_type (:obj:`string`, optional, defaults to \"last\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Is one of the following options:\n                    - 'last' => take the last token hidden state (like XLNet)\n                    - 'first' => take the first token hidden state (like Bert)\n                    - 'mean' => take the mean of all tokens hidden states\n                    - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                    - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_last_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a dropout after the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n\n        Example::\n\n            from transformers1 import XLNetConfig, XLNetModel\n\n            # Initializing a XLNet configuration\n            configuration = XLNetConfig()\n\n            # Initializing a model from the configuration\n            model = XLNetModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlnet\"\n\n    def __init__(\n        self,\n        vocab_size=32000,\n        d_model=1024,\n        n_layer=24,\n        n_head=16,\n        d_inner=4096,\n        ff_activation=\"gelu\",\n        untie_r=True,\n        attn_type=\"bi\",\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        dropout=0.1,\n        mem_len=None,\n        reuse_len=None,\n        bi_data=False,\n        clamp_len=-1,\n        same_length=False,\n        summary_type=\"last\",\n        summary_use_proj=True,\n        summary_activation=\"tanh\",\n        summary_last_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        pad_token_id=5,\n        bos_token_id=1,\n        eos_token_id=2,\n        **kwargs\n    ):\n        \"\"\"Constructs XLNetConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.d_model = d_model\n        self.n_layer = n_layer\n        self.n_head = n_head\n        assert d_model % n_head == 0\n        self.d_head = d_model // n_head\n        self.ff_activation = ff_activation\n        self.d_inner = d_inner\n        self.untie_r = untie_r\n        self.attn_type = attn_type\n\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n\n        self.dropout = dropout\n        self.mem_len = mem_len\n        self.reuse_len = reuse_len\n        self.bi_data = bi_data\n        self.clamp_len = clamp_len\n        self.same_length = same_length\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_last_dropout = summary_last_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n\n        self.bos_token_id = bos_token_id\n        self.pad_token_id = pad_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return -1\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # 
Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_albert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ALBERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = AlbertConfig.from_json_file(albert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = AlbertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_albert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--albert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained ALBERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.albert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_bart_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BART checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nfrom pathlib import Path\n\nimport fairseq\nimport torch\nfrom packaging import version\n\nfrom transformers import (\n    BartConfig,\n    BartForConditionalGeneration,\n    BartForSequenceClassification,\n    BartModel,\n    BartTokenizer,\n)\nfrom transformers.modeling_bart import _make_linear_from_emb\n\n\nFAIRSEQ_MODELS = [\"bart.large\", \"bart.large.mnli\", \"bart.large.cnn\", \"bart_xsum/model.pt\"]\nextra_arch = {\"bart.large\": BartModel, \"bart.large.mnli\": BartForSequenceClassification}\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \" Hello world! cécé herlolip\"\n\nmnli_rename_keys = [\n    (\"model.classification_heads.mnli.dense.weight\", \"classification_head.dense.weight\"),\n    (\"model.classification_heads.mnli.dense.bias\", \"classification_head.dense.bias\"),\n    (\"model.classification_heads.mnli.out_proj.weight\", \"classification_head.out_proj.weight\"),\n    (\"model.classification_heads.mnli.out_proj.bias\", \"classification_head.out_proj.bias\"),\n]\n\n\ndef remove_ignore_keys_(state_dict):\n    ignore_keys = [\n        \"encoder.version\",\n        \"decoder.version\",\n        \"model.encoder.version\",\n        \"model.decoder.version\",\n        \"_float_tensor\",\n    ]\n    for k in ignore_keys:\n        state_dict.pop(k, None)\n\n\ndef rename_key(dct, old, new):\n    val = dct.pop(old)\n    dct[new] = val\n\n\ndef load_xsum_checkpoint(checkpoint_path):\n    \"\"\"Checkpoint path should end in model.pt\"\"\"\n    sd = torch.load(checkpoint_path, map_location=\"cpu\")\n    hub_interface = torch.hub.load(\"pytorch/fairseq\", \"bart.large.cnn\").eval()\n    hub_interface.model.load_state_dict(sd[\"model\"])\n    return hub_interface\n\n\ndef convert_checkpoint_from_disk(checkpoint_path, **config_kwargs):\n    state_dict = torch.load(checkpoint_path, map_location=\"cpu\")[\"model\"]\n    remove_ignore_keys_(state_dict)\n    vocab_size = state_dict[\"encoder.embed_tokens.weight\"].shape[0]\n    state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n    mbart_config = BartConfig(vocab_size=vocab_size, **config_kwargs)\n    model = BartForConditionalGeneration(mbart_config)\n    model.model.load_state_dict(state_dict)\n    if hasattr(model, \"lm_head\"):\n        model.lm_head = _make_linear_from_emb(model.model.shared)\n    return model\n\n\n@torch.no_grad()\ndef convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):\n    \"\"\"\n    Copy/paste/tweak model's weights to our BERT structure.\n    \"\"\"\n    if not os.path.exists(checkpoint_path):\n        bart = torch.hub.load(\"pytorch/fairseq\", checkpoint_path).eval()\n 
   else:\n        bart = load_xsum_checkpoint(checkpoint_path)\n\n    bart.model.upgrade_state_dict(bart.model.state_dict())\n    if hf_checkpoint_name is None:\n        hf_checkpoint_name = checkpoint_path.replace(\".\", \"-\")\n    config = BartConfig.from_pretrained(hf_checkpoint_name)\n    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)\n    tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors=\"pt\").unsqueeze(0)\n    assert torch.eq(tokens, tokens2).all()\n\n    if checkpoint_path == \"bart.large.mnli\":\n        state_dict = bart.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"model.shared.weight\"] = state_dict[\"model.decoder.embed_tokens.weight\"]\n        for src, dest in mnli_rename_keys:\n            rename_key(state_dict, src, dest)\n        model = BartForSequenceClassification(config).eval()\n        model.load_state_dict(state_dict)\n        fairseq_output = bart.predict(\"mnli\", tokens, return_logits=True)\n        new_model_outputs = model(tokens)[0]  # logits\n    else:  # no classification heads to worry about\n        state_dict = bart.model.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n        fairseq_output = bart.extract_features(tokens)\n        if hf_checkpoint_name == \"facebook/bart-large\":\n            model = BartModel(config).eval()\n            model.load_state_dict(state_dict)\n            new_model_outputs = model(tokens).model[0]\n        else:\n            model = BartForConditionalGeneration(config).eval()  # an existing summarization ckpt\n            model.model.load_state_dict(state_dict)\n            if hasattr(model, \"lm_head\"):\n                model.lm_head = _make_linear_from_emb(model.model.shared)\n            new_model_outputs = model.model(tokens)[0]\n\n    # Check results\n    assert fairseq_output.shape == new_model_outputs.shape\n    assert (fairseq_output == new_model_outputs).all().item()\n    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"fairseq_path\", type=str, help=\"bart.large, bart.large.cnn or a path to a model.pt on local filesystem.\"\n    )\n    parser.add_argument(\"pytorch_dump_folder_path\", default=None, type=str, help=\"Path to the output PyTorch model.\")\n    parser.add_argument(\n        \"--hf_config\", default=None, type=str, help=\"Which huggingface architecture to use: bart-large-xsum\"\n    )\n    args = parser.parse_args()\n    convert_bart_checkpoint(args.fairseq_path, args.pytorch_dump_folder_path, hf_checkpoint_name=args.hf_config)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_bert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = BertConfig.from_json_file(bert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = BertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_bert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--bert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.bert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_bert_pytorch_checkpoint_to_original_tf.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\"\"\"Convert Huggingface Pytorch checkpoint to Tensorflow checkpoint.\"\"\"\n\nimport argparse\nimport os\n\nimport numpy as np\nimport tensorflow as tf\nimport torch\n\nfrom transformers import BertModel\n\n\ndef convert_pytorch_checkpoint_to_tf(model: BertModel, ckpt_dir: str, model_name: str):\n\n    \"\"\"\n    :param model:BertModel Pytorch model instance to be converted\n    :param ckpt_dir: Tensorflow model directory\n    :param model_name: model name\n    :return:\n\n    Currently supported HF models:\n        Y BertModel\n        N BertForMaskedLM\n        N BertForPreTraining\n        N BertForMultipleChoice\n        N BertForNextSentencePrediction\n        N BertForSequenceClassification\n        N BertForQuestionAnswering\n    \"\"\"\n\n    tensors_to_transpose = (\"dense.weight\", \"attention.self.query\", \"attention.self.key\", \"attention.self.value\")\n\n    var_map = (\n        (\"layer.\", \"layer_\"),\n        (\"word_embeddings.weight\", \"word_embeddings\"),\n        (\"position_embeddings.weight\", \"position_embeddings\"),\n        (\"token_type_embeddings.weight\", \"token_type_embeddings\"),\n        (\".\", \"/\"),\n        (\"LayerNorm/weight\", \"LayerNorm/gamma\"),\n        (\"LayerNorm/bias\", \"LayerNorm/beta\"),\n        (\"weight\", \"kernel\"),\n    )\n\n    if not os.path.isdir(ckpt_dir):\n        os.makedirs(ckpt_dir)\n\n    state_dict = model.state_dict()\n\n    def to_tf_var_name(name: str):\n        for patt, repl in iter(var_map):\n            name = name.replace(patt, repl)\n        return \"bert/{}\".format(name)\n\n    def create_tf_var(tensor: np.ndarray, name: str, session: tf.Session):\n        tf_dtype = tf.dtypes.as_dtype(tensor.dtype)\n        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())\n        session.run(tf.variables_initializer([tf_var]))\n        session.run(tf_var)\n        return tf_var\n\n    tf.reset_default_graph()\n    with tf.Session() as session:\n        for var_name in state_dict:\n            tf_name = to_tf_var_name(var_name)\n            torch_tensor = state_dict[var_name].numpy()\n            if any([x in var_name for x in tensors_to_transpose]):\n                torch_tensor = torch_tensor.T\n            tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)\n            tf.keras.backend.set_value(tf_var, torch_tensor)\n            tf_weight = session.run(tf_var)\n            print(\"Successfully created {}: {}\".format(tf_name, np.allclose(tf_weight, torch_tensor)))\n\n        saver = tf.train.Saver(tf.trainable_variables())\n        saver.save(session, os.path.join(ckpt_dir, model_name.replace(\"-\", \"_\") + \".ckpt\"))\n\n\ndef main(raw_args=None):\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--model_name\", type=str, required=True, help=\"model name e.g. 
bert-base-uncased\")\n    parser.add_argument(\n        \"--cache_dir\", type=str, default=None, required=False, help=\"Directory containing pytorch model\"\n    )\n    parser.add_argument(\"--pytorch_model_path\", type=str, required=True, help=\"/path/to/<pytorch-model-name>.bin\")\n    parser.add_argument(\"--tf_cache_dir\", type=str, required=True, help=\"Directory in which to save tensorflow model\")\n    args = parser.parse_args(raw_args)\n\n    model = BertModel.from_pretrained(\n        pretrained_model_name_or_path=args.model_name,\n        state_dict=torch.load(args.pytorch_model_path),\n        cache_dir=args.cache_dir,\n    )\n\n    convert_pytorch_checkpoint_to_tf(model=model, ckpt_dir=args.tf_cache_dir, model_name=args.model_name)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_dialogpt_original_pytorch_checkpoint_to_pytorch.py",
    "content": "import argparse\nimport os\n\nimport torch\n\nfrom transformers.file_utils import WEIGHTS_NAME\n\n\nDIALOGPT_MODELS = [\"small\", \"medium\", \"large\"]\n\nOLD_KEY = \"lm_head.decoder.weight\"\nNEW_KEY = \"lm_head.weight\"\n\n\ndef convert_dialogpt_checkpoint(checkpoint_path: str, pytorch_dump_folder_path: str):\n    d = torch.load(checkpoint_path)\n    d[NEW_KEY] = d.pop(OLD_KEY)\n    os.makedirs(pytorch_dump_folder_path, exist_ok=True)\n    torch.save(d, os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--dialogpt_path\", default=\".\", type=str)\n    args = parser.parse_args()\n    for MODEL in DIALOGPT_MODELS:\n        checkpoint_path = os.path.join(args.dialogpt_path, f\"{MODEL}_ft.pkl\")\n        pytorch_dump_folder_path = f\"./DialoGPT-{MODEL}\"\n        convert_dialogpt_checkpoint(\n            checkpoint_path, pytorch_dump_folder_path,\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_electra_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ELECTRA checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining, load_tf_weights_in_electra\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path, discriminator_or_generator):\n    # Initialise PyTorch model\n    config = ElectraConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n\n    if discriminator_or_generator == \"discriminator\":\n        model = ElectraForPreTraining(config)\n    elif discriminator_or_generator == \"generator\":\n        model = ElectraForMaskedLM(config)\n    else:\n        raise ValueError(\"The discriminator_or_generator argument should be either 'discriminator' or 'generator'\")\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_electra(\n        model, config, tf_checkpoint_path, discriminator_or_generator=discriminator_or_generator\n    )\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--discriminator_or_generator\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Whether to export the generator or the discriminator. Should be a string, either 'discriminator' or \"\n        \"'generator'.\",\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path, args.discriminator_or_generator\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_gpt2_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, GPT2Config, GPT2Model, load_tf_weights_in_gpt2\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if gpt2_config_file == \"\":\n        config = GPT2Config()\n    else:\n        config = GPT2Config.from_json_file(gpt2_config_file)\n    model = GPT2Model(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--gpt2_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--gpt2_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_gpt2_checkpoint_to_pytorch(args.gpt2_checkpoint_path, args.gpt2_config_file, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_graph_to_onnx.py",
    "content": "from argparse import ArgumentParser\nfrom os import listdir, makedirs\nfrom os.path import abspath, dirname, exists\nfrom typing import Dict, List, Optional, Tuple\n\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.pipelines import Pipeline, pipeline\nfrom transformers.tokenization_utils import BatchEncoding\n\n\nclass OnnxConverterArgumentParser(ArgumentParser):\n    \"\"\"\n    Wraps all the script arguments supported to export transformers1 models to ONNX IR\n    \"\"\"\n\n    def __init__(self):\n        super(OnnxConverterArgumentParser, self).__init__(\"ONNX Converter\")\n\n        self.add_argument(\"--model\", type=str, required=True, help=\"Model's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--framework\", type=str, choices=[\"pt\", \"tf\"], help=\"Framework for loading the model\")\n        self.add_argument(\"--opset\", type=int, default=11, help=\"ONNX opset to use\")\n        self.add_argument(\"--check-loading\", action=\"store_true\", help=\"Check ONNX is able to load the model\")\n        self.add_argument(\"--use-external-format\", action=\"store_true\", help=\"Allow exporting model >= than 2Gb\")\n        self.add_argument(\"output\")\n\n\ndef ensure_valid_input(model, tokens, input_names):\n    \"\"\"\n    Ensure input are presented in the correct order, without any None\n    Args:\n        model: The model used to forward the input data\n        tokens: BatchEncoding holding the input data\n        input_names: The name of the inputs\n\n    Returns: Tuple\n\n    \"\"\"\n    model_args_name = model.forward.__code__.co_varnames\n\n    ordered_input_names = []\n    model_args = []\n    for arg_name in model_args_name[1:]:  # start at index 1 to skip \"self\" argument\n        if arg_name in input_names:\n            ordered_input_names.append(arg_name)\n            model_args.append(tokens[arg_name])\n        else:\n            break\n\n    return ordered_input_names, tuple(model_args)\n\n\ndef infer_shapes(nlp: Pipeline, framework: str) -> Tuple[List[str], List[str], Dict, BatchEncoding]:\n    def build_shape_dict(tensor, is_input: bool, seq_len: int):\n        if isinstance(tensor, (tuple, list)):\n            return [build_shape_dict(t, is_input, seq_len) for t in tensor]\n\n        else:\n            # Let's assume batch is the first axis with only 1 element (~~ might not be always true ...)\n            axes = {[axis for axis, numel in enumerate(tensor.shape) if numel == 1][0]: \"batch\"}\n            if is_input:\n                if len(tensor.shape) == 2:\n                    axes[1] = \"sequence\"\n                else:\n                    raise ValueError(\"Unable to infer tensor axes ({})\".format(len(tensor.shape)))\n            else:\n                seq_axes = [dim for dim, shape in enumerate(tensor.shape) if shape == seq_len]\n                axes.update({dim: \"sequence\" for dim in seq_axes})\n\n        return axes\n\n    tokens = nlp.tokenizer.encode_plus(\"This is a sample output\", return_tensors=framework)\n    seq_len = tokens.input_ids.shape[-1]\n    outputs = nlp.model(**tokens) if framework == \"pt\" else nlp.model(tokens)\n\n    if not isinstance(outputs, (list, tuple)):\n        outputs = (outputs,)\n\n    # Generate input names & axes\n    input_vars = list(tokens.keys())\n    input_dynamic_axes = {k: build_shape_dict(v, True, seq_len) for k, v in tokens.items()}\n\n    
# flatten potentially grouped outputs (past for gpt2, attentions)\n    outputs_flat = []\n    for output in outputs:\n        if isinstance(output, (tuple, list)):\n            outputs_flat.extend(output)\n        else:\n            outputs_flat.append(output)\n\n    # Generate output names & axes\n    output_names = [\"output_{}\".format(i) for i in range(len(outputs_flat))]\n    output_dynamic_axes = {k: build_shape_dict(v, False, seq_len) for k, v in zip(output_names, outputs_flat)}\n\n    # Create the aggregated axes representation\n    dynamic_axes = dict(input_dynamic_axes, **output_dynamic_axes)\n    return input_vars, output_names, dynamic_axes, tokens\n\n\ndef load_graph_from_args(framework: str, model: str, tokenizer: Optional[str] = None) -> Pipeline:\n    # If no tokenizer provided\n    if tokenizer is None:\n        tokenizer = model\n\n    print(\"Loading pipeline (model: {}, tokenizer: {})\".format(model, tokenizer))\n\n    # Allocate tokenizer and model\n    return pipeline(\"feature-extraction\", model=model, tokenizer=tokenizer, framework=framework)\n\n\ndef convert_pytorch(nlp: Pipeline, opset: int, output: str, use_external_format: bool):\n    if not is_torch_available():\n        raise Exception(\"Cannot convert because PyTorch is not installed. Please install torch first.\")\n\n    import torch\n    from torch.onnx import export\n\n    print(\"PyTorch: {}\".format(torch.__version__))\n\n    with torch.no_grad():\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"pt\")\n        ordered_input_names, model_args = ensure_valid_input(nlp.model, tokens, input_names)\n\n        export(\n            nlp.model,\n            model_args,\n            f=output,\n            input_names=ordered_input_names,\n            output_names=output_names,\n            dynamic_axes=dynamic_axes,\n            do_constant_folding=True,\n            use_external_data_format=use_external_format,\n            enable_onnx_checker=True,\n            opset_version=opset,\n        )\n\n\ndef convert_tensorflow(nlp: Pipeline, opset: int, output: str):\n    if not is_tf_available():\n        raise Exception(\n            \"Cannot convert {} because TF is not installed. Please install torch first.\".format(args.model)\n        )\n\n    print(\"/!\\\\ Please note TensorFlow doesn't support exporting model > 2Gb /!\\\\\")\n\n    try:\n        import tensorflow as tf\n        from keras2onnx import convert_keras, save_model, __version__ as k2ov\n\n        print(\"TensorFlow: {}, keras2onnx: {}\".format(tf.version.VERSION, k2ov))\n\n        # Build\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"tf\")\n\n        # Forward\n        nlp.model.predict(tokens.data)\n        onnx_model = convert_keras(nlp.model, nlp.model.name, target_opset=opset)\n        save_model(onnx_model, output)\n\n    except ImportError as e:\n        raise Exception(\n            \"Cannot import {} required to convert TF model to ONNX. 
Please install {} first.\".format(e.name, e.name)\n        )\n\n\ndef convert(\n    framework: str,\n    model: str,\n    output: str,\n    opset: int,\n    tokenizer: Optional[str] = None,\n    use_external_format: bool = False,\n):\n    print(\"ONNX opset version set to: {}\".format(opset))\n\n    # Load the pipeline\n    nlp = load_graph_from_args(framework, model, tokenizer)\n\n    parent = dirname(output)\n    if not exists(parent):\n        print(\"Creating folder {}\".format(parent))\n        makedirs(parent)\n    elif len(listdir(parent)) > 0:\n        raise Exception(\"Folder {} is not empty, aborting conversion\".format(parent))\n\n    # Export the graph\n    if framework == \"pt\":\n        convert_pytorch(nlp, opset, output, use_external_format)\n    else:\n        convert_tensorflow(nlp, opset, output)\n\n\ndef verify(path: str):\n    from onnxruntime import InferenceSession, SessionOptions\n    from onnxruntime.capi.onnxruntime_pybind11_state import RuntimeException\n\n    print(\"Checking ONNX model loading from: {}\".format(path))\n    try:\n        onnx_options = SessionOptions()\n        _ = InferenceSession(path, onnx_options, providers=[\"CPUExecutionProvider\"])\n        print(\"Model correctly loaded\")\n    except RuntimeException as re:\n        print(\"Error while loading the model: {}\".format(re))\n\n\nif __name__ == \"__main__\":\n    parser = OnnxConverterArgumentParser()\n    args = parser.parse_args()\n\n    # Make sure output is absolute path\n    args.output = abspath(args.output)\n\n    try:\n        # Convert\n        convert(args.framework, args.model, args.output, args.opset, args.tokenizer, args.use_external_format)\n\n        # And verify\n        if args.check_loading:\n            verify(args.output)\n    except Exception as e:\n        print(\"Error while converting the model: {}\".format(e))\n        exit(1)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_longformer_original_pytorch_lightning_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\n\nimport pytorch_lightning as pl\nimport torch\n\nfrom transformers.modeling_longformer import LongformerForQuestionAnswering, LongformerModel\n\n\nclass LightningModel(pl.LightningModule):\n    def __init__(self, model):\n        super().__init__()\n        self.model = model\n        self.num_labels = 2\n        self.qa_outputs = torch.nn.Linear(self.model.config.hidden_size, self.num_labels)\n\n    # implement only because lighning requires to do so\n    def forward(self):\n        pass\n\n\ndef convert_longformer_qa_checkpoint_to_pytorch(\n    longformer_model: str, longformer_question_answering_ckpt_path: str, pytorch_dump_folder_path: str\n):\n\n    # load longformer model from model identifier\n    longformer = LongformerModel.from_pretrained(longformer_model)\n    lightning_model = LightningModel(longformer)\n\n    ckpt = torch.load(longformer_question_answering_ckpt_path, map_location=torch.device(\"cpu\"))\n    lightning_model.load_state_dict(ckpt[\"state_dict\"])\n\n    # init longformer question answering model\n    longformer_for_qa = LongformerForQuestionAnswering.from_pretrained(longformer_model)\n\n    # transfer weights\n    longformer_for_qa.longformer.load_state_dict(lightning_model.model.state_dict())\n    longformer_for_qa.qa_outputs.load_state_dict(lightning_model.qa_outputs.state_dict())\n    longformer_for_qa.eval()\n\n    # save model\n    longformer_for_qa.save_pretrained(pytorch_dump_folder_path)\n\n    print(\"Conversion succesful. Model saved under {}\".format(pytorch_dump_folder_path))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--longformer_model\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"model identifier of longformer. Should be either `longformer-base-4096` or `longformer-large-4096`.\",\n    )\n    parser.add_argument(\n        \"--longformer_question_answering_ckpt_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path the official PyTorch Lighning Checkpoint.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_longformer_qa_checkpoint_to_pytorch(\n        args.longformer_model, args.longformer_question_answering_ckpt_path, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_marian_to_pytorch.py",
    "content": "import argparse\nimport json\nimport os\nimport shutil\nimport warnings\nfrom pathlib import Path\nfrom typing import Dict, List, Union\nfrom zipfile import ZipFile\n\nimport numpy as np\nimport torch\nfrom tqdm import tqdm\n\nfrom transformers import MarianConfig, MarianMTModel, MarianTokenizer\nfrom transformers.hf_api import HfApi\n\n\ndef remove_prefix(text: str, prefix: str):\n    if text.startswith(prefix):\n        return text[len(prefix) :]\n    return text  # or whatever\n\n\ndef convert_encoder_layer(opus_dict, layer_prefix: str, converter: dict):\n    sd = {}\n    for k in opus_dict:\n        if not k.startswith(layer_prefix):\n            continue\n        stripped = remove_prefix(k, layer_prefix)\n        v = opus_dict[k].T  # besides embeddings, everything must be transposed.\n        sd[converter[stripped]] = torch.tensor(v).squeeze()\n    return sd\n\n\ndef load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is_decoder=False):\n    for i, layer in enumerate(layer_lst):\n        layer_tag = f\"decoder_l{i + 1}_\" if is_decoder else f\"encoder_l{i + 1}_\"\n        sd = convert_encoder_layer(opus_state, layer_tag, converter)\n        layer.load_state_dict(sd, strict=True)\n\n\ndef find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:\n    \"\"\"Find models that can accept src_lang as input and return tgt_lang as output.\"\"\"\n    prefix = \"Helsinki-NLP/opus-mt-\"\n    api = HfApi()\n    model_list = api.model_list()\n    model_ids = [x.modelId for x in model_list if x.modelId.startswith(\"Helsinki-NLP\")]\n    src_and_targ = [\n        remove_prefix(m, prefix).lower().split(\"-\") for m in model_ids if \"+\" not in m\n    ]  # + cant be loaded.\n    matching = [f\"{prefix}{a}-{b}\" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]\n    return matching\n\n\ndef add_emb_entries(wemb, final_bias, n_special_tokens=1):\n    vsize, d_model = wemb.shape\n    embs_to_add = np.zeros((n_special_tokens, d_model))\n    new_embs = np.concatenate([wemb, embs_to_add])\n    bias_to_add = np.zeros((n_special_tokens, 1))\n    new_bias = np.concatenate((final_bias, bias_to_add), axis=1)\n    return new_embs, new_bias\n\n\ndef _cast_yaml_str(v):\n    bool_dct = {\"true\": True, \"false\": False}\n    if not isinstance(v, str):\n        return v\n    elif v in bool_dct:\n        return bool_dct[v]\n    try:\n        return int(v)\n    except (TypeError, ValueError):\n        return v\n\n\ndef cast_marian_config(raw_cfg: Dict[str, str]) -> Dict:\n    return {k: _cast_yaml_str(v) for k, v in raw_cfg.items()}\n\n\nCONFIG_KEY = \"special:model.yml\"\n\n\ndef load_config_from_state_dict(opus_dict):\n    import yaml\n\n    cfg_str = \"\".join([chr(x) for x in opus_dict[CONFIG_KEY]])\n    yaml_cfg = yaml.load(cfg_str[:-1], Loader=yaml.BaseLoader)\n    return cast_marian_config(yaml_cfg)\n\n\ndef find_model_file(dest_dir):  # this one better\n    model_files = list(Path(dest_dir).glob(\"*.npz\"))\n    assert len(model_files) == 1, model_files\n    model_file = model_files[0]\n    return model_file\n\n\n# Group Names Logic: change long opus model names to something shorter, like opus-mt-en-ROMANCE\nROM_GROUP = \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\"\nGROUPS = [\n    (\"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\", \"ZH\"),\n    
(ROM_GROUP, \"ROMANCE\"),\n    (\"de+nl+fy+af+da+fo+is+no+nb+nn+sv\", \"NORTH_EU\"),\n    (\"da+fo+is+no+nb+nn+sv\", \"SCANDINAVIA\"),\n    (\"se+sma+smj+smn+sms\", \"SAMI\"),\n    (\"nb_NO+nb+nn_NO+nn+nog+no_nb+no\", \"NORWAY\"),\n    (\"ga+cy+br+gd+kw+gv\", \"CELTIC\"),  # https://en.wikipedia.org/wiki/Insular_Celtic_languages\n]\nGROUP_TO_OPUS_NAME = {\n    \"opus-mt-ZH-de\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de\",\n    \"opus-mt-ZH-fi\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-fi\",\n    \"opus-mt-ZH-sv\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-sv\",\n    \"opus-mt-SCANDINAVIA-SCANDINAVIA\": \"da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-NORTH_EU-NORTH_EU\": \"de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-de-ZH\": \"de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-en_el_es_fi-en_el_es_fi\": \"en+el+es+fi-en+el+es+fi\",\n    \"opus-mt-en-ROMANCE\": \"en-fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\",\n    \"opus-mt-en-CELTIC\": \"en-ga+cy+br+gd+kw+gv\",\n    \"opus-mt-es-NORWAY\": \"es-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-fi_nb_no_nn_ru_sv_en-SAMI\": \"fi+nb+no+nn+ru+sv+en-se+sma+smj+smn+sms\",\n    \"opus-mt-fi-ZH\": \"fi-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-fi-NORWAY\": \"fi-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-ROMANCE-en\": \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la-en\",\n    \"opus-mt-CELTIC-en\": \"ga+cy+br+gd+kw+gv-en\",\n    \"opus-mt-sv-ZH\": \"sv-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-sv-NORWAY\": \"sv-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n}\nOPUS_GITHUB_URL = \"https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/\"\nORG_NAME = \"Helsinki-NLP/\"\n\n\ndef convert_opus_name_to_hf_name(x):\n    for substr, grp_name in GROUPS:\n        x = x.replace(substr, grp_name)\n    return x.replace(\"+\", \"_\")\n\n\ndef convert_hf_name_to_opus_name(hf_model_name):\n    \"\"\"Relies on the assumption that there are no language codes like pt_br in models that are not in GROUP_TO_OPUS_NAME.\"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    if hf_model_name in GROUP_TO_OPUS_NAME:\n        opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]\n    else:\n        opus_w_prefix = hf_model_name.replace(\"_\", \"+\")\n    return remove_prefix(opus_w_prefix, \"opus-mt-\")\n\n\ndef write_model_card(\n    hf_model_name: str,\n    repo_path=\"OPUS-MT-train/models/\",\n    dry_run=False,\n    model_card_dir=Path(\"marian_converted/model_cards/Helsinki-NLP/\"),\n) -> str:\n    \"\"\"Copy the most recent model's readme section from opus, and add metadata.\n    upload command: s3cmd sync --recursive model_card_dir s3://models.huggingface.co/bert/Helsinki-NLP/\n    \"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    opus_name: str = convert_hf_name_to_opus_name(hf_model_name)\n    opus_src, opus_tgt = [x.split(\"+\") for x in opus_name.split(\"-\")]\n    readme_url = 
OPUS_GITHUB_URL + f\"{opus_name}/README.md\"\n    s, t = \",\".join(opus_src), \",\".join(opus_tgt)\n    extra_markdown = f\"### {hf_model_name}\\n\\n* source languages: {s}\\n* target languages: {t}\\n*  OPUS readme: [{opus_name}]({readme_url})\\n\"\n    # combine with opus markdown\n    opus_readme_path = Path(f\"{repo_path}{opus_name}/README.md\")\n    assert opus_readme_path.exists(), opus_readme_path\n    content = opus_readme_path.open().read()\n    content = content.split(\"\\n# \")[-1]  # Get the lowest level 1 header in the README -- the most recent model.\n    content = \"*\".join(content.split(\"*\")[1:])\n    content = extra_markdown + \"\\n* \" + content.replace(\"download\", \"download original weights\")\n    if dry_run:\n        return content\n    # Save string to model_cards/hf_model_name/readme.md\n    model_card_dir.mkdir(exist_ok=True)\n    sub_dir = model_card_dir / hf_model_name\n    sub_dir.mkdir(exist_ok=True)\n    dest = sub_dir / \"README.md\"\n    dest.open(\"w\").write(content)\n    return content\n\n\ndef get_clean_model_id_mapping(multiling_model_ids):\n    return {x: convert_opus_name_to_hf_name(x) for x in multiling_model_ids}\n\n\ndef make_registry(repo_path=\"Opus-MT-train/models\"):\n    if not (Path(repo_path) / \"fr-en\" / \"README.md\").exists():\n        raise ValueError(\n            f\"repo_path:{repo_path} does not exist: \"\n            \"You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling.\"\n        )\n    results = {}\n    for p in Path(repo_path).ls():\n        n_dash = p.name.count(\"-\")\n        if n_dash == 0:\n            continue\n        else:\n            lns = list(open(p / \"README.md\").readlines())\n            results[p.name] = _parse_readme(lns)\n    return [(k, v[\"pre-processing\"], v[\"download\"], v[\"download\"][:-4] + \".test.txt\") for k, v in results.items()]\n\n\ndef convert_all_sentencepiece_models(model_list=None, repo_path=None):\n    \"\"\"Requires 300GB\"\"\"\n    save_dir = Path(\"marian_ckpt\")\n    dest_dir = Path(\"marian_converted\")\n    dest_dir.mkdir(exist_ok=True)\n    if model_list is None:\n        model_list: list = make_registry(repo_path=repo_path)\n    for k, prepro, download, test_set_url in tqdm(model_list):\n        if \"SentencePiece\" not in prepro:  # dont convert BPE models.\n            continue\n        if not os.path.exists(save_dir / k / \"pytorch_model.bin\"):\n            download_and_unzip(download, save_dir / k)\n        pair_name = convert_opus_name_to_hf_name(k)\n        convert(save_dir / k, dest_dir / f\"opus-mt-{pair_name}\")\n\n\ndef lmap(f, x) -> List:\n    return list(map(f, x))\n\n\ndef fetch_test_set(test_set_url):\n    import wget\n\n    fname = wget.download(test_set_url, \"opus_test.txt\")\n    lns = Path(fname).open().readlines()\n    src = lmap(str.strip, lns[::4])\n    gold = lmap(str.strip, lns[1::4])\n    mar_model = lmap(str.strip, lns[2::4])\n    assert len(gold) == len(mar_model) == len(src)\n    os.remove(fname)\n    return src, mar_model, gold\n\n\ndef convert_whole_dir(path=Path(\"marian_ckpt/\")):\n    for subdir in tqdm(list(path.ls())):\n        dest_dir = f\"marian_converted/{subdir.name}\"\n        if (dest_dir / \"pytorch_model.bin\").exists():\n            continue\n        convert(source_dir, dest_dir)\n\n\ndef _parse_readme(lns):\n    \"\"\"Get link and metadata from opus model card equivalent.\"\"\"\n    subres = {}\n    for ln in [x.strip() for x in lns]:\n        if not ln.startswith(\"*\"):\n            continue\n    
    ln = ln[1:].strip()\n\n        for k in [\"download\", \"dataset\", \"models\", \"model\", \"pre-processing\"]:\n            if ln.startswith(k):\n                break\n        else:\n            continue\n        if k in [\"dataset\", \"model\", \"pre-processing\"]:\n            splat = ln.split(\":\")\n            _, v = splat\n            subres[k] = v\n        elif k == \"download\":\n            v = ln.split(\"(\")[-1][:-1]\n            subres[k] = v\n    return subres\n\n\ndef save_tokenizer_config(dest_dir: Path):\n    dname = dest_dir.name.split(\"-\")\n    dct = dict(target_lang=dname[-1], source_lang=\"-\".join(dname[:-1]))\n    save_json(dct, dest_dir / \"tokenizer_config.json\")\n\n\ndef add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):\n    start = max(vocab.values()) + 1\n    added = 0\n    for tok in special_tokens:\n        if tok in vocab:\n            continue\n        vocab[tok] = start + added\n        added += 1\n    return added\n\n\ndef find_vocab_file(model_dir):\n    return list(model_dir.glob(\"*vocab.yml\"))[0]\n\n\ndef add_special_tokens_to_vocab(model_dir: Path) -> None:\n    vocab = load_yaml(find_vocab_file(model_dir))\n    vocab = {k: int(v) for k, v in vocab.items()}\n    num_added = add_to_vocab_(vocab, [\"<pad>\"])\n    print(f\"added {num_added} tokens to vocab\")\n    save_json(vocab, model_dir / \"vocab.json\")\n    save_tokenizer_config(model_dir)\n\n\ndef save_tokenizer(self, save_directory):\n    dest = Path(save_directory)\n    src_path = Path(self.init_kwargs[\"source_spm\"])\n\n    for dest_name in {\"source.spm\", \"target.spm\", \"tokenizer_config.json\"}:\n        shutil.copyfile(src_path.parent / dest_name, dest / dest_name)\n    save_json(self.encoder, dest / \"vocab.json\")\n\n\ndef check_equal(marian_cfg, k1, k2):\n    v1, v2 = marian_cfg[k1], marian_cfg[k2]\n    assert v1 == v2, f\"hparams {k1},{k2} differ: {v1} != {v2}\"\n\n\ndef check_marian_cfg_assumptions(marian_cfg):\n    assumed_settings = {\n        \"tied-embeddings-all\": True,\n        \"layer-normalization\": False,\n        \"right-left\": False,\n        \"transformer-ffn-depth\": 2,\n        \"transformer-aan-depth\": 2,\n        \"transformer-no-projection\": False,\n        \"transformer-postprocess-emb\": \"d\",\n        \"transformer-postprocess\": \"dan\",  # Dropout, add, normalize\n        \"transformer-preprocess\": \"\",\n        \"type\": \"transformer\",\n        \"ulr-dim-emb\": 0,\n        \"dec-cell-base-depth\": 2,\n        \"dec-cell-high-depth\": 1,\n        \"transformer-aan-nogate\": False,\n    }\n    for k, v in assumed_settings.items():\n        actual = marian_cfg[k]\n        assert actual == v, f\"Unexpected config value for {k} expected {v} got {actual}\"\n    check_equal(marian_cfg, \"transformer-ffn-activation\", \"transformer-aan-activation\")\n    check_equal(marian_cfg, \"transformer-ffn-depth\", \"transformer-aan-depth\")\n    check_equal(marian_cfg, \"transformer-dim-ffn\", \"transformer-dim-aan\")\n\n\nBIAS_KEY = \"decoder_ff_logit_out_b\"\nBART_CONVERTER = {  # for each encoder and decoder layer\n    \"self_Wq\": \"self_attn.q_proj.weight\",\n    \"self_Wk\": \"self_attn.k_proj.weight\",\n    \"self_Wv\": \"self_attn.v_proj.weight\",\n    \"self_Wo\": \"self_attn.out_proj.weight\",\n    \"self_bq\": \"self_attn.q_proj.bias\",\n    \"self_bk\": \"self_attn.k_proj.bias\",\n    \"self_bv\": \"self_attn.v_proj.bias\",\n    \"self_bo\": \"self_attn.out_proj.bias\",\n    \"self_Wo_ln_scale\": \"self_attn_layer_norm.weight\",\n  
  \"self_Wo_ln_bias\": \"self_attn_layer_norm.bias\",\n    \"ffn_W1\": \"fc1.weight\",\n    \"ffn_b1\": \"fc1.bias\",\n    \"ffn_W2\": \"fc2.weight\",\n    \"ffn_b2\": \"fc2.bias\",\n    \"ffn_ffn_ln_scale\": \"final_layer_norm.weight\",\n    \"ffn_ffn_ln_bias\": \"final_layer_norm.bias\",\n    # Decoder Cross Attention\n    \"context_Wk\": \"encoder_attn.k_proj.weight\",\n    \"context_Wo\": \"encoder_attn.out_proj.weight\",\n    \"context_Wq\": \"encoder_attn.q_proj.weight\",\n    \"context_Wv\": \"encoder_attn.v_proj.weight\",\n    \"context_bk\": \"encoder_attn.k_proj.bias\",\n    \"context_bo\": \"encoder_attn.out_proj.bias\",\n    \"context_bq\": \"encoder_attn.q_proj.bias\",\n    \"context_bv\": \"encoder_attn.v_proj.bias\",\n    \"context_Wo_ln_scale\": \"encoder_attn_layer_norm.weight\",\n    \"context_Wo_ln_bias\": \"encoder_attn_layer_norm.bias\",\n}\n\n\nclass OpusState:\n    def __init__(self, source_dir):\n        npz_path = find_model_file(source_dir)\n        self.state_dict = np.load(npz_path)\n        cfg = load_config_from_state_dict(self.state_dict)\n        assert cfg[\"dim-vocabs\"][0] == cfg[\"dim-vocabs\"][1]\n        assert \"Wpos\" not in self.state_dict\n        self.state_dict = dict(self.state_dict)\n        self.wemb, self.final_bias = add_emb_entries(self.state_dict[\"Wemb\"], self.state_dict[BIAS_KEY], 1)\n        self.pad_token_id = self.wemb.shape[0] - 1\n        cfg[\"vocab_size\"] = self.pad_token_id + 1\n        # self.state_dict['Wemb'].sha\n        self.state_keys = list(self.state_dict.keys())\n        if \"Wtype\" in self.state_dict:\n            raise ValueError(\"found Wtype key\")\n        self._check_layer_entries()\n        self.source_dir = source_dir\n        self.cfg = cfg\n        hidden_size, intermediate_shape = self.state_dict[\"encoder_l1_ffn_W1\"].shape\n        assert hidden_size == cfg[\"dim-emb\"] == 512\n\n        # Process decoder.yml\n        decoder_yml = cast_marian_config(load_yaml(source_dir / \"decoder.yml\"))\n        check_marian_cfg_assumptions(cfg)\n        self.hf_config = MarianConfig(\n            vocab_size=cfg[\"vocab_size\"],\n            decoder_layers=cfg[\"dec-depth\"],\n            encoder_layers=cfg[\"enc-depth\"],\n            decoder_attention_heads=cfg[\"transformer-heads\"],\n            encoder_attention_heads=cfg[\"transformer-heads\"],\n            decoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            encoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            d_model=cfg[\"dim-emb\"],\n            activation_function=cfg[\"transformer-aan-activation\"],\n            pad_token_id=self.pad_token_id,\n            eos_token_id=0,\n            bos_token_id=0,\n            max_position_embeddings=cfg[\"dim-emb\"],\n            scale_embedding=True,\n            normalize_embedding=\"n\" in cfg[\"transformer-preprocess\"],\n            static_position_embeddings=not cfg[\"transformer-train-position-embeddings\"],\n            dropout=0.1,  # see opus-mt-train repo/transformer-dropout param.\n            # default: add_final_layer_norm=False,\n            num_beams=decoder_yml[\"beam-size\"],\n            decoder_start_token_id=self.pad_token_id,\n            bad_words_ids=[[self.pad_token_id]],\n            max_length=512,\n        )\n\n    def _check_layer_entries(self):\n        self.encoder_l1 = self.sub_keys(\"encoder_l1\")\n        self.decoder_l1 = self.sub_keys(\"decoder_l1\")\n        self.decoder_l2 = self.sub_keys(\"decoder_l2\")\n        if len(self.encoder_l1) != 16:\n            
warnings.warn(f\"Expected 16 keys for each encoder layer, got {len(self.encoder_l1)}\")\n        if len(self.decoder_l1) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n        if len(self.decoder_l2) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n\n    @property\n    def extra_keys(self):\n        extra = []\n        for k in self.state_keys:\n            if (\n                k.startswith(\"encoder_l\")\n                or k.startswith(\"decoder_l\")\n                or k in [CONFIG_KEY, \"Wemb\", \"Wpos\", \"decoder_ff_logit_out_b\"]\n            ):\n                continue\n            else:\n                extra.append(k)\n        return extra\n\n    def sub_keys(self, layer_prefix):\n        return [remove_prefix(k, layer_prefix) for k in self.state_dict if k.startswith(layer_prefix)]\n\n    def load_marian_model(self) -> MarianMTModel:\n        state_dict, cfg = self.state_dict, self.hf_config\n\n        assert cfg.static_position_embeddings\n        model = MarianMTModel(cfg)\n\n        assert \"hidden_size\" not in cfg.to_dict()\n        load_layers_(\n            model.model.encoder.layers, state_dict, BART_CONVERTER,\n        )\n        load_layers_(model.model.decoder.layers, state_dict, BART_CONVERTER, is_decoder=True)\n\n        # handle tensors not associated with layers\n        wemb_tensor = torch.nn.Parameter(torch.FloatTensor(self.wemb))\n        bias_tensor = torch.nn.Parameter(torch.FloatTensor(self.final_bias))\n        model.model.shared.weight = wemb_tensor\n        model.model.encoder.embed_tokens = model.model.decoder.embed_tokens = model.model.shared\n\n        model.final_logits_bias = bias_tensor\n\n        if \"Wpos\" in state_dict:\n            print(\"Unexpected: got Wpos\")\n            wpos_tensor = torch.tensor(state_dict[\"Wpos\"])\n            model.model.encoder.embed_positions.weight = wpos_tensor\n            model.model.decoder.embed_positions.weight = wpos_tensor\n\n        if cfg.normalize_embedding:\n            assert \"encoder_emb_ln_scale_pre\" in state_dict\n            raise NotImplementedError(\"Need to convert layernorm_embedding\")\n\n        assert not self.extra_keys, f\"Failed to convert {self.extra_keys}\"\n        assert model.model.shared.padding_idx == self.pad_token_id\n        return model\n\n\ndef download_and_unzip(url, dest_dir):\n    try:\n        import wget\n    except ImportError:\n        raise ImportError(\"you must pip install wget\")\n\n    filename = wget.download(url)\n    unzip(filename, dest_dir)\n    os.remove(filename)\n\n\ndef convert(source_dir: Path, dest_dir):\n    dest_dir = Path(dest_dir)\n    dest_dir.mkdir(exist_ok=True)\n\n    add_special_tokens_to_vocab(source_dir)\n    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))\n    save_tokenizer(tokenizer, dest_dir)\n\n    opus_state = OpusState(source_dir)\n    assert opus_state.cfg[\"vocab_size\"] == len(tokenizer.encoder)\n    # save_json(opus_state.cfg, dest_dir / \"marian_original_config.json\")\n    # ^^ Save human readable marian config for debugging\n\n    model = opus_state.load_marian_model()\n    model.save_pretrained(dest_dir)\n    model.from_pretrained(dest_dir)  # sanity check\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\"--src\", type=str, help=\"path to marian model dir\", default=\"en-de\")\n    parser.add_argument(\"--dest\", 
type=str, default=None, help=\"Path to the output PyTorch model.\")\n    args = parser.parse_args()\n\n    source_dir = Path(args.src)\n    assert source_dir.exists()\n    dest_dir = f\"converted-{source_dir.name}\" if args.dest is None else args.dest\n    convert(source_dir, dest_dir)\n\n\ndef load_yaml(path):\n    import yaml\n\n    with open(path) as f:\n        return yaml.load(f, Loader=yaml.BaseLoader)\n\n\ndef save_json(content: Union[Dict, List], path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(content, f)\n\n\ndef unzip(zip_path: str, dest_dir: str) -> None:\n    with ZipFile(zip_path, \"r\") as zipObj:\n        zipObj.extractall(dest_dir)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_openai_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, OpenAIGPTConfig, OpenAIGPTModel, load_tf_weights_in_openai_gpt\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if openai_config_file == \"\":\n        config = OpenAIGPTConfig()\n    else:\n        config = OpenAIGPTConfig.from_json_file(openai_config_file)\n    model = OpenAIGPTModel(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--openai_checkpoint_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the TensorFlow checkpoint path.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--openai_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_openai_checkpoint_to_pytorch(\n        args.openai_checkpoint_folder_path, args.openai_config_file, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_pytorch_checkpoint_to_tf2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Convert pytorch checkpoints to TensorFlow \"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nfrom transformers import (\n    ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    WEIGHTS_NAME,\n    XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    AlbertConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TFAlbertForPreTraining,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFCamembertForMaskedLM,\n    TFCTRLLMHeadModel,\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFElectraForPreTraining,\n    TFFlaubertWithLMHeadModel,\n    TFGPT2LMHeadModel,\n    TFOpenAIGPTLMHeadModel,\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFT5ForConditionalGeneration,\n    TFTransfoXLLMHeadModel,\n    TFXLMRobertaForMaskedLM,\n    TFXLMWithLMHeadModel,\n    TFXLNetLMHeadModel,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n    cached_path,\n    hf_bucket_url,\n    is_torch_available,\n    load_pytorch_checkpoint_in_tf2_model,\n)\n\n\nif is_torch_available():\n    import torch\n    import numpy as np\n    from transformers import (\n        BertForPreTraining,\n        BertForQuestionAnswering,\n        BertForSequenceClassification,\n        GPT2LMHeadModel,\n        XLNetLMHeadModel,\n        XLMWithLMHeadModel,\n        XLMRobertaForMaskedLM,\n        TransfoXLLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        RobertaForMaskedLM,\n        RobertaForSequenceClassification,\n        CamembertForMaskedLM,\n        FlaubertWithLMHeadModel,\n        DistilBertForMaskedLM,\n        DistilBertForQuestionAnswering,\n        CTRLLMHeadModel,\n        AlbertForPreTraining,\n        T5ForConditionalGeneration,\n        ElectraForPreTraining,\n    )\n\n\nlogging.basicConfig(level=logging.INFO)\n\nMODEL_CLASSES = {\n    \"bert\": (BertConfig, TFBertForPreTraining, BertForPreTraining, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    
\"bert-large-cased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"bert-base-cased-finetuned-mrpc\": (\n        BertConfig,\n        TFBertForSequenceClassification,\n        BertForSequenceClassification,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"gpt2\": (GPT2Config, TFGPT2LMHeadModel, GPT2LMHeadModel, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlnet\": (XLNetConfig, TFXLNetLMHeadModel, XLNetLMHeadModel, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm\": (XLMConfig, TFXLMWithLMHeadModel, XLMWithLMHeadModel, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm-roberta\": (\n        XLMRobertaConfig,\n        TFXLMRobertaForMaskedLM,\n        XLMRobertaForMaskedLM,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"transfo-xl\": (\n        TransfoXLConfig,\n        TFTransfoXLLMHeadModel,\n        TransfoXLLMHeadModel,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"openai-gpt\": (\n        OpenAIGPTConfig,\n        TFOpenAIGPTLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"roberta\": (RobertaConfig, TFRobertaForMaskedLM, RobertaForMaskedLM, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"roberta-large-mnli\": (\n        RobertaConfig,\n        TFRobertaForSequenceClassification,\n        RobertaForSequenceClassification,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"camembert\": (\n        CamembertConfig,\n        TFCamembertForMaskedLM,\n        CamembertForMaskedLM,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"flaubert\": (\n        FlaubertConfig,\n        TFFlaubertWithLMHeadModel,\n        FlaubertWithLMHeadModel,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert\": (\n        DistilBertConfig,\n        TFDistilBertForMaskedLM,\n        DistilBertForMaskedLM,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert-base-distilled-squad\": (\n        DistilBertConfig,\n        TFDistilBertForQuestionAnswering,\n        DistilBertForQuestionAnswering,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"ctrl\": (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"albert\": (AlbertConfig, TFAlbertForPreTraining, AlbertForPreTraining, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"t5\": (T5Config, TFT5ForConditionalGeneration, T5ForConditionalGeneration, T5_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"electra\": (ElectraConfig, TFElectraForPreTraining, ElectraForPreTraining, ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n}\n\n\ndef convert_pt_checkpoint_to_tf(\n    model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True\n):\n    if model_type not in MODEL_CLASSES:\n        raise ValueError(\"Unrecognized model type, should be one of {}.\".format(list(MODEL_CLASSES.keys())))\n\n    config_class, model_class, pt_model_class, aws_config_map = MODEL_CLASSES[model_type]\n\n    # Initialise TF model\n    if config_file in aws_config_map:\n        config_file = cached_path(aws_config_map[config_file], force_download=not use_cached_models)\n    config = config_class.from_json_file(config_file)\n    config.output_hidden_states = True\n    config.output_attentions = True\n    print(\"Building TensorFlow model from configuration: {}\".format(str(config)))\n    
tf_model = model_class(config)\n\n    # Load weights from tf checkpoint\n    if pytorch_checkpoint_path in aws_config_map.keys():\n        pytorch_checkpoint_url = hf_bucket_url(pytorch_checkpoint_path, filename=WEIGHTS_NAME)\n        pytorch_checkpoint_path = cached_path(pytorch_checkpoint_url, force_download=not use_cached_models)\n    # Load PyTorch checkpoint in tf2 model:\n    tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)\n\n    if compare_with_pt_model:\n        tfo = tf_model(tf_model.dummy_inputs, training=False)  # build the network\n\n        state_dict = torch.load(pytorch_checkpoint_path, map_location=\"cpu\")\n        pt_model = pt_model_class.from_pretrained(\n            pretrained_model_name_or_path=None, config=config, state_dict=state_dict\n        )\n\n        with torch.no_grad():\n            pto = pt_model(**pt_model.dummy_inputs)\n\n        np_pt = pto[0].numpy()\n        np_tf = tfo[0].numpy()\n        diff = np.amax(np.abs(np_pt - np_tf))\n        print(\"Max absolute difference between models outputs {}\".format(diff))\n        assert diff <= 2e-2, \"Error, model absolute difference is >2e-2: {}\".format(diff)\n\n    # Save pytorch-model\n    print(\"Save TensorFlow model to {}\".format(tf_dump_path))\n    tf_model.save_weights(tf_dump_path, save_format=\"h5\")\n\n\ndef convert_all_pt_checkpoints_to_tf(\n    args_model_type,\n    tf_dump_path,\n    model_shortcut_names_or_path=None,\n    config_shortcut_names_or_path=None,\n    compare_with_pt_model=False,\n    use_cached_models=False,\n    remove_cached_files=False,\n    only_convert_finetuned_models=False,\n):\n    assert os.path.isdir(args.tf_dump_path), \"--tf_dump_path should be a directory\"\n\n    if args_model_type is None:\n        model_types = list(MODEL_CLASSES.keys())\n    else:\n        model_types = [args_model_type]\n\n    for j, model_type in enumerate(model_types, start=1):\n        print(\"=\" * 100)\n        print(\" Converting model type {}/{}: {}\".format(j, len(model_types), model_type))\n        print(\"=\" * 100)\n        if model_type not in MODEL_CLASSES:\n            raise ValueError(\n                \"Unrecognized model type {}, should be one of {}.\".format(model_type, list(MODEL_CLASSES.keys()))\n            )\n\n        config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]\n\n        if model_shortcut_names_or_path is None:\n            model_shortcut_names_or_path = list(aws_model_maps.keys())\n        if config_shortcut_names_or_path is None:\n            config_shortcut_names_or_path = model_shortcut_names_or_path\n\n        for i, (model_shortcut_name, config_shortcut_name) in enumerate(\n            zip(model_shortcut_names_or_path, config_shortcut_names_or_path), start=1\n        ):\n            print(\"-\" * 100)\n            if \"-squad\" in model_shortcut_name or \"-mrpc\" in model_shortcut_name or \"-mnli\" in model_shortcut_name:\n                if not only_convert_finetuned_models:\n                    print(\"    Skipping finetuned checkpoint {}\".format(model_shortcut_name))\n                    continue\n                model_type = model_shortcut_name\n            elif only_convert_finetuned_models:\n                print(\"    Skipping not finetuned checkpoint {}\".format(model_shortcut_name))\n                continue\n            print(\n                \"    Converting checkpoint {}/{}: {} - model_type {}\".format(\n                    i, len(aws_config_map), 
model_shortcut_name, model_type\n                )\n            )\n            print(\"-\" * 100)\n\n            if config_shortcut_name in aws_config_map:\n                config_file = cached_path(aws_config_map[config_shortcut_name], force_download=not use_cached_models)\n            else:\n                config_file = cached_path(config_shortcut_name, force_download=not use_cached_models)\n\n            if model_shortcut_name in aws_model_maps:\n                model_file = cached_path(aws_model_maps[model_shortcut_name], force_download=not use_cached_models)\n            else:\n                model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)\n\n            if os.path.isfile(model_shortcut_name):\n                model_shortcut_name = \"converted_model\"\n\n            convert_pt_checkpoint_to_tf(\n                model_type=model_type,\n                pytorch_checkpoint_path=model_file,\n                config_file=config_file,\n                tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + \"-tf_model.h5\"),\n                compare_with_pt_model=compare_with_pt_model,\n            )\n            if remove_cached_files:\n                os.remove(config_file)\n                os.remove(model_file)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_dump_path\", default=None, type=str, required=True, help=\"Path to the output Tensorflow dump file.\"\n    )\n    parser.add_argument(\n        \"--model_type\",\n        default=None,\n        type=str,\n        help=\"Model type selected in the list of {}. If not given, will download and convert all the models from AWS.\".format(\n            list(MODEL_CLASSES.keys())\n        ),\n    )\n    parser.add_argument(\n        \"--pytorch_checkpoint_path\",\n        default=None,\n        type=str,\n        help=\"Path to the PyTorch checkpoint path or shortcut name to download from AWS. \"\n        \"If not given, will download and convert all the checkpoints from AWS.\",\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture. 
If not given and \"\n        \"--pytorch_checkpoint_path is not given or is a shortcut name\"\n        \"use the configuration associated to the shortcut name on the AWS\",\n    )\n    parser.add_argument(\n        \"--compare_with_pt_model\", action=\"store_true\", help=\"Compare Tensorflow and PyTorch model predictions.\"\n    )\n    parser.add_argument(\n        \"--use_cached_models\",\n        action=\"store_true\",\n        help=\"Use cached models if possible instead of updating to latest checkpoint versions.\",\n    )\n    parser.add_argument(\n        \"--remove_cached_files\",\n        action=\"store_true\",\n        help=\"Remove pytorch models after conversion (save memory when converting in batches).\",\n    )\n    parser.add_argument(\"--only_convert_finetuned_models\", action=\"store_true\", help=\"Only convert finetuned models.\")\n    args = parser.parse_args()\n\n    # if args.pytorch_checkpoint_path is not None:\n    #     convert_pt_checkpoint_to_tf(args.model_type.lower(),\n    #                                 args.pytorch_checkpoint_path,\n    #                                 args.config_file if args.config_file is not None else args.pytorch_checkpoint_path,\n    #                                 args.tf_dump_path,\n    #                                 compare_with_pt_model=args.compare_with_pt_model,\n    #                                 use_cached_models=args.use_cached_models)\n    # else:\n    convert_all_pt_checkpoints_to_tf(\n        args.model_type.lower() if args.model_type is not None else None,\n        args.tf_dump_path,\n        model_shortcut_names_or_path=[args.pytorch_checkpoint_path]\n        if args.pytorch_checkpoint_path is not None\n        else None,\n        config_shortcut_names_or_path=[args.config_file] if args.config_file is not None else None,\n        compare_with_pt_model=args.compare_with_pt_model,\n        use_cached_models=args.use_cached_models,\n        remove_cached_files=args.remove_cached_files,\n        only_convert_finetuned_models=args.only_convert_finetuned_models,\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_reformer_trax_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Reformer checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pickle\n\nimport numpy as np\nimport torch\n\nfrom transformers import ReformerConfig, ReformerModelWithLMHead\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef set_param(torch_layer, weight, bias=None):\n    # set parameter of one layer\n    assert torch_layer.weight.shape == weight.shape, \"{} layer.weight does not match\".format(torch_layer)\n    torch_layer.weight = torch.nn.Parameter(weight)\n    if bias is not None:\n        assert torch_layer.bias.shape == bias.shape, \"{} layer.bias does not match\".format(torch_layer)\n        torch_layer.bias = torch.nn.Parameter(bias)\n\n\ndef set_layer_weights_in_torch_lsh(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query_key = np.asarray(weights[0])\n    np_value = np.asarray(weights[1])\n    np_dense = np.asarray(weights[2])\n\n    set_param(\n        torch_layer.self_attention.query_key,\n        torch.tensor(np_query_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_layer_weights_in_torch_local(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query = np.asarray(weights[0])\n    np_key = np.asarray(weights[1])\n    np_value = np.asarray(weights[2])\n    np_dense = np.asarray(weights[3])\n\n    set_param(\n        torch_layer.self_attention.query, torch.tensor(np_query).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.key, torch.tensor(np_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_block_weights_in_torch(weights, torch_block, hidden_size):\n    # layernorm 1\n    layer_norm_1 = weights[0][0][0]\n    layer_norm_1_weight = np.asarray(layer_norm_1[0])\n    layer_norm_1_bias = np.asarray(layer_norm_1[1])\n    set_param(\n        torch_block.attention.layer_norm, torch.tensor(layer_norm_1_weight), torch.tensor(layer_norm_1_bias),\n    )\n\n    # lsh weights + output\n    attn_weights = weights[0][1]\n    if len(attn_weights) < 4:\n        set_layer_weights_in_torch_lsh(attn_weights, torch_block.attention, hidden_size)\n    else:\n        set_layer_weights_in_torch_local(attn_weights, torch_block.attention, hidden_size)\n\n    # intermediate weighs\n    intermediate_weights = 
weights[2][0][1][2]\n\n    # Chunked Feed Forward\n    if len(intermediate_weights) == 4:\n        intermediate_weights = intermediate_weights[2]\n\n    # layernorm 2\n    layer_norm_2_weight = np.asarray(intermediate_weights[0][0])\n    layer_norm_2_bias = np.asarray(intermediate_weights[0][1])\n    set_param(\n        torch_block.feed_forward.layer_norm, torch.tensor(layer_norm_2_weight), torch.tensor(layer_norm_2_bias),\n    )\n\n    # intermediate dense\n    inter_dense_weight = np.asarray(intermediate_weights[1][0])\n    inter_dense_bias = np.asarray(intermediate_weights[1][1])\n    set_param(\n        torch_block.feed_forward.dense.dense,\n        torch.tensor(inter_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(inter_dense_bias),\n    )\n\n    # intermediate out\n    out_dense_weight = np.asarray(intermediate_weights[4][0])\n    out_dense_bias = np.asarray(intermediate_weights[4][1])\n    set_param(\n        torch_block.feed_forward.output.dense,\n        torch.tensor(out_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(out_dense_bias),\n    )\n\n\ndef set_model_weights_in_torch(weights, torch_model, hidden_size):\n    # reformer model\n    torch_model_reformer = torch_model.reformer\n\n    # word embeds\n    word_embeddings = np.asarray(weights[1])\n    set_param(\n        torch_model_reformer.embeddings.word_embeddings, torch.tensor(word_embeddings),\n    )\n\n    if isinstance(weights[3], tuple):\n        position_embeddings = torch_model_reformer.embeddings.position_embeddings\n        for emb_idx in range(len(position_embeddings.weights)):\n            emb_weights = np.asarray(weights[3][emb_idx][0])\n            assert position_embeddings.weights[emb_idx].shape == emb_weights.shape, \"{} emb does not match\".format(\n                position_embeddings[emb_idx]\n            )\n            position_embeddings.weights[emb_idx] = torch.nn.Parameter(torch.tensor(emb_weights))\n\n    trax_layer_weights = weights[5]\n    assert len(torch_model_reformer.encoder.layers) * 4 == len(\n        trax_layer_weights\n    ), \"HF and trax model do not have the same number of layers\"\n    for layer_idx, layer in enumerate(torch_model_reformer.encoder.layers):\n        block_weights = trax_layer_weights[4 * layer_idx : 4 * (layer_idx + 1)]\n        set_block_weights_in_torch(block_weights, layer, hidden_size)\n\n    # output layer norm\n    layer_norm_out_weight = np.asarray(weights[7][0])\n    layer_norm_out_bias = np.asarray(weights[7][1])\n    set_param(\n        torch_model_reformer.encoder.layer_norm,\n        torch.tensor(layer_norm_out_weight),\n        torch.tensor(layer_norm_out_bias),\n    )\n\n    # output embeddings\n    output_embed_weights = np.asarray(weights[9][0])\n    output_embed_bias = np.asarray(weights[9][1])\n    set_param(\n        torch_model.lm_head.decoder,\n        torch.tensor(output_embed_weights).transpose(0, 1).contiguous(),\n        torch.tensor(output_embed_bias),\n    )\n\n\ndef convert_trax_checkpoint_to_pytorch(trax_model_pkl_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = ReformerConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = ReformerModelWithLMHead(config)\n\n    with open(trax_model_pkl_path, \"rb\") as f:\n        model_weights = pickle.load(f)[\"weights\"]\n\n    set_model_weights_in_torch(model_weights, model, config.hidden_size)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to 
{}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--trax_model_pkl_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained Reformer model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_trax_checkpoint_to_pytorch(args.trax_model_pkl_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_roberta_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pathlib\n\nimport fairseq\nimport torch\nfrom fairseq.models.roberta import RobertaModel as FairseqRobertaModel\nfrom fairseq.modules import TransformerSentenceEncoderLayer\nfrom packaging import version\n\nfrom transformers.modeling_bert import BertIntermediate, BertLayer, BertOutput, BertSelfAttention, BertSelfOutput\nfrom transformers.modeling_roberta import RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification\n\n\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \"Hello world! cécé herlolip\"\n\n\ndef convert_roberta_checkpoint_to_pytorch(\n    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool\n):\n    \"\"\"\n    Copy/paste/tweak roberta's weights to our BERT structure.\n    \"\"\"\n    roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)\n    roberta.eval()  # disable dropout\n    roberta_sent_encoder = roberta.model.decoder.sentence_encoder\n    config = RobertaConfig(\n        vocab_size=roberta_sent_encoder.embed_tokens.num_embeddings,\n        hidden_size=roberta.args.encoder_embed_dim,\n        num_hidden_layers=roberta.args.encoder_layers,\n        num_attention_heads=roberta.args.encoder_attention_heads,\n        intermediate_size=roberta.args.encoder_ffn_embed_dim,\n        max_position_embeddings=514,\n        type_vocab_size=1,\n        layer_norm_eps=1e-5,  # PyTorch default used in fairseq\n    )\n    if classification_head:\n        config.num_labels = roberta.args.num_classes\n    print(\"Our BERT config:\", config)\n\n    model = RobertaForSequenceClassification(config) if classification_head else RobertaForMaskedLM(config)\n    model.eval()\n\n    # Now let's copy all the weights.\n    # Embeddings\n    model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight\n    model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight\n    model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(\n        model.roberta.embeddings.token_type_embeddings.weight\n    )  # just zero them out b/c RoBERTa doesn't use them.\n    model.roberta.embeddings.LayerNorm.weight = roberta_sent_encoder.emb_layer_norm.weight\n    model.roberta.embeddings.LayerNorm.bias = roberta_sent_encoder.emb_layer_norm.bias\n\n    for i in range(config.num_hidden_layers):\n        # Encoder: start of layer\n        layer: BertLayer = model.roberta.encoder.layer[i]\n        roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i]\n\n        # self attention\n        self_attn: BertSelfAttention = layer.attention.self\n        assert (\n            
roberta_layer.self_attn.k_proj.weight.data.shape\n            == roberta_layer.self_attn.q_proj.weight.data.shape\n            == roberta_layer.self_attn.v_proj.weight.data.shape\n            == torch.Size((config.hidden_size, config.hidden_size))\n        )\n\n        self_attn.query.weight.data = roberta_layer.self_attn.q_proj.weight\n        self_attn.query.bias.data = roberta_layer.self_attn.q_proj.bias\n        self_attn.key.weight.data = roberta_layer.self_attn.k_proj.weight\n        self_attn.key.bias.data = roberta_layer.self_attn.k_proj.bias\n        self_attn.value.weight.data = roberta_layer.self_attn.v_proj.weight\n        self_attn.value.bias.data = roberta_layer.self_attn.v_proj.bias\n\n        # self-attention output\n        self_output: BertSelfOutput = layer.attention.output\n        assert self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape\n        self_output.dense.weight = roberta_layer.self_attn.out_proj.weight\n        self_output.dense.bias = roberta_layer.self_attn.out_proj.bias\n        self_output.LayerNorm.weight = roberta_layer.self_attn_layer_norm.weight\n        self_output.LayerNorm.bias = roberta_layer.self_attn_layer_norm.bias\n\n        # intermediate\n        intermediate: BertIntermediate = layer.intermediate\n        assert intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape\n        intermediate.dense.weight = roberta_layer.fc1.weight\n        intermediate.dense.bias = roberta_layer.fc1.bias\n\n        # output\n        bert_output: BertOutput = layer.output\n        assert bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape\n        bert_output.dense.weight = roberta_layer.fc2.weight\n        bert_output.dense.bias = roberta_layer.fc2.bias\n        bert_output.LayerNorm.weight = roberta_layer.final_layer_norm.weight\n        bert_output.LayerNorm.bias = roberta_layer.final_layer_norm.bias\n        # end of layer\n\n    if classification_head:\n        model.classifier.dense.weight = roberta.model.classification_heads[\"mnli\"].dense.weight\n        model.classifier.dense.bias = roberta.model.classification_heads[\"mnli\"].dense.bias\n        model.classifier.out_proj.weight = roberta.model.classification_heads[\"mnli\"].out_proj.weight\n        model.classifier.out_proj.bias = roberta.model.classification_heads[\"mnli\"].out_proj.bias\n    else:\n        # LM Head\n        model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight\n        model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias\n        model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight\n        model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias\n        model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight\n        model.lm_head.decoder.bias = roberta.model.decoder.lm_head.bias\n\n    # Let's check that we get the same results.\n    input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0)  # batch of size 1\n\n    our_output = model(input_ids)[0]\n    if classification_head:\n        their_output = roberta.model.classification_heads[\"mnli\"](roberta.extract_features(input_ids))\n    else:\n        their_output = roberta.model(input_ids)[0]\n    print(our_output.shape, their_output.shape)\n    max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()\n    print(f\"max_absolute_diff = {max_absolute_diff}\")  # ~ 1e-7\n    success = torch.allclose(our_output, their_output, atol=1e-3)\n    print(\"Do both 
models output the same tensors?\", \"🔥\" if success else \"💩\")\n    if not success:\n        raise Exception(\"Something went wRoNg\")\n\n    pathlib.Path(pytorch_dump_folder_path).mkdir(parents=True, exist_ok=True)\n    print(f\"Saving model to {pytorch_dump_folder_path}\")\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--roberta_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--classification_head\", action=\"store_true\", help=\"Whether to convert a final classification head.\"\n    )\n    args = parser.parse_args()\n    convert_roberta_checkpoint_to_pytorch(\n        args.roberta_checkpoint_path, args.pytorch_dump_folder_path, args.classification_head\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_t5_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The T5 authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert T5 checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import T5Config, T5Model, load_tf_weights_in_t5\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = T5Config.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = T5Model(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_t5(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained T5 model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Transformer XL checkpoint and datasets.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nimport pickle\nimport sys\n\nimport torch\n\nimport transformers.tokenization_transfo_xl as data_utils\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    TransfoXLConfig,\n    TransfoXLLMHeadModel,\n    load_tf_weights_in_transfo_xl,\n)\nfrom transformers.tokenization_transfo_xl import CORPUS_NAME, VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n# We do this to be able to load python 2 datasets pickles\n# See e.g. https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory/2121918#2121918\ndata_utils.Vocab = data_utils.TransfoXLTokenizer\ndata_utils.Corpus = data_utils.TransfoXLCorpus\nsys.modules[\"data_utils\"] = data_utils\nsys.modules[\"vocabulary\"] = data_utils\n\n\ndef convert_transfo_xl_checkpoint_to_pytorch(\n    tf_checkpoint_path, transfo_xl_config_file, pytorch_dump_folder_path, transfo_xl_dataset_file\n):\n    if transfo_xl_dataset_file:\n        # Convert a pre-processed corpus (see original TensorFlow repo)\n        with open(transfo_xl_dataset_file, \"rb\") as fp:\n            corpus = pickle.load(fp, encoding=\"latin1\")\n        # Save vocabulary and dataset cache as Dictionaries (should be better than pickles for the long-term)\n        pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"pretrained_vocab_file\"]\n        print(\"Save vocabulary to {}\".format(pytorch_vocab_dump_path))\n        corpus_vocab_dict = corpus.vocab.__dict__\n        torch.save(corpus_vocab_dict, pytorch_vocab_dump_path)\n\n        corpus_dict_no_vocab = corpus.__dict__\n        corpus_dict_no_vocab.pop(\"vocab\", None)\n        pytorch_dataset_dump_path = pytorch_dump_folder_path + \"/\" + CORPUS_NAME\n        print(\"Save dataset to {}\".format(pytorch_dataset_dump_path))\n        torch.save(corpus_dict_no_vocab, pytorch_dataset_dump_path)\n\n    if tf_checkpoint_path:\n        # Convert a pre-trained TensorFlow model\n        config_path = os.path.abspath(transfo_xl_config_file)\n        tf_path = os.path.abspath(tf_checkpoint_path)\n\n        print(\"Converting Transformer XL checkpoint from {} with config at {}\".format(tf_path, config_path))\n        # Initialise PyTorch model\n        if transfo_xl_config_file == \"\":\n            config = TransfoXLConfig()\n        else:\n            config = TransfoXLConfig.from_json_file(transfo_xl_config_file)\n        print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n        model = TransfoXLLMHeadModel(config)\n\n        model = load_tf_weights_in_transfo_xl(model, config, tf_path)\n        # Save pytorch-model\n        pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n        pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n 
       print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n        torch.save(model.state_dict(), pytorch_weights_dump_path)\n        print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n        with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n            f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--tf_checkpoint_path\",\n        default=\"\",\n        type=str,\n        help=\"An optional path to a TensorFlow checkpoint path to be converted.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_dataset_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional dataset file to be converted in a vocabulary.\",\n    )\n    args = parser.parse_args()\n    convert_transfo_xl_checkpoint_to_pytorch(\n        args.tf_checkpoint_path,\n        args.transfo_xl_config_file,\n        args.pytorch_dump_folder_path,\n        args.transfo_xl_dataset_file,\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_xlm_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport json\nimport logging\n\nimport numpy\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME\nfrom transformers.tokenization_xlm import VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlm_checkpoint_to_pytorch(xlm_checkpoint_path, pytorch_dump_folder_path):\n    # Load checkpoint\n    chkpt = torch.load(xlm_checkpoint_path, map_location=\"cpu\")\n\n    state_dict = chkpt[\"model\"]\n\n    # We have the base model one level deeper than the original XLM repository\n    two_levels_state_dict = {}\n    for k, v in state_dict.items():\n        if \"pred_layer\" in k:\n            two_levels_state_dict[k] = v\n        else:\n            two_levels_state_dict[\"transformer.\" + k] = v\n\n    config = chkpt[\"params\"]\n    config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.FloatTensor, numpy.ndarray)))\n\n    vocab = chkpt[\"dico_word2id\"]\n    vocab = dict((s + \"</w>\" if s.find(\"@@\") == -1 and i > 13 else s.replace(\"@@\", \"\"), i) for s, i in vocab.items())\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"vocab_file\"]\n\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(two_levels_state_dict, pytorch_weights_dump_path)\n\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(config, indent=2) + \"\\n\")\n\n    print(\"Save vocab file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_vocab_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(vocab, indent=2) + \"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--xlm_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_xlm_checkpoint_to_pytorch(args.xlm_checkpoint_path, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/convert_xlnet_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nimport torch\n\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    XLNetConfig,\n    XLNetForQuestionAnswering,\n    XLNetForSequenceClassification,\n    XLNetLMHeadModel,\n    load_tf_weights_in_xlnet,\n)\n\n\nGLUE_TASKS_NUM_LABELS = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlnet_checkpoint_to_pytorch(\n    tf_checkpoint_path, bert_config_file, pytorch_dump_folder_path, finetuning_task=None\n):\n    # Initialise PyTorch model\n    config = XLNetConfig.from_json_file(bert_config_file)\n\n    finetuning_task = finetuning_task.lower() if finetuning_task is not None else \"\"\n    if finetuning_task in GLUE_TASKS_NUM_LABELS:\n        print(\"Building PyTorch XLNetForSequenceClassification model from configuration: {}\".format(str(config)))\n        config.finetuning_task = finetuning_task\n        config.num_labels = GLUE_TASKS_NUM_LABELS[finetuning_task]\n        model = XLNetForSequenceClassification(config)\n    elif \"squad\" in finetuning_task:\n        config.finetuning_task = finetuning_task\n        model = XLNetForQuestionAnswering(config)\n    else:\n        model = XLNetLMHeadModel(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_xlnet(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n    pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n    print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--xlnet_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained XLNet model. 
\\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--finetuning_task\",\n        default=None,\n        type=str,\n        help=\"Name of a task on which the XLNet TensorFloaw model was fine-tuned\",\n    )\n    args = parser.parse_args()\n    print(args)\n\n    convert_xlnet_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.xlnet_config_file, args.pytorch_dump_folder_path, args.finetuning_task\n    )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .metrics import is_sklearn_available\nfrom .processors import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n\nif is_sklearn_available():\n    from .metrics import glue_compute_metrics, xnli_compute_metrics\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/data_collator.py",
    "content": "from abc import ABC, abstractmethod\nfrom dataclasses import dataclass\nfrom typing import Any, Dict, List, NewType, Tuple\n\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nimport random\nimport numpy as np\nfrom ..tokenization_utils import PreTrainedTokenizer\n\n\nclass DataCollator(ABC):\n    \"\"\"\n    A `DataCollator` is responsible for batching\n    and pre-processing samples of data as requested by the training loop.\n    \"\"\"\n\n    @abstractmethod\n    def collate_batch(self) -> Dict[str, torch.Tensor]:\n        \"\"\"\n        Take a list of samples from a Dataset and collate them into a batch.\n\n        Returns:\n            A dictionary of tensors\n        \"\"\"\n        pass\n\n\nInputDataClass = NewType(\"InputDataClass\", Any)\n\n\n@dataclass\nclass DefaultDataCollator(DataCollator):\n    \"\"\"\n    Very simple data collator that:\n    - simply collates batches of dict-like objects\n    - Performs special handling for potential keys named:\n        - `label`: handles a single value (int or float) per object\n        - `label_ids`: handles a list of values per object\n    - does not do any additional preprocessing\n\n    i.e., Property names of the input object will be used as corresponding inputs to the model.\n    See glue and ner for example of how it's useful.\n    \"\"\"\n\n    def collate_batch(self, features: List[InputDataClass]) -> Dict[str, torch.Tensor]:\n        # In this method we'll make the assumption that all `features` in the batch\n        # have the same attributes.\n        # So we will look at the first element as a proxy for what attributes exist\n        # on the whole batch.\n        first = features[0]\n\n        # Special handling for labels.\n        # Ensure that tensor is created with the correct type\n        # (it should be automatically the case, but let's make sure of it.)\n        if hasattr(first, \"label\") and first.label is not None:\n            if type(first.label) is int:\n                labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        elif hasattr(first, \"label_ids\") and first.label_ids is not None:\n            if type(first.label_ids[0]) is int:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        else:\n            batch = {}\n\n        # Handling of all other possible attributes.\n        # Again, we will use the first element to figure out which key/values are not None for this model.\n        for k, v in vars(first).items():\n            if k not in (\"label\", \"label_ids\") and v is not None and not isinstance(v, str):\n                batch[k] = torch.tensor([getattr(f, k) for f in features], dtype=torch.long)\n        return batch\n\n\n@dataclass\nclass DataCollatorForLanguageModeling(DataCollator):\n    \"\"\"\n    Data collator used for language modeling.\n    - collates batches of tensors, honoring their tokenizer's pad_token\n    - preprocesses batches for masked language modeling\n    \"\"\"\n\n    tokenizer: PreTrainedTokenizer\n    mlm: bool = True\n    mlm_probability: float = 0.15\n\n    def collate_batch(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:\n        batch = 
self._tensorize_batch(examples)\n        if self.mlm:\n            inputs, labels = self.mask_tokens7(batch)\n            return {\"input_ids\": inputs, \"labels\": labels}\n        else:\n            return {\"input_ids\": batch, \"labels\": batch}\n\n    def _tensorize_batch(self, examples: List[torch.Tensor]) -> torch.Tensor:\n        length_of_first = examples[0].size(0)\n        are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)\n        if are_tensors_same_length:\n            return torch.stack(examples, dim=0)\n        else:\n            if self.tokenizer._pad_token is None:\n                raise ValueError(\n                    \"You are attempting to pad samples but the tokenizer you are using\"\n                    f\" ({self.tokenizer.__class__.__name__}) does not have one.\"\n                )\n            return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)\n\n    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        masked_indices = torch.bernoulli(probability_matrix).bool()\n        labels[~masked_indices] = -100  # We only compute loss on masked tokens\n\n        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])\n        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices\n        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)\n\n        # 10% of the time, we replace masked input tokens with random word\n        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced\n        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)\n        inputs[indices_random] = random_words[indices_random]\n\n        # The rest of the time (10% of the time) we keep the masked input tokens unchanged\n        return inputs, labels\n\n    def mask_tokens2(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            inputs[i][j] = self.tokenizer.mask_token_id\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens3(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.85:\n                                for k in range(j,min(j+5,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.7647:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.5384:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.42857:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens4(self, inputs: torch.Tensor) -> 
Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        inputs = inputs.numpy()\n        ids = [i for i in range(len(inputs))]\n        random.shuffle(ids)\n        inputs = inputs[ids]\n        inputs = torch.from_numpy(inputs)\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        total_token = 0\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n\n        cur_token = 0\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        ngramFlag = True\n        for i in range(len(probability_matrix)):\n            if cur_token > total_token * 0.03:\n                ngramFlag = False\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9 and ngramFlag:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.222 and ngramFlag:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857 and ngramFlag:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                     
   cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens5(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        pvals = [0.4, 0.3, 0.2, 0.1]\n        ngrams = np.arange(1, 5, dtype=np.int64)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            choose = random.randint(0, 1)\n            if choose == 0:\n                startIndex = 0\n                endIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n            elif choose == 1:\n                startIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n                endIndex = np.argwhere(inputs[i] == np.float32(3))[-1][0]\n\n            valid_j = [index for index in range(startIndex, endIndex + 1)]\n\n            for j in range(len(probability_matrix[0])):\n                if cur_token < total_token * 0.15:\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        n = np.random.choice(ngrams, p=pvals)\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if j+k in valid_j:\n                                if random.random() > 0.7:\n                                    if 
random.random() > 0.2:\n                                        if probability_matrix[i][j+k] == np.float32(0.15):\n                                            inputs[i][j+k] = self.tokenizer.mask_token_id\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    elif random.random() > 0.5:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    else:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                else:\n                                    labels[i][j] = np.float32(-100)\n                            else:\n                                labels[i][j] = np.float32(-100)\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens6(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token > total_token*0.15:\n                    break\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9:\n                                for k in range(j, min(j + 4, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.222:\n                                for k in range(j, min(j + 3, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857:\n                                for k in range(j, min(j + 2, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i, j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            cur_token += 1\n\n                    else:\n                        labels[i][j] = np.float32(-100)\n\n\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        
inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens7(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        ngrams = np.arange(1, 3 + 1, dtype=np.int64)\n        pvals = 1. / np.arange(1, 3 + 1)\n        pvals /= pvals.sum(keepdims=True)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token <= total_token * 0.15:\n                    n = np.random.choice(ngrams, p=pvals)\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if random.random() > 0.85:\n                                if random.random() > 0.2:\n                                    if probability_matrix[i][j+k] == np.float32(0.15):\n                                        inputs[i][j+k] = self.tokenizer.mask_token_id\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                elif random.random() > 0.5:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                else:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                            else:\n                                labels[i][j] = np.float32(-100)\n\n                    else:\n                     
   labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/datasets/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import GlueDataset, GlueDataTrainingArguments\nfrom .language_modeling import LineByLineTextDataset, TextDataset\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/datasets/glue.py",
    "content": "import logging\nimport os\nimport time\nfrom dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom ...tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors\nfrom ..processors.utils import InputFeatures\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass GlueDataTrainingArguments:\n    \"\"\"\n    Arguments pertaining to what data we are going to input our model for training and eval.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    task_name: str = field(metadata={\"help\": \"The name of the task to train on: \" + \", \".join(glue_processors.keys())})\n    data_dir: str = field(\n        metadata={\"help\": \"The input data dir. Should contain the .tsv files (or other data files) for the task.\"}\n    )\n    max_seq_length: int = field(\n        default=128,\n        metadata={\n            \"help\": \"The maximum total input sequence length after tokenization. Sequences longer \"\n            \"than this will be truncated, sequences shorter will be padded.\"\n        },\n    )\n    overwrite_cache: bool = field(\n        default=False, metadata={\"help\": \"Overwrite the cached training and evaluation sets\"}\n    )\n\n    def __post_init__(self):\n        self.task_name = self.task_name.lower()\n\n\nclass Split(Enum):\n    train = \"train\"\n    dev = \"dev\"\n    test = \"test\"\n\n\nclass GlueDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    args: GlueDataTrainingArguments\n    output_mode: str\n    features: List[InputFeatures]\n\n    def __init__(\n        self,\n        args: GlueDataTrainingArguments,\n        tokenizer: PreTrainedTokenizer,\n        limit_length: Optional[int] = None,\n        mode: Union[str, Split] = Split.train,\n    ):\n        self.args = args\n        self.processor = glue_processors[args.task_name]()\n        self.output_mode = glue_output_modes[args.task_name]\n        if isinstance(mode, str):\n            try:\n                mode = Split[mode]\n            except KeyError:\n                raise KeyError(\"mode is not a valid split name\")\n        # Load data features from cache or dataset file\n        cached_features_file = os.path.join(\n            args.data_dir,\n            \"cached_{}_{}_{}_{}\".format(\n                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,\n            ),\n        )\n        label_list = self.processor.get_labels()\n        if args.task_name in [\"mnli\", \"mnli-mm\"] and tokenizer.__class__ in (\n            RobertaTokenizer,\n            RobertaTokenizerFast,\n            XLMRobertaTokenizer,\n        ):\n            # HACK(label indices are swapped in RoBERTa pretrained model)\n            label_list[1], label_list[2] = label_list[2], label_list[1]\n        self.label_list = label_list\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with 
FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not args.overwrite_cache:\n                start = time.time()\n                self.features = torch.load(cached_features_file)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n            else:\n                logger.info(f\"Creating features from dataset file at {args.data_dir}\")\n\n                if mode == Split.dev:\n                    examples = self.processor.get_dev_examples(args.data_dir)\n                elif mode == Split.test:\n                    examples = self.processor.get_test_examples(args.data_dir)\n                else:\n                    examples = self.processor.get_train_examples(args.data_dir)\n                if limit_length is not None:\n                    examples = examples[:limit_length]\n                self.features = glue_convert_examples_to_features(\n                    examples,\n                    tokenizer,\n                    max_length=args.max_seq_length,\n                    label_list=label_list,\n                    output_mode=self.output_mode,\n                )\n                start = time.time()\n                torch.save(self.features, cached_features_file)\n                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.features)\n\n    def __getitem__(self, i) -> InputFeatures:\n        return self.features[i]\n\n    def get_labels(self):\n        return self.label_list\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/datasets/language_modeling.py",
    "content": "import logging\nimport os\nimport pickle\nimport time\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(\n        self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,\n    ):\n        assert os.path.isfile(file_path)\n\n        block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)\n\n        directory, filename = os.path.split(file_path)\n        cached_features_file = os.path.join(\n            directory, \"cached_lm_{}_{}_{}\".format(tokenizer.__class__.__name__, str(block_size), filename,),\n        )\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not overwrite_cache:\n                start = time.time()\n                with open(cached_features_file, \"rb\") as handle:\n                    self.examples = pickle.load(handle)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n\n            else:\n                logger.info(f\"Creating features from dataset file at {directory}\")\n\n                self.examples = []\n                with open(file_path, encoding=\"utf-8\") as f:\n                    text = f.read()\n\n                tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))\n\n                for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size\n                    self.examples.append(\n                        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])\n                    )\n                # Note that we are losing the last truncated example here for the sake of simplicity (no padding)\n                # If your dataset is small, first you should loook for a bigger one :-) and second you\n                # can change this behavior by adding (model specific) padding.\n\n                start = time.time()\n                with open(cached_features_file, \"wb\") as handle:\n                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n\n\nclass LineByLineTextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):\n        assert os.path.isfile(file_path)\n        # Here, we do not cache the features, operating under the assumption\n        # that we will soon use fast multithreaded tokenizers from the\n        # `tokenizers` repo everywhere =)\n        logger.info(\"Creating features from dataset file at %s\", file_path)\n\n        with 
open(file_path, encoding=\"utf-8\") as f:\n            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]\n\n        batch_encoding = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)\n        self.examples = batch_encoding[\"input_ids\"]\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/metrics/__init__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\ntry:\n    from scipy.stats import pearsonr, spearmanr\n    from sklearn.metrics import matthews_corrcoef, f1_score\n\n    _has_sklearn = True\nexcept (AttributeError, ImportError):\n    _has_sklearn = False\n\n\ndef is_sklearn_available():\n    return _has_sklearn\n\n\nif _has_sklearn:\n\n    def simple_accuracy(preds, labels):\n        return (preds == labels).mean()\n\n    def acc_and_f1(preds, labels):\n        acc = simple_accuracy(preds, labels)\n        f1 = f1_score(y_true=labels, y_pred=preds)\n        return {\n            \"acc\": acc,\n            \"f1\": f1,\n            \"acc_and_f1\": (acc + f1) / 2,\n        }\n\n    def pearson_and_spearman(preds, labels):\n        pearson_corr = pearsonr(preds, labels)[0]\n        spearman_corr = spearmanr(preds, labels)[0]\n        return {\n            \"pearson\": pearson_corr,\n            \"spearmanr\": spearman_corr,\n            \"corr\": (pearson_corr + spearman_corr) / 2,\n        }\n\n    def glue_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"cola\":\n            return {\"mcc\": matthews_corrcoef(labels, preds)}\n        elif task_name == \"sst-2\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mrpc\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"sts-b\":\n            return pearson_and_spearman(preds, labels)\n        elif task_name == \"qqp\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"mnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mnli-mm\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"qnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"rte\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"wnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"hans\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n\n    def xnli_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"xnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/metrics/squad_metrics.py",
    "content": "\"\"\" Very heavily inspired by the official evaluation script for SQuAD version 2.0 which was\nmodified by XLNet authors to update `find_best_threshold` scripts for SQuAD V2.0\n\nIn addition to basic functionality, we also compute additional statistics and\nplot precision-recall curves if an additional na_prob.json file is provided.\nThis file is expected to map question ID's to the model's predicted probability\nthat a question is unanswerable.\n\"\"\"\n\n\nimport collections\nimport json\nimport logging\nimport math\nimport re\nimport string\n\nfrom transformers.tokenization_bert import BasicTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef normalize_answer(s):\n    \"\"\"Lower text and remove punctuation, articles and extra whitespace.\"\"\"\n\n    def remove_articles(text):\n        regex = re.compile(r\"\\b(a|an|the)\\b\", re.UNICODE)\n        return re.sub(regex, \" \", text)\n\n    def white_space_fix(text):\n        return \" \".join(text.split())\n\n    def remove_punc(text):\n        exclude = set(string.punctuation)\n        return \"\".join(ch for ch in text if ch not in exclude)\n\n    def lower(text):\n        return text.lower()\n\n    return white_space_fix(remove_articles(remove_punc(lower(s))))\n\n\ndef get_tokens(s):\n    if not s:\n        return []\n    return normalize_answer(s).split()\n\n\ndef compute_exact(a_gold, a_pred):\n    return int(normalize_answer(a_gold) == normalize_answer(a_pred))\n\n\ndef compute_f1(a_gold, a_pred):\n    gold_toks = get_tokens(a_gold)\n    pred_toks = get_tokens(a_pred)\n    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)\n    num_same = sum(common.values())\n    if len(gold_toks) == 0 or len(pred_toks) == 0:\n        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise\n        return int(gold_toks == pred_toks)\n    if num_same == 0:\n        return 0\n    precision = 1.0 * num_same / len(pred_toks)\n    recall = 1.0 * num_same / len(gold_toks)\n    f1 = (2 * precision * recall) / (precision + recall)\n    return f1\n\n\ndef get_raw_scores(examples, preds):\n    \"\"\"\n    Computes the exact and f1 scores from the examples and the model predictions\n    \"\"\"\n    exact_scores = {}\n    f1_scores = {}\n\n    for example in examples:\n        qas_id = example.qas_id\n        gold_answers = [answer[\"text\"] for answer in example.answers if normalize_answer(answer[\"text\"])]\n\n        if not gold_answers:\n            # For unanswerable questions, only correct answer is empty string\n            gold_answers = [\"\"]\n\n        if qas_id not in preds:\n            print(\"Missing prediction for %s\" % qas_id)\n            continue\n\n        prediction = preds[qas_id]\n        exact_scores[qas_id] = max(compute_exact(a, prediction) for a in gold_answers)\n        f1_scores[qas_id] = max(compute_f1(a, prediction) for a in gold_answers)\n\n    return exact_scores, f1_scores\n\n\ndef apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):\n    new_scores = {}\n    for qid, s in scores.items():\n        pred_na = na_probs[qid] > na_prob_thresh\n        if pred_na:\n            new_scores[qid] = float(not qid_to_has_ans[qid])\n        else:\n            new_scores[qid] = s\n    return new_scores\n\n\ndef make_eval_dict(exact_scores, f1_scores, qid_list=None):\n    if not qid_list:\n        total = len(exact_scores)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores.values()) / total),\n            
    (\"f1\", 100.0 * sum(f1_scores.values()) / total),\n                (\"total\", total),\n            ]\n        )\n    else:\n        total = len(qid_list)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores[k] for k in qid_list) / total),\n                (\"f1\", 100.0 * sum(f1_scores[k] for k in qid_list) / total),\n                (\"total\", total),\n            ]\n        )\n\n\ndef merge_eval(main_eval, new_eval, prefix):\n    for k in new_eval:\n        main_eval[\"%s_%s\" % (prefix, k)] = new_eval[k]\n\n\ndef find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for i, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n\n    has_ans_score, has_ans_cnt = 0, 0\n    for qid in qid_list:\n        if not qid_to_has_ans[qid]:\n            continue\n        has_ans_cnt += 1\n\n        if qid not in scores:\n            continue\n        has_ans_score += scores[qid]\n\n    return 100.0 * best_score / len(scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt\n\n\ndef find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(preds, f1_raw, na_probs, qid_to_has_ans)\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n    main_eval[\"has_ans_exact\"] = has_ans_exact\n    main_eval[\"has_ans_f1\"] = has_ans_f1\n\n\ndef find_best_thresh(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for _, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n    return 100.0 * best_score / len(scores), best_thresh\n\n\ndef find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)\n\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n\n\ndef squad_evaluate(examples, preds, no_answer_probs=None, no_answer_probability_threshold=1.0):\n    qas_id_to_has_answer = {example.qas_id: 
bool(example.answers) for example in examples}\n    has_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if has_answer]\n    no_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if not has_answer]\n\n    if no_answer_probs is None:\n        no_answer_probs = {k: 0.0 for k in preds}\n\n    exact, f1 = get_raw_scores(examples, preds)\n\n    exact_threshold = apply_no_ans_threshold(\n        exact, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold\n    )\n    f1_threshold = apply_no_ans_threshold(f1, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold)\n\n    evaluation = make_eval_dict(exact_threshold, f1_threshold)\n\n    if has_answer_qids:\n        has_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=has_answer_qids)\n        merge_eval(evaluation, has_ans_eval, \"HasAns\")\n\n    if no_answer_qids:\n        no_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=no_answer_qids)\n        merge_eval(evaluation, no_ans_eval, \"NoAns\")\n\n    if no_answer_probs:\n        find_all_best_thresh(evaluation, preds, exact, f1, no_answer_probs, qas_id_to_has_answer)\n\n    return evaluation\n\n\ndef get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):\n    \"\"\"Project the tokenized prediction back to the original text.\"\"\"\n\n    # When we created the data, we kept track of the alignment between original\n    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So\n    # now `orig_text` contains the span of our original text corresponding to the\n    # span that we predicted.\n    #\n    # However, `orig_text` may contain extra characters that we don't want in\n    # our prediction.\n    #\n    # For example, let's say:\n    #   pred_text = steve smith\n    #   orig_text = Steve Smith's\n    #\n    # We don't want to return `orig_text` because it contains the extra \"'s\".\n    #\n    # We don't want to return `pred_text` because it's already been normalized\n    # (the SQuAD eval script also does punctuation stripping/lower casing but\n    # our tokenizer does additional normalization like stripping accent\n    # characters).\n    #\n    # What we really want to return is \"Steve Smith\".\n    #\n    # Therefore, we have to apply a semi-complicated alignment heuristic between\n    # `pred_text` and `orig_text` to get a character-to-character alignment. This\n    # can fail in certain cases in which case we just return `orig_text`.\n\n    def _strip_spaces(text):\n        ns_chars = []\n        ns_to_s_map = collections.OrderedDict()\n        for (i, c) in enumerate(text):\n            if c == \" \":\n                continue\n            ns_to_s_map[len(ns_chars)] = i\n            ns_chars.append(c)\n        ns_text = \"\".join(ns_chars)\n        return (ns_text, ns_to_s_map)\n\n    # We first tokenize `orig_text`, strip whitespace from the result\n    # and `pred_text`, and check if they are the same length. If they are\n    # NOT the same length, the heuristic has failed. 
If they are the same\n    # length, we assume the characters are one-to-one aligned.\n    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)\n\n    tok_text = \" \".join(tokenizer.tokenize(orig_text))\n\n    start_position = tok_text.find(pred_text)\n    if start_position == -1:\n        if verbose_logging:\n            logger.info(\"Unable to find text: '%s' in '%s'\" % (pred_text, orig_text))\n        return orig_text\n    end_position = start_position + len(pred_text) - 1\n\n    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)\n    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)\n\n    if len(orig_ns_text) != len(tok_ns_text):\n        if verbose_logging:\n            logger.info(\"Length not equal after stripping spaces: '%s' vs '%s'\", orig_ns_text, tok_ns_text)\n        return orig_text\n\n    # We then project the characters in `pred_text` back to `orig_text` using\n    # the character-to-character alignment.\n    tok_s_to_ns_map = {}\n    for (i, tok_index) in tok_ns_to_s_map.items():\n        tok_s_to_ns_map[tok_index] = i\n\n    orig_start_position = None\n    if start_position in tok_s_to_ns_map:\n        ns_start_position = tok_s_to_ns_map[start_position]\n        if ns_start_position in orig_ns_to_s_map:\n            orig_start_position = orig_ns_to_s_map[ns_start_position]\n\n    if orig_start_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map start position\")\n        return orig_text\n\n    orig_end_position = None\n    if end_position in tok_s_to_ns_map:\n        ns_end_position = tok_s_to_ns_map[end_position]\n        if ns_end_position in orig_ns_to_s_map:\n            orig_end_position = orig_ns_to_s_map[ns_end_position]\n\n    if orig_end_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map end position\")\n        return orig_text\n\n    output_text = orig_text[orig_start_position : (orig_end_position + 1)]\n    return output_text\n\n\ndef _get_best_indexes(logits, n_best_size):\n    \"\"\"Get the n-best logits from a list.\"\"\"\n    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)\n\n    best_indexes = []\n    for i in range(len(index_and_score)):\n        if i >= n_best_size:\n            break\n        best_indexes.append(index_and_score[i][0])\n    return best_indexes\n\n\ndef _compute_softmax(scores):\n    \"\"\"Compute softmax probability over raw logits.\"\"\"\n    if not scores:\n        return []\n\n    max_score = None\n    for score in scores:\n        if max_score is None or score > max_score:\n            max_score = score\n\n    exp_scores = []\n    total_sum = 0.0\n    for score in scores:\n        x = math.exp(score - max_score)\n        exp_scores.append(x)\n        total_sum += x\n\n    probs = []\n    for score in exp_scores:\n        probs.append(score / total_sum)\n    return probs\n\n\ndef compute_predictions_logits(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    do_lower_case,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    verbose_logging,\n    version_2_with_negative,\n    null_score_diff_threshold,\n    tokenizer,\n):\n    \"\"\"Write final predictions to the json file and log-odds of null if needed.\"\"\"\n    if output_prediction_file:\n        logger.info(f\"Writing predictions to: {output_prediction_file}\")\n    if output_nbest_file:\n        logger.info(f\"Writing nbest to: {output_nbest_file}\")\n    if 
output_null_log_odds_file and version_2_with_negative:\n        logger.info(f\"Writing null_log_odds to: {output_null_log_odds_file}\")\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_logit\", \"end_logit\"]\n    )\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n        min_null_feature_index = 0  # the paragraph slice with min null score\n        null_start_logit = 0  # the start logit at the slice with min null score\n        null_end_logit = 0  # the end logit at the slice with min null score\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n            start_indexes = _get_best_indexes(result.start_logits, n_best_size)\n            end_indexes = _get_best_indexes(result.end_logits, n_best_size)\n            # if we could have irrelevant answers, get the min score of irrelevant\n            if version_2_with_negative:\n                feature_null_score = result.start_logits[0] + result.end_logits[0]\n                if feature_null_score < score_null:\n                    score_null = feature_null_score\n                    min_null_feature_index = feature_index\n                    null_start_logit = result.start_logits[0]\n                    null_end_logit = result.end_logits[0]\n            for start_index in start_indexes:\n                for end_index in end_indexes:\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= len(feature.tokens):\n                        continue\n                    if end_index >= len(feature.tokens):\n                        continue\n                    if start_index not in feature.token_to_orig_map:\n                        continue\n                    if end_index not in feature.token_to_orig_map:\n                        continue\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_logit=result.start_logits[start_index],\n                            end_logit=result.end_logits[end_index],\n                        )\n                    )\n        if version_2_with_negative:\n            prelim_predictions.append(\n                _PrelimPrediction(\n                    feature_index=min_null_feature_index,\n                    start_index=0,\n                    end_index=0,\n                    start_logit=null_start_logit,\n                    end_logit=null_end_logit,\n                )\n            )\n        prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True)\n\n        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n            \"NbestPrediction\", [\"text\", \"start_logit\", \"end_logit\"]\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n            if pred.start_index > 0:  # this is a non-null prediction\n                tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n                orig_doc_start = feature.token_to_orig_map[pred.start_index]\n                orig_doc_end = feature.token_to_orig_map[pred.end_index]\n                orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n\n                tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n                # tok_text = \" \".join(tok_tokens)\n                #\n                # # De-tokenize WordPieces that have been split off.\n                # tok_text = tok_text.replace(\" ##\", \"\")\n                # tok_text = tok_text.replace(\"##\", \"\")\n\n                # Clean whitespace\n                tok_text = tok_text.strip()\n                tok_text = \" \".join(tok_text.split())\n                orig_text = \" \".join(orig_tokens)\n\n                final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n                if final_text in seen_predictions:\n                    continue\n\n                seen_predictions[final_text] = True\n            else:\n                final_text = \"\"\n                seen_predictions[final_text] = True\n\n            nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit))\n        # if we didn't include the 
empty option in the n-best, include it\n        if version_2_with_negative:\n            if \"\" not in seen_predictions:\n                nbest.append(_NbestPrediction(text=\"\", start_logit=null_start_logit, end_logit=null_end_logit))\n\n            # In very rare edge cases we could only have single null prediction.\n            # So we just create a nonce prediction in this case to avoid failure.\n            if len(nbest) == 1:\n                nbest.insert(0, _NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        # In very rare edge cases we could have no valid predictions. So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        assert len(nbest) >= 1\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_logit + entry.end_logit)\n            if not best_non_null_entry:\n                if entry.text:\n                    best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_logit\"] = entry.start_logit\n            output[\"end_logit\"] = entry.end_logit\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n\n        if not version_2_with_negative:\n            all_predictions[example.qas_id] = nbest_json[0][\"text\"]\n        else:\n            # predict \"\" iff the null score - the score of best non-null > threshold\n            score_diff = score_null - best_non_null_entry.start_logit - (best_non_null_entry.end_logit)\n            scores_diff_json[example.qas_id] = score_diff\n            if score_diff > null_score_diff_threshold:\n                all_predictions[example.qas_id] = \"\"\n            else:\n                all_predictions[example.qas_id] = best_non_null_entry.text\n        all_nbest_json[example.qas_id] = nbest_json\n\n    if output_prediction_file:\n        with open(output_prediction_file, \"w\") as writer:\n            writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    if output_nbest_file:\n        with open(output_nbest_file, \"w\") as writer:\n            writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if output_null_log_odds_file and version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n\n\ndef compute_predictions_log_probs(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    start_n_top,\n    end_n_top,\n    version_2_with_negative,\n    tokenizer,\n    verbose_logging,\n):\n    \"\"\" XLNet write prediction logic (more complex than Bert's).\n        Write final predictions to the json file and log-odds of null if needed.\n\n        Requires utils_squad_evaluate.py\n    \"\"\"\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    _NbestPrediction = 
collections.namedtuple(  # pylint: disable=invalid-name\n        \"NbestPrediction\", [\"text\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    logger.info(\"Writing predictions to: %s\", output_prediction_file)\n    # logger.info(\"Writing nbest to: %s\" % (output_nbest_file))\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n\n            cur_null_score = result.cls_logits\n\n            # if we could have irrelevant answers, get the min score of irrelevant\n            score_null = min(score_null, cur_null_score)\n\n            for i in range(start_n_top):\n                for j in range(end_n_top):\n                    start_log_prob = result.start_logits[i]\n                    start_index = result.start_top_index[i]\n\n                    j_index = i * end_n_top + j\n\n                    end_log_prob = result.end_logits[j_index]\n                    end_index = result.end_top_index[j_index]\n\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= feature.paragraph_len - 1:\n                        continue\n                    if end_index >= feature.paragraph_len - 1:\n                        continue\n\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_log_prob=start_log_prob,\n                            end_log_prob=end_log_prob,\n                        )\n                    )\n\n        prelim_predictions = sorted(\n            prelim_predictions, key=lambda x: (x.start_log_prob + x.end_log_prob), reverse=True\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n\n            # XLNet un-tokenizer\n            # Let's keep it simple for now and see if we need all this later.\n            #\n            # tok_start_to_orig_index = feature.tok_start_to_orig_index\n            # tok_end_to_orig_index = feature.tok_end_to_orig_index\n            # start_orig_pos = tok_start_to_orig_index[pred.start_index]\n            # end_orig_pos = tok_end_to_orig_index[pred.end_index]\n            # paragraph_text = example.paragraph_text\n            # final_text = paragraph_text[start_orig_pos: end_orig_pos + 1].strip()\n\n            # Previously used Bert untokenizer\n            tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n            orig_doc_start = feature.token_to_orig_map[pred.start_index]\n            orig_doc_end = feature.token_to_orig_map[pred.end_index]\n            orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n            tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n            # Clean whitespace\n            tok_text = tok_text.strip()\n            tok_text = \" \".join(tok_text.split())\n            orig_text = \" \".join(orig_tokens)\n\n            if hasattr(tokenizer, \"do_lower_case\"):\n                do_lower_case = tokenizer.do_lower_case\n            else:\n                do_lower_case = tokenizer.do_lowercase_and_remove_accent\n\n            final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n\n            if final_text in seen_predictions:\n                continue\n\n            seen_predictions[final_text] = True\n\n            nbest.append(\n                _NbestPrediction(text=final_text, start_log_prob=pred.start_log_prob, end_log_prob=pred.end_log_prob)\n            )\n\n        # In very rare edge cases we could have no valid predictions. 
So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"\", start_log_prob=-1e6, end_log_prob=-1e6))\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_log_prob + entry.end_log_prob)\n            if not best_non_null_entry:\n                best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_log_prob\"] = entry.start_log_prob\n            output[\"end_log_prob\"] = entry.end_log_prob\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n        assert best_non_null_entry is not None\n\n        score_diff = score_null\n        scores_diff_json[example.qas_id] = score_diff\n        # note(zhiliny): always predict best_non_null_entry\n        # and the evaluation script will search for the best threshold\n        all_predictions[example.qas_id] = best_non_null_entry.text\n\n        all_nbest_json[example.qas_id] = nbest_json\n\n    with open(output_prediction_file, \"w\") as writer:\n        writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    with open(output_nbest_file, \"w\") as writer:\n        writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/processors/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels\nfrom .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features\nfrom .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor\nfrom .xnli import xnli_output_modes, xnli_processors, xnli_tasks_num_labels\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/processors/glue.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" GLUE processors and helpers \"\"\"\n\nimport logging\nimport os\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom .utils import DataProcessor, InputExample, InputFeatures\n\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef glue_convert_examples_to_features(\n    examples: Union[List[InputExample], \"tf.data.Dataset\"],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    \"\"\"\n    Loads a data file into a list of ``InputFeatures``\n\n    Args:\n        examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.\n        tokenizer: Instance of a tokenizer that will tokenize the examples\n        max_length: Maximum example length. Defaults to the tokenizer's max_len\n        task: GLUE task\n        label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n        output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n\n    Returns:\n        If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n        containing the task-specific features. 
If the input is a list of ``InputExamples``, will return\n        a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n    \"\"\"\n    if is_tf_available() and isinstance(examples, tf.data.Dataset):\n        if task is None:\n            raise ValueError(\"When calling glue_convert_examples_to_features from TF, the task parameter is required.\")\n        return _tf_glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n    return _glue_convert_examples_to_features(\n        examples, tokenizer, max_length=max_length, task=task, label_list=label_list, output_mode=output_mode\n    )\n\n\nif is_tf_available():\n\n    def _tf_glue_convert_examples_to_features(\n        examples: tf.data.Dataset, tokenizer: PreTrainedTokenizer, task=str, max_length: Optional[int] = None,\n    ) -> tf.data.Dataset:\n        \"\"\"\n        Returns:\n            A ``tf.data.Dataset`` containing the task-specific features.\n\n        \"\"\"\n        processor = glue_processors[task]()\n        examples = [processor.tfds_map(processor.get_example_from_tensor_dict(example)) for example in examples]\n        features = glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n\n        def gen():\n            for ex in features:\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                    },\n                    ex.label,\n                )\n\n        return tf.data.Dataset.from_generator(\n            gen,\n            ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32, \"token_type_ids\": tf.int32}, tf.int64),\n            (\n                {\n                    \"input_ids\": tf.TensorShape([None]),\n                    \"attention_mask\": tf.TensorShape([None]),\n                    \"token_type_ids\": tf.TensorShape([None]),\n                },\n                tf.TensorShape([]),\n            ),\n        )\n\n\ndef _glue_convert_examples_to_features(\n    examples: List[InputExample],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    if max_length is None:\n        max_length = tokenizer.max_len\n\n    if task is not None:\n        processor = glue_processors[task]()\n        if label_list is None:\n            label_list = processor.get_labels()\n            logger.info(\"Using label list %s for task %s\" % (label_list, task))\n        if output_mode is None:\n            output_mode = glue_output_modes[task]\n            logger.info(\"Using output mode %s for task %s\" % (output_mode, task))\n\n    label_map = {label: i for i, label in enumerate(label_list)}\n\n    def label_from_example(example: InputExample) -> Union[int, float, None]:\n        if example.label is None:\n            return None\n        if output_mode == \"classification\":\n            return label_map[example.label]\n        elif output_mode == \"regression\":\n            return float(example.label)\n        raise KeyError(output_mode)\n\n    labels = [label_from_example(example) for example in examples]\n\n    batch_encoding = tokenizer.batch_encode_plus(\n        [(example.text_a, example.text_b) for example in examples], max_length=max_length, pad_to_max_length=True,\n    )\n\n    features = []\n    for i in range(len(examples)):\n        inputs = {k: 
batch_encoding[k][i] for k in batch_encoding}\n\n        feature = InputFeatures(**inputs, label=labels[i])\n        features.append(feature)\n\n    for i, example in enumerate(examples[:5]):\n        logger.info(\"*** Example ***\")\n        logger.info(\"guid: %s\" % (example.guid))\n        logger.info(\"features: %s\" % features[i])\n\n    return features\n\n\nclass OutputMode(Enum):\n    classification = \"classification\"\n    regression = \"regression\"\n\n\nclass MrpcProcessor(DataProcessor):\n    \"\"\"Processor for the MRPC data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        logger.info(\"LOOKING AT {}\".format(os.path.join(data_dir, \"train.tsv\")))\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[3]\n            text_b = line[4]\n            label = None if set_type == \"test\" else line[0]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliProcessor(DataProcessor):\n    \"\"\"Processor for the MultiNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"premise\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"hypothesis\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_matched.tsv\")), \"dev_matched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_matched.tsv\")), \"test_matched\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n   
     for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[8]\n            text_b = line[9]\n            label = None if set_type.startswith(\"test\") else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliMismatchedProcessor(MnliProcessor):\n    \"\"\"Processor for the MultiNLI Mismatched data set (GLUE version).\"\"\"\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_mismatched.tsv\")), \"dev_mismatched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_mismatched.tsv\")), \"test_mismatched\")\n\n\nclass ColaProcessor(DataProcessor):\n    \"\"\"Processor for the CoLA data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        if test_mode:\n            lines = lines[1:]\n        text_index = 1 if test_mode else 3\n        examples = []\n        for (i, line) in enumerate(lines):\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if test_mode else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass Sst2Processor(DataProcessor):\n    \"\"\"Processor for the SST-2 data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return 
self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        text_index = 1 if set_type == \"test\" else 0\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if set_type == \"test\" else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass StsbProcessor(DataProcessor):\n    \"\"\"Processor for the STS-B data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [None]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[7]\n            text_b = line[8]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QqpProcessor(DataProcessor):\n    \"\"\"Processor for the QQP data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"question2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def 
_create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        q1_index = 1 if test_mode else 3\n        q2_index = 2 if test_mode else 4\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            try:\n                text_a = line[q1_index]\n                text_b = line[q2_index]\n                label = None if test_mode else line[5]\n            except IndexError:\n                continue\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QnliProcessor(DataProcessor):\n    \"\"\"Processor for the QNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", \"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass RteProcessor(DataProcessor):\n    \"\"\"Processor for the RTE data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", 
\"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass WnliProcessor(DataProcessor):\n    \"\"\"Processor for the WNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nglue_tasks_num_labels = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\nglue_processors = {\n    \"cola\": ColaProcessor,\n    \"mnli\": MnliProcessor,\n    \"mnli-mm\": MnliMismatchedProcessor,\n    \"mrpc\": MrpcProcessor,\n    \"sst-2\": Sst2Processor,\n    \"sts-b\": StsbProcessor,\n    \"qqp\": QqpProcessor,\n    \"qnli\": QnliProcessor,\n    \"rte\": RteProcessor,\n    \"wnli\": WnliProcessor,\n}\n\nglue_output_modes = {\n    \"cola\": \"classification\",\n    \"mnli\": \"classification\",\n    \"mnli-mm\": \"classification\",\n    \"mrpc\": \"classification\",\n    \"sst-2\": \"classification\",\n    \"sts-b\": \"regression\",\n    \"qqp\": \"classification\",\n    \"qnli\": \"classification\",\n    \"rte\": \"classification\",\n    \"wnli\": \"classification\",\n}\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/processors/squad.py",
    "content": "import json\nimport logging\nimport os\nfrom functools import partial\nfrom multiprocessing import Pool, cpu_count\n\nimport numpy as np\nfrom tqdm import tqdm\n\nfrom ...file_utils import is_tf_available, is_torch_available\nfrom ...tokenization_bert import whitespace_tokenize\nfrom .utils import DataProcessor\n\n\nif is_torch_available():\n    import torch\n    from torch.utils.data import TensorDataset\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):\n    \"\"\"Returns tokenized answer spans that better match the annotated answer.\"\"\"\n    tok_answer_text = \" \".join(tokenizer.tokenize(orig_answer_text))\n\n    for new_start in range(input_start, input_end + 1):\n        for new_end in range(input_end, new_start - 1, -1):\n            text_span = \" \".join(doc_tokens[new_start : (new_end + 1)])\n            if text_span == tok_answer_text:\n                return (new_start, new_end)\n\n    return (input_start, input_end)\n\n\ndef _check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span.start + doc_span.length - 1\n        if position < doc_span.start:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span.start\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _new_check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    # if len(doc_spans) == 1:\n    # return True\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span[\"start\"] + doc_span[\"length\"] - 1\n        if position < doc_span[\"start\"]:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span[\"start\"]\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span[\"length\"]\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _is_whitespace(c):\n    if c == \" \" or c == \"\\t\" or c == \"\\r\" or c == \"\\n\" or ord(c) == 0x202F:\n        return True\n    return False\n\n\ndef squad_convert_example_to_features(example, max_seq_length, doc_stride, max_query_length, is_training):\n    features = []\n    if is_training and not example.is_impossible:\n        # Get start and end position\n        start_position = example.start_position\n        end_position = example.end_position\n\n        # If the answer cannot be found in the text, then skip this example.\n        actual_text = \" \".join(example.doc_tokens[start_position : (end_position + 1)])\n        cleaned_answer_text = \" \".join(whitespace_tokenize(example.answer_text))\n        if actual_text.find(cleaned_answer_text) == -1:\n            logger.warning(\"Could not find 
answer: '%s' vs. '%s'\", actual_text, cleaned_answer_text)\n            return []\n\n    tok_to_orig_index = []\n    orig_to_tok_index = []\n    all_doc_tokens = []\n    for (i, token) in enumerate(example.doc_tokens):\n        orig_to_tok_index.append(len(all_doc_tokens))\n        sub_tokens = tokenizer.tokenize(token)\n        for sub_token in sub_tokens:\n            tok_to_orig_index.append(i)\n            all_doc_tokens.append(sub_token)\n\n    if is_training and not example.is_impossible:\n        tok_start_position = orig_to_tok_index[example.start_position]\n        if example.end_position < len(example.doc_tokens) - 1:\n            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1\n        else:\n            tok_end_position = len(all_doc_tokens) - 1\n\n        (tok_start_position, tok_end_position) = _improve_answer_span(\n            all_doc_tokens, tok_start_position, tok_end_position, tokenizer, example.answer_text\n        )\n\n    spans = []\n\n    truncated_query = tokenizer.encode(example.question_text, add_special_tokens=False, max_length=max_query_length)\n    sequence_added_tokens = (\n        tokenizer.max_len - tokenizer.max_len_single_sentence + 1\n        if \"roberta\" in str(type(tokenizer)) or \"camembert\" in str(type(tokenizer))\n        else tokenizer.max_len - tokenizer.max_len_single_sentence\n    )\n    sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair\n\n    span_doc_tokens = all_doc_tokens\n    while len(spans) * doc_stride < len(all_doc_tokens):\n\n        encoded_dict = tokenizer.encode_plus(\n            truncated_query if tokenizer.padding_side == \"right\" else span_doc_tokens,\n            span_doc_tokens if tokenizer.padding_side == \"right\" else truncated_query,\n            max_length=max_seq_length,\n            return_overflowing_tokens=True,\n            pad_to_max_length=True,\n            stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,\n            truncation_strategy=\"only_second\" if tokenizer.padding_side == \"right\" else \"only_first\",\n            return_token_type_ids=True,\n        )\n\n        paragraph_len = min(\n            len(all_doc_tokens) - len(spans) * doc_stride,\n            max_seq_length - len(truncated_query) - sequence_pair_added_tokens,\n        )\n\n        if tokenizer.pad_token_id in encoded_dict[\"input_ids\"]:\n            if tokenizer.padding_side == \"right\":\n                non_padded_ids = encoded_dict[\"input_ids\"][: encoded_dict[\"input_ids\"].index(tokenizer.pad_token_id)]\n            else:\n                last_padding_id_position = (\n                    len(encoded_dict[\"input_ids\"]) - 1 - encoded_dict[\"input_ids\"][::-1].index(tokenizer.pad_token_id)\n                )\n                non_padded_ids = encoded_dict[\"input_ids\"][last_padding_id_position + 1 :]\n\n        else:\n            non_padded_ids = encoded_dict[\"input_ids\"]\n\n        tokens = tokenizer.convert_ids_to_tokens(non_padded_ids)\n\n        token_to_orig_map = {}\n        for i in range(paragraph_len):\n            index = len(truncated_query) + sequence_added_tokens + i if tokenizer.padding_side == \"right\" else i\n            token_to_orig_map[index] = tok_to_orig_index[len(spans) * doc_stride + i]\n\n        encoded_dict[\"paragraph_len\"] = paragraph_len\n        encoded_dict[\"tokens\"] = tokens\n        encoded_dict[\"token_to_orig_map\"] = token_to_orig_map\n        
encoded_dict[\"truncated_query_with_special_tokens_length\"] = len(truncated_query) + sequence_added_tokens\n        encoded_dict[\"token_is_max_context\"] = {}\n        encoded_dict[\"start\"] = len(spans) * doc_stride\n        encoded_dict[\"length\"] = paragraph_len\n\n        spans.append(encoded_dict)\n\n        if \"overflowing_tokens\" not in encoded_dict:\n            break\n        span_doc_tokens = encoded_dict[\"overflowing_tokens\"]\n\n    for doc_span_index in range(len(spans)):\n        for j in range(spans[doc_span_index][\"paragraph_len\"]):\n            is_max_context = _new_check_is_max_context(spans, doc_span_index, doc_span_index * doc_stride + j)\n            index = (\n                j\n                if tokenizer.padding_side == \"left\"\n                else spans[doc_span_index][\"truncated_query_with_special_tokens_length\"] + j\n            )\n            spans[doc_span_index][\"token_is_max_context\"][index] = is_max_context\n\n    for span in spans:\n        # Identify the position of the CLS token\n        cls_index = span[\"input_ids\"].index(tokenizer.cls_token_id)\n\n        # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)\n        # Original TF implem also keep the classification token (set to 0)\n        p_mask = np.ones_like(span[\"token_type_ids\"])\n        if tokenizer.padding_side == \"right\":\n            p_mask[len(truncated_query) + sequence_added_tokens :] = 0\n        else:\n            p_mask[-len(span[\"tokens\"]) : -(len(truncated_query) + sequence_added_tokens)] = 0\n\n        pad_token_indices = np.where(span[\"input_ids\"] == tokenizer.pad_token_id)\n        special_token_indices = np.asarray(\n            tokenizer.get_special_tokens_mask(span[\"input_ids\"], already_has_special_tokens=True)\n        ).nonzero()\n\n        p_mask[pad_token_indices] = 1\n        p_mask[special_token_indices] = 1\n\n        # Set the cls index to 0: the CLS index can be used for impossible answers\n        p_mask[cls_index] = 0\n\n        span_is_impossible = example.is_impossible\n        start_position = 0\n        end_position = 0\n        if is_training and not span_is_impossible:\n            # For training, if our document chunk does not contain an annotation\n            # we throw it out, since there is nothing to predict.\n            doc_start = span[\"start\"]\n            doc_end = span[\"start\"] + span[\"length\"] - 1\n            out_of_span = False\n\n            if not (tok_start_position >= doc_start and tok_end_position <= doc_end):\n                out_of_span = True\n\n            if out_of_span:\n                start_position = cls_index\n                end_position = cls_index\n                span_is_impossible = True\n            else:\n                if tokenizer.padding_side == \"left\":\n                    doc_offset = 0\n                else:\n                    doc_offset = len(truncated_query) + sequence_added_tokens\n\n                start_position = tok_start_position - doc_start + doc_offset\n                end_position = tok_end_position - doc_start + doc_offset\n\n        features.append(\n            SquadFeatures(\n                span[\"input_ids\"],\n                span[\"attention_mask\"],\n                span[\"token_type_ids\"],\n                cls_index,\n                p_mask.tolist(),\n                example_index=0,  # Can not set unique_id and example_index here. 
They will be set after multiple processing.\n                unique_id=0,\n                paragraph_len=span[\"paragraph_len\"],\n                token_is_max_context=span[\"token_is_max_context\"],\n                tokens=span[\"tokens\"],\n                token_to_orig_map=span[\"token_to_orig_map\"],\n                start_position=start_position,\n                end_position=end_position,\n                is_impossible=span_is_impossible,\n                qas_id=example.qas_id,\n            )\n        )\n    return features\n\n\ndef squad_convert_example_to_features_init(tokenizer_for_convert):\n    global tokenizer\n    tokenizer = tokenizer_for_convert\n\n\ndef squad_convert_examples_to_features(\n    examples,\n    tokenizer,\n    max_seq_length,\n    doc_stride,\n    max_query_length,\n    is_training,\n    return_dataset=False,\n    threads=1,\n    tqdm_enabled=True,\n):\n    \"\"\"\n    Converts a list of examples into a list of features that can be directly given as input to a model.\n    It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.\n\n    Args:\n        examples: list of :class:`~transformers1.data.processors.squad.SquadExample`\n        tokenizer: an instance of a child of :class:`~transformers1.PreTrainedTokenizer`\n        max_seq_length: The maximum sequence length of the inputs.\n        doc_stride: The stride used when the context is too large and is split across several features.\n        max_query_length: The maximum length of the query.\n        is_training: whether to create features for model evaluation or model training.\n        return_dataset: Default False. Either 'pt' or 'tf'.\n            if 'pt': returns a torch.data.TensorDataset,\n            if 'tf': returns a tf.data.Dataset\n        threads: number of processes to use for multiprocessing\n\n\n    Returns:\n        list of :class:`~transformers1.data.processors.squad.SquadFeatures`\n\n    Example::\n\n        processor = SquadV2Processor()\n        examples = processor.get_dev_examples(data_dir)\n\n        features = squad_convert_examples_to_features(\n            examples=examples,\n            tokenizer=tokenizer,\n            max_seq_length=args.max_seq_length,\n            doc_stride=args.doc_stride,\n            max_query_length=args.max_query_length,\n            is_training=not evaluate,\n        )\n    \"\"\"\n\n    # Defining helper methods\n    features = []\n    threads = min(threads, cpu_count())\n    with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:\n        annotate_ = partial(\n            squad_convert_example_to_features,\n            max_seq_length=max_seq_length,\n            doc_stride=doc_stride,\n            max_query_length=max_query_length,\n            is_training=is_training,\n        )\n        features = list(\n            tqdm(\n                p.imap(annotate_, examples, chunksize=32),\n                total=len(examples),\n                desc=\"convert squad examples to features\",\n                disable=not tqdm_enabled,\n            )\n        )\n    new_features = []\n    unique_id = 1000000000\n    example_index = 0\n    for example_features in tqdm(\n        features, total=len(features), desc=\"add example index and unique id\", disable=not tqdm_enabled\n    ):\n        if not example_features:\n            continue\n        for example_feature in example_features:\n            example_feature.example_index = example_index\n            example_feature.unique_id = 
unique_id\n            new_features.append(example_feature)\n            unique_id += 1\n        example_index += 1\n    features = new_features\n    del new_features\n    if return_dataset == \"pt\":\n        if not is_torch_available():\n            raise RuntimeError(\"PyTorch must be installed to return a PyTorch dataset.\")\n\n        # Convert to Tensors and build dataset\n        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n        all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n        all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)\n        all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)\n        all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)\n        all_is_impossible = torch.tensor([f.is_impossible for f in features], dtype=torch.float)\n\n        if not is_training:\n            all_feature_index = torch.arange(all_input_ids.size(0), dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids, all_attention_masks, all_token_type_ids, all_feature_index, all_cls_index, all_p_mask\n            )\n        else:\n            all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)\n            all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids,\n                all_attention_masks,\n                all_token_type_ids,\n                all_start_positions,\n                all_end_positions,\n                all_cls_index,\n                all_p_mask,\n                all_is_impossible,\n            )\n\n        return features, dataset\n    elif return_dataset == \"tf\":\n        if not is_tf_available():\n            raise RuntimeError(\"TensorFlow must be installed to return a TensorFlow dataset.\")\n\n        def gen():\n            for i, ex in enumerate(features):\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                        \"feature_index\": i,\n                        \"qas_id\": ex.qas_id,\n                    },\n                    {\n                        \"start_position\": ex.start_position,\n                        \"end_position\": ex.end_position,\n                        \"cls_index\": ex.cls_index,\n                        \"p_mask\": ex.p_mask,\n                        \"is_impossible\": ex.is_impossible,\n                    },\n                )\n\n        # Why have we split the batch into a tuple? 
PyTorch just has a list of tensors.\n        train_types = (\n            {\n                \"input_ids\": tf.int32,\n                \"attention_mask\": tf.int32,\n                \"token_type_ids\": tf.int32,\n                \"feature_index\": tf.int64,\n                \"qas_id\": tf.string,\n            },\n            {\n                \"start_position\": tf.int64,\n                \"end_position\": tf.int64,\n                \"cls_index\": tf.int64,\n                \"p_mask\": tf.int32,\n                \"is_impossible\": tf.int32,\n            },\n        )\n\n        train_shapes = (\n            {\n                \"input_ids\": tf.TensorShape([None]),\n                \"attention_mask\": tf.TensorShape([None]),\n                \"token_type_ids\": tf.TensorShape([None]),\n                \"feature_index\": tf.TensorShape([]),\n                \"qas_id\": tf.TensorShape([]),\n            },\n            {\n                \"start_position\": tf.TensorShape([]),\n                \"end_position\": tf.TensorShape([]),\n                \"cls_index\": tf.TensorShape([]),\n                \"p_mask\": tf.TensorShape([None]),\n                \"is_impossible\": tf.TensorShape([]),\n            },\n        )\n\n        return tf.data.Dataset.from_generator(gen, train_types, train_shapes)\n    else:\n        return features\n\n\nclass SquadProcessor(DataProcessor):\n    \"\"\"\n    Processor for the SQuAD data set.\n    Overriden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.\n    \"\"\"\n\n    train_file = None\n    dev_file = None\n\n    def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False):\n        if not evaluate:\n            answer = tensor_dict[\"answers\"][\"text\"][0].numpy().decode(\"utf-8\")\n            answer_start = tensor_dict[\"answers\"][\"answer_start\"][0].numpy()\n            answers = []\n        else:\n            answers = [\n                {\"answer_start\": start.numpy(), \"text\": text.numpy().decode(\"utf-8\")}\n                for start, text in zip(tensor_dict[\"answers\"][\"answer_start\"], tensor_dict[\"answers\"][\"text\"])\n            ]\n\n            answer = None\n            answer_start = None\n\n        return SquadExample(\n            qas_id=tensor_dict[\"id\"].numpy().decode(\"utf-8\"),\n            question_text=tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            context_text=tensor_dict[\"context\"].numpy().decode(\"utf-8\"),\n            answer_text=answer,\n            start_position_character=answer_start,\n            title=tensor_dict[\"title\"].numpy().decode(\"utf-8\"),\n            answers=answers,\n        )\n\n    def get_examples_from_dataset(self, dataset, evaluate=False):\n        \"\"\"\n        Creates a list of :class:`~transformers1.data.processors.squad.SquadExample` using a TFDS dataset.\n\n        Args:\n            dataset: The tfds dataset loaded from `tensorflow_datasets.load(\"squad\")`\n            evaluate: boolean specifying if in evaluation mode or in training mode\n\n        Returns:\n            List of SquadExample\n\n        Examples::\n\n            import tensorflow_datasets as tfds\n            dataset = tfds.load(\"squad\")\n\n            training_examples = get_examples_from_dataset(dataset, evaluate=False)\n            evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)\n        \"\"\"\n\n        if evaluate:\n            dataset = dataset[\"validation\"]\n        else:\n            dataset = 
dataset[\"train\"]\n\n        examples = []\n        for tensor_dict in tqdm(dataset):\n            examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))\n\n        return examples\n\n    def get_train_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the training examples from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the training file has a different name than the original one\n                which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.train_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.train_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"train\")\n\n    def get_dev_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the evaluation examples from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the evaluation file has a different name than the original one\n                which is `dev-v1.1.json` and `dev-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.dev_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.dev_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"dev\")\n\n    def _create_examples(self, input_data, set_type):\n        is_training = set_type == \"train\"\n        examples = []\n        for entry in tqdm(input_data):\n            title = entry[\"title\"]\n            for paragraph in entry[\"paragraphs\"]:\n                context_text = paragraph[\"context\"]\n                for qa in paragraph[\"qas\"]:\n                    qas_id = qa[\"id\"]\n                    question_text = qa[\"question\"]\n                    start_position_character = None\n                    answer_text = None\n                    answers = []\n\n                    if \"is_impossible\" in qa:\n                        is_impossible = qa[\"is_impossible\"]\n                    else:\n                        is_impossible = False\n\n                    if not is_impossible:\n                        if is_training:\n                            answer = qa[\"answers\"][0]\n                            answer_text = answer[\"text\"]\n                            start_position_character = answer[\"answer_start\"]\n                        else:\n                            answers = qa[\"answers\"]\n\n                    example = SquadExample(\n                        qas_id=qas_id,\n                        question_text=question_text,\n                        context_text=context_text,\n                        answer_text=answer_text,\n   
                     start_position_character=start_position_character,\n                        title=title,\n                        is_impossible=is_impossible,\n                        answers=answers,\n                    )\n\n                    examples.append(example)\n        return examples\n\n\nclass SquadV1Processor(SquadProcessor):\n    train_file = \"train-v1.1.json\"\n    dev_file = \"dev-v1.1.json\"\n\n\nclass SquadV2Processor(SquadProcessor):\n    train_file = \"train-v2.0.json\"\n    dev_file = \"dev-v2.0.json\"\n\n\nclass SquadExample(object):\n    \"\"\"\n    A single training/test example for the Squad dataset, as loaded from disk.\n\n    Args:\n        qas_id: The example's unique identifier\n        question_text: The question string\n        context_text: The context string\n        answer_text: The answer string\n        start_position_character: The character position of the start of the answer\n        title: The title of the example\n        answers: None by default, this is used during evaluation. Holds answers as well as their start positions.\n        is_impossible: False by default, set to True if the example has no possible answer.\n    \"\"\"\n\n    def __init__(\n        self,\n        qas_id,\n        question_text,\n        context_text,\n        answer_text,\n        start_position_character,\n        title,\n        answers=[],\n        is_impossible=False,\n    ):\n        self.qas_id = qas_id\n        self.question_text = question_text\n        self.context_text = context_text\n        self.answer_text = answer_text\n        self.title = title\n        self.is_impossible = is_impossible\n        self.answers = answers\n\n        self.start_position, self.end_position = 0, 0\n\n        doc_tokens = []\n        char_to_word_offset = []\n        prev_is_whitespace = True\n\n        # Split on whitespace so that different tokens may be attributed to their original position.\n        for c in self.context_text:\n            if _is_whitespace(c):\n                prev_is_whitespace = True\n            else:\n                if prev_is_whitespace:\n                    doc_tokens.append(c)\n                else:\n                    doc_tokens[-1] += c\n                prev_is_whitespace = False\n            char_to_word_offset.append(len(doc_tokens) - 1)\n\n        self.doc_tokens = doc_tokens\n        self.char_to_word_offset = char_to_word_offset\n\n        # Start and end positions only has a value during evaluation.\n        if start_position_character is not None and not is_impossible:\n            self.start_position = char_to_word_offset[start_position_character]\n            self.end_position = char_to_word_offset[\n                min(start_position_character + len(answer_text) - 1, len(char_to_word_offset) - 1)\n            ]\n\n\nclass SquadFeatures(object):\n    \"\"\"\n    Single squad example features to be fed to a model.\n    Those features are model-specific and can be crafted from :class:`~transformers1.data.processors.squad.SquadExample`\n    using the :method:`~transformers1.data.processors.squad.squad_convert_examples_to_features` method.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n        token_type_ids: Segment token indices to indicate first and second portions of the inputs.\n        cls_index: the index of the CLS token.\n        p_mask: Mask identifying tokens that can be answers vs. 
tokens that cannot.\n            Mask with 1 for tokens than cannot be in the answer and 0 for token that can be in an answer\n        example_index: the index of the example\n        unique_id: The unique Feature identifier\n        paragraph_len: The length of the context\n        token_is_max_context: List of booleans identifying which tokens have their maximum context in this feature object.\n            If a token does not have their maximum context in this feature object, it means that another feature object\n            has more information related to that token and should be prioritized over this feature for that token.\n        tokens: list of tokens corresponding to the input ids\n        token_to_orig_map: mapping between the tokens and the original text, needed in order to identify the answer.\n        start_position: start of the answer token index\n        end_position: end of the answer token index\n    \"\"\"\n\n    def __init__(\n        self,\n        input_ids,\n        attention_mask,\n        token_type_ids,\n        cls_index,\n        p_mask,\n        example_index,\n        unique_id,\n        paragraph_len,\n        token_is_max_context,\n        tokens,\n        token_to_orig_map,\n        start_position,\n        end_position,\n        is_impossible,\n        qas_id: str = None,\n    ):\n        self.input_ids = input_ids\n        self.attention_mask = attention_mask\n        self.token_type_ids = token_type_ids\n        self.cls_index = cls_index\n        self.p_mask = p_mask\n\n        self.example_index = example_index\n        self.unique_id = unique_id\n        self.paragraph_len = paragraph_len\n        self.token_is_max_context = token_is_max_context\n        self.tokens = tokens\n        self.token_to_orig_map = token_to_orig_map\n\n        self.start_position = start_position\n        self.end_position = end_position\n        self.is_impossible = is_impossible\n        self.qas_id = qas_id\n\n\nclass SquadResult(object):\n    \"\"\"\n    Constructs a SquadResult which can be used to evaluate a model's output on the SQuAD dataset.\n\n    Args:\n        unique_id: The unique identifier corresponding to that example.\n        start_logits: The logits corresponding to the start of the answer\n        end_logits: The logits corresponding to the end of the answer\n    \"\"\"\n\n    def __init__(self, unique_id, start_logits, end_logits, start_top_index=None, end_top_index=None, cls_logits=None):\n        self.start_logits = start_logits\n        self.end_logits = end_logits\n        self.unique_id = unique_id\n\n        if start_top_index:\n            self.start_top_index = start_top_index\n            self.end_top_index = end_top_index\n            self.cls_logits = cls_logits\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/processors/utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport csv\nimport dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available, is_torch_available\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass InputExample:\n    \"\"\"\n    A single training/test example for simple sequence classification.\n\n    Args:\n        guid: Unique id for the example.\n        text_a: string. The untokenized text of the first sequence. For single\n            sequence tasks, only this sequence must be specified.\n        text_b: (Optional) string. The untokenized text of the second sequence.\n            Only must be specified for sequence pair tasks.\n        label: (Optional) string. The label of the example. This should be\n            specified for train and dev examples, but not for test examples.\n    \"\"\"\n\n    guid: str\n    text_a: str\n    text_b: Optional[str] = None\n    label: Optional[str] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2) + \"\\n\"\n\n\n@dataclass(frozen=True)\nclass InputFeatures:\n    \"\"\"\n    A single set of features of data.\n    Property names are the same names as the corresponding inputs to a model.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            Usually  ``1`` for tokens that are NOT MASKED, ``0`` for MASKED (padded) tokens.\n        token_type_ids: (Optional) Segment token indices to indicate first and second\n            portions of the inputs. Only some models use them.\n        label: (Optional) Label corresponding to the input. 
Int for classification problems,\n            float for regression problems.\n    \"\"\"\n\n    input_ids: List[int]\n    attention_mask: Optional[List[int]] = None\n    token_type_ids: Optional[List[int]] = None\n    label: Optional[Union[int, float]] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self)) + \"\\n\"\n\n\nclass DataProcessor:\n    \"\"\"Base class for data converters for sequence classification data sets.\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"Gets an example from a dict with tensorflow tensors\n        Args:\n            tensor_dict: Keys and values should match the corresponding Glue\n                tensorflow_dataset examples.\n        \"\"\"\n        raise NotImplementedError()\n\n    def get_train_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the train set.\"\"\"\n        raise NotImplementedError()\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the dev set.\"\"\"\n        raise NotImplementedError()\n\n    def get_test_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the test set.\"\"\"\n        raise NotImplementedError()\n\n    def get_labels(self):\n        \"\"\"Gets the list of labels for this data set.\"\"\"\n        raise NotImplementedError()\n\n    def tfds_map(self, example):\n        \"\"\"Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are.\n        This method converts examples to the correct format.\"\"\"\n        if len(self.get_labels()) > 1:\n            example.label = self.get_labels()[int(example.label)]\n        return example\n\n    @classmethod\n    def _read_tsv(cls, input_file, quotechar=None):\n        \"\"\"Reads a tab separated value file.\"\"\"\n        with open(input_file, \"r\", encoding=\"utf-8-sig\") as f:\n            return list(csv.reader(f, delimiter=\"\\t\", quotechar=quotechar))\n\n\nclass SingleSentenceClassificationProcessor(DataProcessor):\n    \"\"\" Generic processor for a single sentence classification data set.\"\"\"\n\n    def __init__(self, labels=None, examples=None, mode=\"classification\", verbose=False):\n        self.labels = [] if labels is None else labels\n        self.examples = [] if examples is None else examples\n        self.mode = mode\n        self.verbose = verbose\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, idx):\n        if isinstance(idx, slice):\n            return SingleSentenceClassificationProcessor(labels=self.labels, examples=self.examples[idx])\n        return self.examples[idx]\n\n    @classmethod\n    def create_from_csv(\n        cls, file_name, split_name=\"\", column_label=0, column_text=1, column_id=None, skip_first_row=False, **kwargs\n    ):\n        processor = cls(**kwargs)\n        processor.add_examples_from_csv(\n            file_name,\n            split_name=split_name,\n            column_label=column_label,\n            column_text=column_text,\n            column_id=column_id,\n            skip_first_row=skip_first_row,\n            overwrite_labels=True,\n            overwrite_examples=True,\n        )\n        return processor\n\n    @classmethod\n    def create_from_examples(cls, texts_or_text_and_labels, labels=None, **kwargs):\n        processor = cls(**kwargs)\n        processor.add_examples(texts_or_text_and_labels, 
labels=labels)\n        return processor\n\n    def add_examples_from_csv(\n        self,\n        file_name,\n        split_name=\"\",\n        column_label=0,\n        column_text=1,\n        column_id=None,\n        skip_first_row=False,\n        overwrite_labels=False,\n        overwrite_examples=False,\n    ):\n        lines = self._read_tsv(file_name)\n        if skip_first_row:\n            lines = lines[1:]\n        texts = []\n        labels = []\n        ids = []\n        for (i, line) in enumerate(lines):\n            texts.append(line[column_text])\n            labels.append(line[column_label])\n            if column_id is not None:\n                ids.append(line[column_id])\n            else:\n                guid = \"%s-%s\" % (split_name, i) if split_name else \"%s\" % i\n                ids.append(guid)\n\n        return self.add_examples(\n            texts, labels, ids, overwrite_labels=overwrite_labels, overwrite_examples=overwrite_examples\n        )\n\n    def add_examples(\n        self, texts_or_text_and_labels, labels=None, ids=None, overwrite_labels=False, overwrite_examples=False\n    ):\n        assert labels is None or len(texts_or_text_and_labels) == len(labels)\n        assert ids is None or len(texts_or_text_and_labels) == len(ids)\n        if ids is None:\n            ids = [None] * len(texts_or_text_and_labels)\n        if labels is None:\n            labels = [None] * len(texts_or_text_and_labels)\n        examples = []\n        added_labels = set()\n        for (text_or_text_and_label, label, guid) in zip(texts_or_text_and_labels, labels, ids):\n            if isinstance(text_or_text_and_label, (tuple, list)) and label is None:\n                text, label = text_or_text_and_label\n            else:\n                text = text_or_text_and_label\n            added_labels.add(label)\n            examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))\n\n        # Update examples\n        if overwrite_examples:\n            self.examples = examples\n        else:\n            self.examples.extend(examples)\n\n        # Update labels\n        if overwrite_labels:\n            self.labels = list(added_labels)\n        else:\n            self.labels = list(set(self.labels).union(added_labels))\n\n        return self.examples\n\n    def get_features(\n        self,\n        tokenizer,\n        max_length=None,\n        pad_on_left=False,\n        pad_token=0,\n        mask_padding_with_zero=True,\n        return_tensors=None,\n    ):\n        \"\"\"\n        Convert examples in a list of ``InputFeatures``\n\n        Args:\n            tokenizer: Instance of a tokenizer that will tokenize the examples\n            max_length: Maximum example length\n            task: GLUE task\n            label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n            output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n            pad_on_left: If set to ``True``, the examples will be padded on the left rather than on the right (default)\n            pad_token: Padding token\n            mask_padding_with_zero: If set to ``True``, the attention mask will be filled by ``1`` for actual values\n                and by ``0`` for padded values. 
If set to ``False``, inverts it (``1`` for padded values, ``0`` for\n                actual values)\n\n        Returns:\n            If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n            containing the task-specific features. If the input is a list of ``InputExamples``, will return\n            a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n        \"\"\"\n        if max_length is None:\n            max_length = tokenizer.max_len\n\n        label_map = {label: i for i, label in enumerate(self.labels)}\n\n        all_input_ids = []\n        for (ex_index, example) in enumerate(self.examples):\n            if ex_index % 10000 == 0:\n                logger.info(\"Tokenizing example %d\", ex_index)\n\n            input_ids = tokenizer.encode(\n                example.text_a, add_special_tokens=True, max_length=min(max_length, tokenizer.max_len),\n            )\n            all_input_ids.append(input_ids)\n\n        batch_length = max(len(input_ids) for input_ids in all_input_ids)\n\n        features = []\n        for (ex_index, (input_ids, example)) in enumerate(zip(all_input_ids, self.examples)):\n            if ex_index % 10000 == 0:\n                logger.info(\"Writing example %d/%d\" % (ex_index, len(self.examples)))\n            # The mask has 1 for real tokens and 0 for padding tokens. Only real\n            # tokens are attended to.\n            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)\n\n            # Zero-pad up to the sequence length.\n            padding_length = batch_length - len(input_ids)\n            if pad_on_left:\n                input_ids = ([pad_token] * padding_length) + input_ids\n                attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask\n            else:\n                input_ids = input_ids + ([pad_token] * padding_length)\n                attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)\n\n            assert len(input_ids) == batch_length, \"Error with input length {} vs {}\".format(\n                len(input_ids), batch_length\n            )\n            assert len(attention_mask) == batch_length, \"Error with input length {} vs {}\".format(\n                len(attention_mask), batch_length\n            )\n\n            if self.mode == \"classification\":\n                label = label_map[example.label]\n            elif self.mode == \"regression\":\n                label = float(example.label)\n            else:\n                raise ValueError(self.mode)\n\n            if ex_index < 5 and self.verbose:\n                logger.info(\"*** Example ***\")\n                logger.info(\"guid: %s\" % (example.guid))\n                logger.info(\"input_ids: %s\" % \" \".join([str(x) for x in input_ids]))\n                logger.info(\"attention_mask: %s\" % \" \".join([str(x) for x in attention_mask]))\n                logger.info(\"label: %s (id = %d)\" % (example.label, label))\n\n            features.append(InputFeatures(input_ids=input_ids, attention_mask=attention_mask, label=label))\n\n        if return_tensors is None:\n            return features\n        elif return_tensors == \"tf\":\n            if not is_tf_available():\n                raise RuntimeError(\"return_tensors set to 'tf' but TensorFlow 2.0 can't be imported\")\n            import tensorflow as tf\n\n            def gen():\n                for ex in features:\n                    yield 
({\"input_ids\": ex.input_ids, \"attention_mask\": ex.attention_mask}, ex.label)\n\n            dataset = tf.data.Dataset.from_generator(\n                gen,\n                ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32}, tf.int64),\n                ({\"input_ids\": tf.TensorShape([None]), \"attention_mask\": tf.TensorShape([None])}, tf.TensorShape([])),\n            )\n            return dataset\n        elif return_tensors == \"pt\":\n            if not is_torch_available():\n                raise RuntimeError(\"return_tensors set to 'pt' but PyTorch can't be imported\")\n            import torch\n            from torch.utils.data import TensorDataset\n\n            all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n            all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n            if self.mode == \"classification\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            elif self.mode == \"regression\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.float)\n\n            dataset = TensorDataset(all_input_ids, all_attention_mask, all_labels)\n            return dataset\n        else:\n            raise ValueError(\"return_tensors should be one of 'tf' or 'pt'\")\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/data/processors/xnli.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XNLI utils (dataset loading and evaluation) \"\"\"\n\n\nimport logging\nimport os\n\nfrom .utils import DataProcessor, InputExample\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass XnliProcessor(DataProcessor):\n    \"\"\"Processor for the XNLI dataset.\n    Adapted from https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/run_classifier.py#L207\"\"\"\n\n    def __init__(self, language, train_language=None):\n        self.language = language\n        self.train_language = train_language\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lg = self.language if self.train_language is None else self.train_language\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-MT-1.0/multinli/multinli.train.{}.tsv\".format(lg)))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (\"train\", i)\n            text_a = line[0]\n            text_b = line[1]\n            label = \"contradiction\" if line[2] == \"contradictory\" else line[2]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-1.0/xnli.test.tsv\"))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            language = line[0]\n            if language != self.language:\n                continue\n            guid = \"%s-%s\" % (\"test\", i)\n            text_a = line[6]\n            text_b = line[7]\n            label = line[1]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n\nxnli_processors = {\n    \"xnli\": XnliProcessor,\n}\n\nxnli_output_modes = {\n    \"xnli\": \"classification\",\n}\n\nxnli_tasks_num_labels = {\n    \"xnli\": 3,\n}\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/file.py",
    "content": ""
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/file_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport fnmatch\nimport json\nimport logging\nimport os\nimport shutil\nimport sys\nimport tarfile\nimport tempfile\nfrom contextlib import contextmanager\nfrom functools import partial, wraps\nfrom hashlib import sha256\nfrom pathlib import Path\nfrom typing import Optional\nfrom urllib.parse import urlparse\nfrom zipfile import ZipFile, is_zipfile\n\nimport requests\nfrom filelock import FileLock\nfrom tqdm.auto import tqdm\n\nfrom . import __version__\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n    if USE_TORCH in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TF not in (\"1\", \"ON\", \"YES\"):\n        import torch\n\n        _torch_available = True  # pylint: disable=invalid-name\n        logger.info(\"PyTorch version {} available.\".format(torch.__version__))\n    else:\n        logger.info(\"Disabling PyTorch because USE_TF is set\")\n        _torch_available = False\nexcept ImportError:\n    _torch_available = False  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n\n    if USE_TF in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TORCH not in (\"1\", \"ON\", \"YES\"):\n        import tensorflow as tf\n\n        assert hasattr(tf, \"__version__\") and int(tf.__version__[0]) >= 2\n        _tf_available = True  # pylint: disable=invalid-name\n        logger.info(\"TensorFlow version {} available.\".format(tf.__version__))\n    else:\n        logger.info(\"Disabling Tensorflow because USE_TORCH is set\")\n        _tf_available = False\nexcept (ImportError, AssertionError):\n    _tf_available = False  # pylint: disable=invalid-name\n\n\ntry:\n    from torch.hub import _get_torch_home\n\n    torch_cache_home = _get_torch_home()\nexcept ImportError:\n    torch_cache_home = os.path.expanduser(\n        os.getenv(\"TORCH_HOME\", os.path.join(os.getenv(\"XDG_CACHE_HOME\", \"~/.cache\"), \"torch\"))\n    )\ndefault_cache_path = os.path.join(torch_cache_home, \"transformers1\")\n\n\nPYTORCH_PRETRAINED_BERT_CACHE = os.getenv(\"PYTORCH_PRETRAINED_BERT_CACHE\", default_cache_path)\nPYTORCH_TRANSFORMERS_CACHE = os.getenv(\"PYTORCH_TRANSFORMERS_CACHE\", PYTORCH_PRETRAINED_BERT_CACHE)\nTRANSFORMERS_CACHE = os.getenv(\"TRANSFORMERS_CACHE\", PYTORCH_TRANSFORMERS_CACHE)\n\nWEIGHTS_NAME = \"pytorch_model.bin\"\nTF2_WEIGHTS_NAME = \"tf_model.h5\"\nTF_WEIGHTS_NAME = \"model.ckpt\"\nCONFIG_NAME = \"config.json\"\nMODEL_CARD_NAME = \"modelcard.json\"\n\n\nMULTIPLE_CHOICE_DUMMY_INPUTS = [[[0], [1]], [[0], [1]]]\nDUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]\nDUMMY_MASK = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]\n\nS3_BUCKET_PREFIX = \"https://s3.amazonaws.com/models.huggingface.co/bert\"\nCLOUDFRONT_DISTRIB_PREFIX = \"https://cdn.huggingface.co\"\n\n\ndef is_torch_available():\n    return _torch_available\n\n\ndef is_tf_available():\n    return _tf_available\n\n\ndef add_start_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef 
add_start_docstrings_to_callable(*docstr):\n    def docstring_decorator(fn):\n        class_name = \":class:`~transformers1.{}`\".format(fn.__qualname__.split(\".\")[0])\n        intro = \"   The {} forward method, overrides the :func:`__call__` special method.\".format(class_name)\n        note = r\"\"\"\n\n    .. note::\n        Although the recipe for forward pass needs to be defined within\n        this function, one should call the :class:`Module` instance afterwards\n        instead of this since the former takes care of running the\n        pre and post processing steps while the latter silently ignores them.\n        \"\"\"\n        fn.__doc__ = intro + note + \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef add_end_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = fn.__doc__ + \"\".join(docstr)\n        return fn\n\n    return docstring_decorator\n\n\ndef is_remote_url(url_or_filename):\n    parsed = urlparse(url_or_filename)\n    return parsed.scheme in (\"http\", \"https\")\n\n\ndef hf_bucket_url(model_id: str, filename: str, use_cdn=True) -> str:\n    \"\"\"\n    Resolve a model identifier, and a file name, to a HF-hosted url\n    on either S3 or Cloudfront (a Content Delivery Network, or CDN).\n\n    Cloudfront is replicated over the globe so downloads are way faster\n    for the end user (and it also lowers our bandwidth costs). However, it\n    is more aggressively cached by default, so may not always reflect the\n    latest changes to the underlying file (default TTL is 24 hours).\n\n    In terms of client-side caching from this library, even though\n    Cloudfront relays the ETags from S3, using one or the other\n    (or switching from one to the other) will affect caching: cached files\n    are not shared between the two because the cached file's name contains\n    a hash of the url.\n    \"\"\"\n    endpoint = CLOUDFRONT_DISTRIB_PREFIX if use_cdn else S3_BUCKET_PREFIX\n    legacy_format = \"/\" not in model_id\n    if legacy_format:\n        return f\"{endpoint}/{model_id}-{filename}\"\n    else:\n        return f\"{endpoint}/{model_id}/{filename}\"\n\n\ndef url_to_filename(url, etag=None):\n    \"\"\"\n    Convert `url` into a hashed filename in a repeatable way.\n    If `etag` is specified, append its hash to the url's, delimited\n    by a period.\n    If the url ends with .h5 (Keras HDF5 weights) adds '.h5' to the name\n    so that TF 2.0 can identify it as a HDF5 file\n    (see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1380)\n    \"\"\"\n    url_bytes = url.encode(\"utf-8\")\n    url_hash = sha256(url_bytes)\n    filename = url_hash.hexdigest()\n\n    if etag:\n        etag_bytes = etag.encode(\"utf-8\")\n        etag_hash = sha256(etag_bytes)\n        filename += \".\" + etag_hash.hexdigest()\n\n    if url.endswith(\".h5\"):\n        filename += \".h5\"\n\n    return filename\n\n\ndef filename_to_url(filename, cache_dir=None):\n    \"\"\"\n    Return the url and etag (which may be ``None``) stored for `filename`.\n    Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    cache_path = os.path.join(cache_dir, filename)\n    if not os.path.exists(cache_path):\n        raise 
EnvironmentError(\"file {} not found\".format(cache_path))\n\n    meta_path = cache_path + \".json\"\n    if not os.path.exists(meta_path):\n        raise EnvironmentError(\"file {} not found\".format(meta_path))\n\n    with open(meta_path, encoding=\"utf-8\") as meta_file:\n        metadata = json.load(meta_file)\n    url = metadata[\"url\"]\n    etag = metadata[\"etag\"]\n\n    return url, etag\n\n\ndef cached_path(\n    url_or_filename,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    resume_download=False,\n    user_agent=None,\n    extract_compressed_file=False,\n    force_extract=False,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given something that might be a URL (or might be a local path),\n    determine which. If it's a URL, download the file and cache it, and\n    return the path to the cached file. If it's already a local path,\n    make sure the file exists and then return the path.\n    Args:\n        cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).\n        force_download: if True, re-dowload the file even if it's already cached in the cache dir.\n        resume_download: if True, resume the download if incompletly recieved file is found.\n        user_agent: Optional string or dict that will be appended to the user-agent on remote requests.\n        extract_compressed_file: if True and the path point to a zip or tar file, extract the compressed\n            file in a folder along the archive.\n        force_extract: if True when extract_compressed_file is True and the archive was already extracted,\n            re-extract the archive and overide the folder where it was extracted.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(url_or_filename, Path):\n        url_or_filename = str(url_or_filename)\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    if is_remote_url(url_or_filename):\n        # URL, so get it from the cache (downloading if necessary)\n        output_path = get_from_cache(\n            url_or_filename,\n            cache_dir=cache_dir,\n            force_download=force_download,\n            proxies=proxies,\n            resume_download=resume_download,\n            user_agent=user_agent,\n            local_files_only=local_files_only,\n        )\n    elif os.path.exists(url_or_filename):\n        # File, and it exists.\n        output_path = url_or_filename\n    elif urlparse(url_or_filename).scheme == \"\":\n        # File, but it doesn't exist.\n        raise EnvironmentError(\"file {} not found\".format(url_or_filename))\n    else:\n        # Something unknown\n        raise ValueError(\"unable to parse {} as a URL or as a local path\".format(url_or_filename))\n\n    if extract_compressed_file:\n        if not is_zipfile(output_path) and not tarfile.is_tarfile(output_path):\n            return output_path\n\n        # Path where we extract compressed archives\n        # We avoid '.' 
in dir name and add \"-extracted\" at the end: \"./model.zip\" => \"./model-zip-extracted/\"\n        output_dir, output_file = os.path.split(output_path)\n        output_extract_dir_name = output_file.replace(\".\", \"-\") + \"-extracted\"\n        output_path_extracted = os.path.join(output_dir, output_extract_dir_name)\n\n        if os.path.isdir(output_path_extracted) and os.listdir(output_path_extracted) and not force_extract:\n            return output_path_extracted\n\n        # Prevent parallel extractions\n        lock_path = output_path + \".lock\"\n        with FileLock(lock_path):\n            shutil.rmtree(output_path_extracted, ignore_errors=True)\n            os.makedirs(output_path_extracted)\n            if is_zipfile(output_path):\n                with ZipFile(output_path, \"r\") as zip_file:\n                    zip_file.extractall(output_path_extracted)\n                    zip_file.close()\n            elif tarfile.is_tarfile(output_path):\n                tar_file = tarfile.open(output_path)\n                tar_file.extractall(output_path_extracted)\n                tar_file.close()\n            else:\n                raise EnvironmentError(\"Archive format of {} could not be identified\".format(output_path))\n\n        return output_path_extracted\n\n    return output_path\n\n\ndef http_get(url, temp_file, proxies=None, resume_size=0, user_agent=None):\n    ua = \"transformers1/{}; python/{}\".format(__version__, sys.version.split()[0])\n    if is_torch_available():\n        ua += \"; torch/{}\".format(torch.__version__)\n    if is_tf_available():\n        ua += \"; tensorflow/{}\".format(tf.__version__)\n    if isinstance(user_agent, dict):\n        ua += \"; \" + \"; \".join(\"{}/{}\".format(k, v) for k, v in user_agent.items())\n    elif isinstance(user_agent, str):\n        ua += \"; \" + user_agent\n    headers = {\"user-agent\": ua}\n    if resume_size > 0:\n        headers[\"Range\"] = \"bytes=%d-\" % (resume_size,)\n    response = requests.get(url, stream=True, proxies=proxies, headers=headers)\n    if response.status_code == 416:  # Range not satisfiable\n        return\n    content_length = response.headers.get(\"Content-Length\")\n    total = resume_size + int(content_length) if content_length is not None else None\n    progress = tqdm(\n        unit=\"B\",\n        unit_scale=True,\n        total=total,\n        initial=resume_size,\n        desc=\"Downloading\",\n        disable=bool(logger.getEffectiveLevel() == logging.NOTSET),\n    )\n    for chunk in response.iter_content(chunk_size=1024):\n        if chunk:  # filter out keep-alive new chunks\n            progress.update(len(chunk))\n            temp_file.write(chunk)\n    progress.close()\n\n\ndef get_from_cache(\n    url,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    etag_timeout=10,\n    resume_download=False,\n    user_agent=None,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given a URL, look for the corresponding file in the local cache.\n    If it's not there, download it. 
Then return the path to the cached file.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    os.makedirs(cache_dir, exist_ok=True)\n\n    etag = None\n    if not local_files_only:\n        try:\n            response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)\n            if response.status_code == 200:\n                etag = response.headers.get(\"ETag\")\n        except (EnvironmentError, requests.exceptions.Timeout):\n            # etag is already None\n            pass\n\n    filename = url_to_filename(url, etag)\n\n    # get cache path to put the file\n    cache_path = os.path.join(cache_dir, filename)\n\n    # etag is None = we don't have a connection, or url doesn't exist, or is otherwise inaccessible.\n    # try to get the last downloaded one\n    if etag is None:\n        if os.path.exists(cache_path):\n            return cache_path\n        else:\n            matching_files = [\n                file\n                for file in fnmatch.filter(os.listdir(cache_dir), filename + \".*\")\n                if not file.endswith(\".json\") and not file.endswith(\".lock\")\n            ]\n            if len(matching_files) > 0:\n                return os.path.join(cache_dir, matching_files[-1])\n            else:\n                # If files cannot be found and local_files_only=True,\n                # the models might've been found if local_files_only=False\n                # Notify the user about that\n                if local_files_only:\n                    raise ValueError(\n                        \"Cannot find the requested files in the cached path and outgoing traffic has been\"\n                        \" disabled. 
To enable model look-ups and downloads online, set 'local_files_only'\"\n                        \" to False.\"\n                    )\n                return None\n\n    # From now on, etag is not None.\n    if os.path.exists(cache_path) and not force_download:\n        return cache_path\n\n    # Prevent parallel downloads of the same file with a lock.\n    lock_path = cache_path + \".lock\"\n    with FileLock(lock_path):\n\n        # If the download just completed while the lock was activated.\n        if os.path.exists(cache_path) and not force_download:\n            # Even if returning early like here, the lock will be released.\n            return cache_path\n\n        if resume_download:\n            incomplete_path = cache_path + \".incomplete\"\n\n            @contextmanager\n            def _resumable_file_manager():\n                with open(incomplete_path, \"a+b\") as f:\n                    yield f\n\n            temp_file_manager = _resumable_file_manager\n            if os.path.exists(incomplete_path):\n                resume_size = os.stat(incomplete_path).st_size\n            else:\n                resume_size = 0\n        else:\n            temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False)\n            resume_size = 0\n\n        # Download to temporary file, then copy to cache dir once finished.\n        # Otherwise you get corrupt cache entries if the download gets interrupted.\n        with temp_file_manager() as temp_file:\n            logger.info(\"%s not found in cache or force_download set to True, downloading to %s\", url, temp_file.name)\n\n            http_get(url, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)\n\n        logger.info(\"storing %s in cache at %s\", url, cache_path)\n        os.replace(temp_file.name, cache_path)\n\n        logger.info(\"creating metadata file for %s\", cache_path)\n        meta = {\"url\": url, \"etag\": etag}\n        meta_path = cache_path + \".json\"\n        with open(meta_path, \"w\") as meta_file:\n            json.dump(meta, meta_file)\n\n    return cache_path\n\n\nclass cached_property(property):\n    \"\"\"\n    Descriptor that mimics @property but caches output in member variable.\n\n    From tensorflow_datasets\n\n    Built-in in functools from Python 3.8.\n    \"\"\"\n\n    def __get__(self, obj, objtype=None):\n        # See docs.python.org/3/howto/descriptor.html#properties\n        if obj is None:\n            return self\n        if self.fget is None:\n            raise AttributeError(\"unreadable attribute\")\n        attr = \"__cached_\" + self.fget.__name__\n        cached = getattr(obj, attr, None)\n        if cached is None:\n            cached = self.fget(obj)\n            setattr(obj, attr, cached)\n        return cached\n\n\ndef torch_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_torch_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires PyTorch.\")\n\n    return wrapper\n\n\ndef tf_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_tf_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires TF.\")\n\n    return wrapper\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/filep.py",
    "content": "from transformers import GPT2LMHeadModel, GPT2Tokenizer\nimport torch\n\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\nmodel = GPT2LMHeadModel.from_pretrained('gpt2')\n\ngenerated = tokenizer.encode(\"The Manhattan bridge\")\ncontext = torch.tensor([generated])\npast = None\n\nfor i in range(15):\n    output, past = model(context, past=past)\n\n    distribution = output[0, :]\n\n    # Get the top 10 values' indices and cast them to a list\n    top_values = distribution[-1].topk(10).indices.tolist()\n\n    # Decode those into words\n    top_words = [tokenizer.decode([x]) for x in top_values.indices.tolist()]\n\n    # select words (only arbitrarily select the first three)\n    words = words[0:3]\n\n    # Cast them back to tokens which can be used as an added token\n    selected_tokens = [tokenizer.encode(word) for word in words]\n\n    generated += [argmax_token.tolist()]\n    context = argmax_token.unsqueeze(0)\n\n    print(tokenizer.decode([argmax_token.tolist()]))\n\nsequence = tokenizer.decode(generated)\n\nprint(sequence)"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/hf_api.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport io\nimport os\nfrom os.path import expanduser\nfrom typing import Dict, List, Optional, Tuple\n\nimport requests\nfrom tqdm import tqdm\n\n\nENDPOINT = \"https://huggingface.co\"\n\n\nclass S3Obj:\n    \"\"\"\n    Data structure that represents a file belonging to the current user.\n    \"\"\"\n\n    def __init__(self, filename: str, LastModified: str, ETag: str, Size: int, **kwargs):\n        self.filename = filename\n        self.LastModified = LastModified\n        self.ETag = ETag\n        self.Size = Size\n\n\nclass PresignedUrl:\n    def __init__(self, write: str, access: str, type: str, **kwargs):\n        self.write = write\n        self.access = access\n        self.type = type  # mime-type to send to S3.\n\n\nclass S3Object:\n    \"\"\"\n    Data structure that represents a public file accessible on our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        key: str,  # S3 object key\n        etag: str,\n        lastModified: str,\n        size: int,\n        rfilename: str,  # filename relative to config.json\n        **kwargs\n    ):\n        self.key = key\n        self.etag = etag\n        self.lastModified = lastModified\n        self.size = size\n        self.rfilename = rfilename\n\n\nclass ModelInfo:\n    \"\"\"\n    Info about a public model accessible from our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        modelId: str,  # id of model\n        key: str,  # S3 object key of config.json\n        author: Optional[str] = None,\n        downloads: Optional[int] = None,\n        tags: List[str] = [],\n        siblings: List[Dict] = [],  # list of files that constitute the model\n        **kwargs\n    ):\n        self.modelId = modelId\n        self.key = key\n        self.author = author\n        self.downloads = downloads\n        self.tags = tags\n        self.siblings = [S3Object(**x) for x in siblings]\n\n\nclass HfApi:\n    def __init__(self, endpoint=None):\n        self.endpoint = endpoint if endpoint is not None else ENDPOINT\n\n    def login(self, username: str, password: str) -> str:\n        \"\"\"\n        Call HF API to sign in a user and get a token if credentials are valid.\n\n        Outputs:\n            token if credentials are valid\n\n        Throws:\n            requests.exceptions.HTTPError if credentials are invalid\n        \"\"\"\n        path = \"{}/api/login\".format(self.endpoint)\n        r = requests.post(path, json={\"username\": username, \"password\": password})\n        r.raise_for_status()\n        d = r.json()\n        return d[\"token\"]\n\n    def whoami(self, token: str) -> Tuple[str, List[str]]:\n        \"\"\"\n        Call HF API to know \"whoami\"\n        \"\"\"\n        path = \"{}/api/whoami\".format(self.endpoint)\n        r = requests.get(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        
return d[\"user\"], d[\"orgs\"]\n\n    def logout(self, token: str) -> None:\n        \"\"\"\n        Call HF API to log out.\n        \"\"\"\n        path = \"{}/api/logout\".format(self.endpoint)\n        r = requests.post(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n\n    def presign(self, token: str, filename: str, organization: Optional[str] = None) -> PresignedUrl:\n        \"\"\"\n        Call HF API to get a presigned url to upload `filename` to S3.\n        \"\"\"\n        path = \"{}/api/presign\".format(self.endpoint)\n        r = requests.post(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n        d = r.json()\n        return PresignedUrl(**d)\n\n    def presign_and_upload(self, token: str, filename: str, filepath: str, organization: Optional[str] = None) -> str:\n        \"\"\"\n        Get a presigned url, then upload file to S3.\n\n        Outputs:\n            url: Read-only url for the stored file on S3.\n        \"\"\"\n        urls = self.presign(token, filename=filename, organization=organization)\n        # streaming upload:\n        # https://2.python-requests.org/en/master/user/advanced/#streaming-uploads\n        #\n        # Even though we presign with the correct content-type,\n        # the client still has to specify it when uploading the file.\n        with open(filepath, \"rb\") as f:\n            pf = TqdmProgressFileReader(f)\n            data = f if pf.total_size > 0 else \"\"\n\n            r = requests.put(urls.write, data=data, headers={\"content-type\": urls.type})\n            r.raise_for_status()\n            pf.close()\n        return urls.access\n\n    def list_objs(self, token: str, organization: Optional[str] = None) -> List[S3Obj]:\n        \"\"\"\n        Call HF API to list all stored files for user (or one of their organizations).\n        \"\"\"\n        path = \"{}/api/listObjs\".format(self.endpoint)\n        params = {\"organization\": organization} if organization is not None else None\n        r = requests.get(path, params=params, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        return [S3Obj(**x) for x in d]\n\n    def delete_obj(self, token: str, filename: str, organization: Optional[str] = None):\n        \"\"\"\n        Call HF API to delete a file stored by user\n        \"\"\"\n        path = \"{}/api/deleteObj\".format(self.endpoint)\n        r = requests.delete(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n\n    def model_list(self) -> List[ModelInfo]:\n        \"\"\"\n        Get the public list of all the models on huggingface, including the community models\n        \"\"\"\n        path = \"{}/api/models\".format(self.endpoint)\n        r = requests.get(path)\n        r.raise_for_status()\n        d = r.json()\n        return [ModelInfo(**x) for x in d]\n\n\nclass TqdmProgressFileReader:\n    \"\"\"\n    Wrap an io.BufferedReader `f` (such as the output of `open(…, \"rb\")`)\n    and override `f.read()` so as to display a tqdm progress bar.\n\n    see github.com/huggingface/transformers1/pull/2078#discussion_r354739608\n    for implementation details.\n    \"\"\"\n\n    def __init__(self, 
f: io.BufferedReader):\n        self.f = f\n        self.total_size = os.fstat(f.fileno()).st_size\n        self.pbar = tqdm(total=self.total_size, leave=False)\n        self.read = f.read\n        f.read = self._read\n\n    def _read(self, n=-1):\n        self.pbar.update(n)\n        return self.read(n)\n\n    def close(self):\n        self.pbar.close()\n\n\nclass HfFolder:\n    path_token = expanduser(\"~/.huggingface/token\")\n\n    @classmethod\n    def save_token(cls, token):\n        \"\"\"\n        Save token, creating folder as needed.\n        \"\"\"\n        os.makedirs(os.path.dirname(cls.path_token), exist_ok=True)\n        with open(cls.path_token, \"w+\") as f:\n            f.write(token)\n\n    @classmethod\n    def get_token(cls):\n        \"\"\"\n        Get token or None if not existent.\n        \"\"\"\n        try:\n            with open(cls.path_token, \"r\") as f:\n                return f.read()\n        except FileNotFoundError:\n            pass\n\n    @classmethod\n    def delete_token(cls):\n        \"\"\"\n        Delete token.\n        Do not fail if token does not exist.\n        \"\"\"\n        try:\n            os.remove(cls.path_token)\n        except FileNotFoundError:\n            pass\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/hf_argparser.py",
    "content": "import dataclasses\nimport json\nimport sys\nfrom argparse import ArgumentParser\nfrom enum import Enum\nfrom pathlib import Path\nfrom typing import Any, Iterable, List, NewType, Tuple, Union\n\n\nDataClass = NewType(\"DataClass\", Any)\nDataClassType = NewType(\"DataClassType\", Any)\n\n\nclass HfArgumentParser(ArgumentParser):\n    \"\"\"\n    This subclass of `argparse.ArgumentParser` uses type hints on dataclasses\n    to generate arguments.\n\n    The class is designed to play well with the native argparse. In particular,\n    you can add more (non-dataclass backed) arguments to the parser after initialization\n    and you'll get the output back after parsing as an additional namespace.\n    \"\"\"\n\n    dataclass_types: Iterable[DataClassType]\n\n    def __init__(self, dataclass_types: Union[DataClassType, Iterable[DataClassType]], **kwargs):\n        \"\"\"\n        Args:\n            dataclass_types:\n                Dataclass type, or list of dataclass types for which we will \"fill\" instances\n                with the parsed args.\n            kwargs:\n                (Optional) Passed to `argparse.ArgumentParser()` in the regular way.\n        \"\"\"\n        super().__init__(**kwargs)\n        if dataclasses.is_dataclass(dataclass_types):\n            dataclass_types = [dataclass_types]\n        self.dataclass_types = dataclass_types\n        for dtype in self.dataclass_types:\n            self._add_dataclass_arguments(dtype)\n\n    def _add_dataclass_arguments(self, dtype: DataClassType):\n        for field in dataclasses.fields(dtype):\n            field_name = f\"--{field.name}\"\n            kwargs = field.metadata.copy()\n            # field.metadata is not used at all by Data Classes,\n            # it is provided as a third-party extension mechanism.\n            if isinstance(field.type, str):\n                raise ImportError(\n                    \"This implementation is not compatible with Postponed Evaluation of Annotations (PEP 563),\"\n                    \"which can be opted in from Python 3.7 with `from __future__ import annotations`.\"\n                    \"We will add compatibility when Python 3.9 is released.\"\n                )\n            typestring = str(field.type)\n            for prim_type in (int, float, str):\n                for collection in (List,):\n                    if typestring == f\"typing.Union[{collection[prim_type]}, NoneType]\":\n                        field.type = collection[prim_type]\n                if typestring == f\"typing.Union[{prim_type.__name__}, NoneType]\":\n                    field.type = prim_type\n\n            if isinstance(field.type, type) and issubclass(field.type, Enum):\n                kwargs[\"choices\"] = list(field.type)\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n            elif field.type is bool:\n                kwargs[\"action\"] = \"store_false\" if field.default is True else \"store_true\"\n                if field.default is True:\n                    field_name = f\"--no-{field.name}\"\n                    kwargs[\"dest\"] = field.name\n            elif hasattr(field.type, \"__origin__\") and issubclass(field.type.__origin__, List):\n                kwargs[\"nargs\"] = \"+\"\n                kwargs[\"type\"] = field.type.__args__[0]\n                assert all(\n                    x == kwargs[\"type\"] for x in field.type.__args__\n                ), \"{} 
cannot be a List of mixed types\".format(field.name)\n                if field.default_factory is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default_factory()\n            else:\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n                else:\n                    kwargs[\"required\"] = True\n            self.add_argument(field_name, **kwargs)\n\n    def parse_args_into_dataclasses(\n        self, args=None, return_remaining_strings=False, look_for_args_file=True\n    ) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Parse command-line args into instances of the specified dataclass types.\n\n        This relies on argparse's `ArgumentParser.parse_known_args`.\n        See the doc at:\n        docs.python.org/3.7/library/argparse.html#argparse.ArgumentParser.parse_args\n\n        Args:\n            args:\n                List of strings to parse. The default is taken from sys.argv.\n                (same as argparse.ArgumentParser)\n            return_remaining_strings:\n                If true, also return a list of remaining argument strings.\n            look_for_args_file:\n                If true, will look for a \".args\" file with the same base name\n                as the entry point script for this process, and will append its\n                potential content to the command line args.\n\n        Returns:\n            Tuple consisting of:\n                - the dataclass instances in the same order as they\n                  were passed to the initializer.abspath\n                - if applicable, an additional namespace for more\n                  (non-dataclass backed) arguments added to the parser\n                  after initialization.\n                - The potential list of remaining argument strings.\n                  (same as argparse.ArgumentParser.parse_known_args)\n        \"\"\"\n        if look_for_args_file and len(sys.argv):\n            args_file = Path(sys.argv[0]).with_suffix(\".args\")\n            if args_file.exists():\n                fargs = args_file.read_text().split()\n                args = fargs + args if args is not None else fargs + sys.argv[1:]\n                # in case of duplicate arguments the first one has precedence\n                # so we append rather than prepend.\n        namespace, remaining_args = self.parse_known_args(args=args)\n        outputs = []\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in vars(namespace).items() if k in keys}\n            for k in keys:\n                delattr(namespace, k)\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        if len(namespace.__dict__) > 0:\n            # additional namespace.\n            outputs.append(namespace)\n        if return_remaining_strings:\n            return (*outputs, remaining_args)\n        else:\n            if remaining_args:\n                raise ValueError(f\"Some specified arguments are not used by the HfArgumentParser: {remaining_args}\")\n\n            return (*outputs,)\n\n    def parse_json_file(self, json_file: str) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Alternative helper method that does not use `argparse` at all,\n        instead loading a json file and populating the dataclass types.\n        \"\"\"\n        data = json.loads(Path(json_file).read_text())\n        outputs = 
[]\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in data.items() if k in keys}\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        return (*outputs,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modelcard.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    TF2_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModelCard:\n    r\"\"\" Structured Model Card class.\n        Store model card as well as methods for loading/downloading/saving model cards.\n\n        Please read the following paper for details and explanation on the sections:\n            \"Model Cards for Model Reporting\"\n                by Margaret Mitchell, Simone Wu,\n                Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer,\n                Inioluwa Deborah Raji and Timnit Gebru for the proposal behind model cards.\n            Link: https://arxiv.org/abs/1810.03993\n\n        Note:\n            A model card can be loaded and saved to disk.\n\n        Parameters:\n    \"\"\"\n\n    def __init__(self, **kwargs):\n        # Recomended attributes from https://arxiv.org/abs/1810.03993 (see papers)\n        self.model_details = kwargs.pop(\"model_details\", {})\n        self.intended_use = kwargs.pop(\"intended_use\", {})\n        self.factors = kwargs.pop(\"factors\", {})\n        self.metrics = kwargs.pop(\"metrics\", {})\n        self.evaluation_data = kwargs.pop(\"evaluation_data\", {})\n        self.training_data = kwargs.pop(\"training_data\", {})\n        self.quantitative_analyses = kwargs.pop(\"quantitative_analyses\", {})\n        self.ethical_considerations = kwargs.pop(\"ethical_considerations\", {})\n        self.caveats_and_recommendations = kwargs.pop(\"caveats_and_recommendations\", {})\n\n        # Open additional attributes\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    def save_pretrained(self, save_directory_or_file):\n        \"\"\" Save a model card object to the directory or file `save_directory_or_file`.\n        \"\"\"\n        if os.path.isdir(save_directory_or_file):\n            # If we save using the predefined names, we can load using `from_pretrained`\n            output_model_card_file = os.path.join(save_directory_or_file, MODEL_CARD_NAME)\n        else:\n            output_model_card_file = save_directory_or_file\n\n        self.to_json_file(output_model_card_file)\n        logger.info(\"Model card saved in {}\".format(output_model_card_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiate a :class:`~transformers1.ModelCard` from a pre-trained 
model model card.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model card to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model card that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing a model card file saved using the :func:`~transformers1.ModelCard.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - a path or url to a saved model card JSON `file`, e.g.: ``./my_model_directory/modelcard.json``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                card should be cached if the standard cache should not be used.\n\n            kwargs: (`optional`) dict: key/value pairs with which to update the ModelCard object after loading.\n\n                - The values in kwargs of any keys which are model card attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* model card attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            find_from_standard_name: (`optional`) boolean, default True:\n                If the pretrained_model_name_or_path ends with our standard model or config filenames, replace them with our standard modelcard filename.\n                Can be used to directly feed a model/config url and access the colocated modelcard.\n\n            return_unused_kwargs: (`optional`) bool:\n\n                - If False, then this function returns just the final model card object.\n                - If True, then this functions returns a tuple `(model card, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not model card attributes: ie the part of kwargs which has not been used to update `ModelCard` and is otherwise ignored.\n\n        Examples::\n\n            modelcard = ModelCard.from_pretrained('bert-base-uncased')    # Download model card from S3 and cache.\n            modelcard = ModelCard.from_pretrained('./test/saved_model/')  # E.g. model card was saved using `save_pretrained('./test/saved_model/')`\n            modelcard = ModelCard.from_pretrained('./test/saved_model/modelcard.json')\n            modelcard = ModelCard.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        proxies = kwargs.pop(\"proxies\", None)\n        find_from_standard_name = kwargs.pop(\"find_from_standard_name\", True)\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        if pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            # For simplicity we use the same pretrained url than the configuration files\n            # but with a different suffix (modelcard.json). 
This suffix is replaced below.\n            model_card_file = ALL_PRETRAINED_CONFIG_ARCHIVE_MAP[pretrained_model_name_or_path]\n        elif os.path.isdir(pretrained_model_name_or_path):\n            model_card_file = os.path.join(pretrained_model_name_or_path, MODEL_CARD_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            model_card_file = pretrained_model_name_or_path\n        else:\n            model_card_file = hf_bucket_url(pretrained_model_name_or_path, filename=MODEL_CARD_NAME, use_cdn=False)\n\n        if find_from_standard_name or pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            model_card_file = model_card_file.replace(CONFIG_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(WEIGHTS_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(TF2_WEIGHTS_NAME, MODEL_CARD_NAME)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_model_card_file = cached_path(\n                model_card_file, cache_dir=cache_dir, force_download=True, proxies=proxies, resume_download=False\n            )\n            if resolved_model_card_file is None:\n                raise EnvironmentError\n            if resolved_model_card_file == model_card_file:\n                logger.info(\"loading model card file {}\".format(model_card_file))\n            else:\n                logger.info(\n                    \"loading model card file {} from cache at {}\".format(model_card_file, resolved_model_card_file)\n                )\n            # Load model card\n            modelcard = cls.from_json_file(resolved_model_card_file)\n\n        except (EnvironmentError, json.JSONDecodeError):\n            # We fall back on creating an empty model card\n            modelcard = cls()\n\n        # Update model card with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(modelcard, key):\n                setattr(modelcard, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model card: %s\", str(modelcard))\n        if return_unused_kwargs:\n            return modelcard, kwargs\n        else:\n            return modelcard\n\n    @classmethod\n    def from_dict(cls, json_object):\n        \"\"\"Constructs a `ModelCard` from a Python dictionary of parameters.\"\"\"\n        return cls(**json_object)\n\n    @classmethod\n    def from_json_file(cls, json_file):\n        \"\"\"Constructs a `ModelCard` from a json file of parameters.\"\"\"\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        dict_obj = json.loads(text)\n        return cls(**dict_obj)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return str(self.to_json_string())\n\n    def to_dict(self):\n        \"\"\"Serializes this instance to a Python dictionary.\"\"\"\n        output = copy.deepcopy(self.__dict__)\n        return output\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path):\n        \"\"\" Save this instance to a json file.\"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            
writer.write(self.to_json_string())\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch ALBERT model. \"\"\"\n\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import ACT2FN, BertEmbeddings, BertSelfAttention, prune_linear_layer\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\ndef load_tf_weights_in_albert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        print(name)\n\n    for name, array in zip(names, arrays):\n        original_name = name\n\n        # If saved from the TF HUB module\n        name = name.replace(\"module/\", \"\")\n\n        # Renaming and simplifying\n        name = name.replace(\"ffn_1\", \"ffn\")\n        name = name.replace(\"bert/\", \"albert/\")\n        name = name.replace(\"attention_1\", \"attention\")\n        name = name.replace(\"transform/\", \"\")\n        name = name.replace(\"LayerNorm_1\", \"full_layer_layer_norm\")\n        name = name.replace(\"LayerNorm\", \"attention/LayerNorm\")\n        name = name.replace(\"transformer/\", \"\")\n\n        # The feed forward layer had an 'intermediate' step which has been abstracted away\n        name = name.replace(\"intermediate/dense/\", \"\")\n        name = name.replace(\"ffn/intermediate/output/dense/\", \"ffn_output/\")\n\n        # ALBERT attention was split between self and output which have been abstracted away\n        name = name.replace(\"/output/\", \"/\")\n        name = name.replace(\"/self/\", \"/\")\n\n        # The pooler is a linear layer\n        name = name.replace(\"pooler/dense\", \"pooler\")\n\n        # The classifier was simplified to predictions from cls/predictions\n        name = name.replace(\"cls/predictions\", \"predictions\")\n        name = name.replace(\"predictions/attention\", \"predictions\")\n\n        # Naming was changed to be more explicit\n        name = name.replace(\"embeddings/attention\", \"embeddings\")\n        name = name.replace(\"inner_group_\", \"albert_layers/\")\n        name = name.replace(\"group_\", \"albert_layer_groups/\")\n\n        # Classifier\n        if len(name.split(\"/\")) == 1 and (\"output_bias\" in name or \"output_weights\" in name):\n            name = \"classifier/\" + name\n\n        # No ALBERT model currently handles the next sentence prediction task\n        if \"seq_relationship\" in name:\n            name = name.replace(\"seq_relationship/output_\", \"sop_classifier/classifier/\")\n            name = name.replace(\"weights\", \"weight\")\n\n        name = name.split(\"/\")\n\n        # Ignore the gradients applied by the LAMB/ADAM optimizers.\n        if (\n            \"adam_m\" in name\n            or \"adam_v\" in name\n            or \"AdamWeightDecayOptimizer\" in name\n            or \"AdamWeightDecayOptimizer_1\" in name\n            or \"global_step\" in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            
elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        print(\"Initialize PyTorch weight {} from {}\".format(name, original_name))\n        pointer.data = torch.from_numpy(array)\n\n    return model\n\n\nclass AlbertEmbeddings(BertEmbeddings):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n        self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass AlbertAttention(BertSelfAttention):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.output_attentions = config.output_attentions\n        self.num_attention_heads = config.num_attention_heads\n        self.hidden_size = config.hidden_size\n        self.attention_head_size = config.hidden_size // config.num_attention_heads\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.num_attention_heads, self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.query = prune_linear_layer(self.query, index)\n        self.key = prune_linear_layer(self.key, index)\n        self.value = prune_linear_layer(self.value, index)\n        self.dense = prune_linear_layer(self.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.num_attention_heads = self.num_attention_heads - len(heads)\n        self.all_head_size = self.attention_head_size * 
self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input_ids, attention_mask=None, head_mask=None):\n        mixed_query_layer = self.query(input_ids)\n        mixed_key_layer = self.key(input_ids)\n        mixed_value_layer = self.value(input_ids)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n\n        # Should find a better way to do this\n        w = (\n            self.dense.weight.t()\n            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)\n            .to(context_layer.dtype)\n        )\n        b = self.dense.bias.to(context_layer.dtype)\n\n        projected_context_layer = torch.einsum(\"bfnd,ndh->bfh\", context_layer, w) + b\n        projected_context_layer_dropout = self.dropout(projected_context_layer)\n        layernormed_context_layer = self.LayerNorm(input_ids + projected_context_layer_dropout)\n        return (layernormed_context_layer, attention_probs) if self.output_attentions else (layernormed_context_layer,)\n\n\nclass AlbertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.full_layer_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.attention = AlbertAttention(config)\n        self.ffn = nn.Linear(config.hidden_size, config.intermediate_size)\n        self.ffn_output = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        attention_output = self.attention(hidden_states, attention_mask, head_mask)\n        ffn_output = self.ffn(attention_output[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_output[0])\n\n        return (hidden_states,) + attention_output[1:]  # add attentions if we output them\n\n\nclass AlbertLayerGroup(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = 
nn.ModuleList([AlbertLayer(config) for _ in range(config.inner_group_num)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer(hidden_states, attention_mask, head_mask[layer_index])\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        return outputs  # last-layer hidden state, (layer hidden states), (layer attentions)\n\n\nclass AlbertTransformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = nn.Linear(config.embedding_size, config.hidden_size)\n        self.albert_layer_groups = nn.ModuleList([AlbertLayerGroup(config) for _ in range(config.num_hidden_groups)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                hidden_states,\n                attention_mask,\n                head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass AlbertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare ALBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertModel(AlbertPreTrainedModel):\n\n    config_class = AlbertConfig\n    load_tf_weights = load_tf_weights_in_albert\n    base_model_prefix = \"albert\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.config = config\n        self.embeddings = AlbertEmbeddings(config)\n        self.encoder = AlbertTransformer(config)\n        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)\n        self.pooler_activation = nn.Tanh()\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.embeddings.word_embeddings\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.embeddings.word_embeddings = new_embeddings\n        return self.embeddings.word_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            ALBERT has a different architecture in that its layers are shared across groups, which then has inner groups.\n            If an ALBERT model has 12 hidden layers and 2 hidden groups, with two inner groups, there\n            is a total of 4 different layers.\n\n            These layers are flattened: the indices [0,1] correspond to the two inner groups of the first hidden layer,\n            while [2,3] correspond to the two inner groups of the second hidden layer.\n\n            Any layer with in index other than [0,1,2,3] will result in an error.\n            See base class PreTrainedModel for more information about head pruning\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            group_idx = int(layer / self.config.inner_group_num)\n            inner_group_idx = int(layer - group_idx * self.config.inner_group_num)\n            self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` 
comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertModel, AlbertTokenizer\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertModel.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids, position_ids=position_ids, 
token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(embedding_output, extended_attention_mask, head_mask=head_mask)\n\n        sequence_output = encoder_outputs[0]\n\n        pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))\n\n        outputs = (sequence_output, pooled_output) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `sentence order prediction (classification)` head. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForPreTraining(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n        self.sop_classifier = AlbertSOPHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        sentence_order_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates original order (sequence A, then sequence B),\n            ``1`` indicates switched order (sequence B, then sequence A).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForPreTraining\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForPreTraining.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, sop_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output)\n\n        outputs = (prediction_scores, sop_scores,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and sentence_order_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            sentence_order_loss = loss_fct(sop_scores.view(-1, 2), sentence_order_label.view(-1))\n            total_loss = masked_lm_loss + sentence_order_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, 
sop_scores, (hidden_states), (attentions)\n\n\nclass AlbertMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = nn.LayerNorm(config.embedding_size)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n        self.decoder = nn.Linear(config.embedding_size, config.vocab_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n\n        prediction_scores = hidden_states\n\n        return prediction_scores\n\n\nclass AlbertSOPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, pooled_output):\n        dropout_pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\n    \"Albert Model with a `language modeling` head on top.\", ALBERT_START_DOCSTRING,\n)\nclass AlbertForMaskedLM(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with\n            labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the 
embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertTokenizer, AlbertForMaskedLM\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_outputs = outputs[0]\n\n        prediction_scores = self.predictions(sequence_outputs)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForSequenceClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification (or regression if config.num_labels==1) loss.\n        logits ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import AlbertTokenizer, AlbertForSequenceClassification\n            import torch\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n            labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, labels=labels)\n            loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = 
self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForTokenClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForTokenClassification\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForTokenClassification.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForQuestionAnswering(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        
start_scores ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-start scores (before SoftMax).\n        end_scores: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import AlbertTokenizer, AlbertForQuestionAnswering\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_dict = tokenizer.encode_plus(question, text, return_tensors='pt')\n        start_scores, end_scores = model(**input_dict)\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    EncoderDecoderConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_albert import (\n    AlbertForMaskedLM,\n    AlbertForPreTraining,\n    AlbertForQuestionAnswering,\n    AlbertForSequenceClassification,\n    AlbertForTokenClassification,\n    AlbertModel,\n)\nfrom .modeling_bart import BartForConditionalGeneration, BartForSequenceClassification, BartModel\nfrom .modeling_bert import (\n    BertForMaskedLM,\n    BertForMultipleChoice,\n    BertForPreTraining,\n    BertForQuestionAnswering,\n    BertForSequenceClassification,\n    BertForTokenClassification,\n    BertModel,\n)\nfrom .modeling_camembert import (\n    CamembertForMaskedLM,\n    CamembertForMultipleChoice,\n    CamembertForSequenceClassification,\n    CamembertForTokenClassification,\n    CamembertModel,\n)\nfrom .modeling_ctrl import CTRLLMHeadModel, CTRLModel\nfrom .modeling_distilbert import (\n    DistilBertForMaskedLM,\n    DistilBertForQuestionAnswering,\n    DistilBertForSequenceClassification,\n    DistilBertForTokenClassification,\n    DistilBertModel,\n)\nfrom .modeling_electra import (\n    ElectraForMaskedLM,\n    ElectraForPreTraining,\n    ElectraForSequenceClassification,\n    ElectraForTokenClassification,\n    ElectraModel,\n)\nfrom .modeling_encoder_decoder import EncoderDecoderModel\nfrom .modeling_flaubert import (\n    FlaubertForQuestionAnsweringSimple,\n    FlaubertForSequenceClassification,\n    FlaubertModel,\n    FlaubertWithLMHeadModel,\n)\nfrom .modeling_gpt2 import GPT2LMHeadModel, GPT2Model\nfrom .modeling_longformer import (\n    LongformerForMaskedLM,\n    LongformerForMultipleChoice,\n    LongformerForQuestionAnswering,\n    LongformerForSequenceClassification,\n    LongformerForTokenClassification,\n    LongformerModel,\n)\nfrom .modeling_marian import MarianMTModel\nfrom .modeling_openai import OpenAIGPTLMHeadModel, OpenAIGPTModel\nfrom .modeling_reformer import ReformerModel, ReformerModelWithLMHead\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\nfrom .modeling_t5 import T5ForConditionalGeneration, T5Model\nfrom .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel\nfrom .modeling_xlm import (\n    
XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMForTokenClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n)\nfrom .modeling_xlm_roberta import (\n    XLMRobertaForMaskedLM,\n    XLMRobertaForMultipleChoice,\n    XLMRobertaForSequenceClassification,\n    XLMRobertaForTokenClassification,\n    XLMRobertaModel,\n)\nfrom .modeling_xlnet import (\n    XLNetForMultipleChoice,\n    XLNetForQuestionAnsweringSimple,\n    XLNetForSequenceClassification,\n    XLNetForTokenClassification,\n    XLNetLMHeadModel,\n    XLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nMODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, T5Model),\n        (DistilBertConfig, DistilBertModel),\n        (AlbertConfig, AlbertModel),\n        (CamembertConfig, CamembertModel),\n        (XLMRobertaConfig, XLMRobertaModel),\n        (BartConfig, BartModel),\n        (LongformerConfig, LongformerModel),\n        (RobertaConfig, RobertaModel),\n        (BertConfig, BertModel),\n        (OpenAIGPTConfig, OpenAIGPTModel),\n        (GPT2Config, GPT2Model),\n        (TransfoXLConfig, TransfoXLModel),\n        (XLNetConfig, XLNetModel),\n        (FlaubertConfig, FlaubertModel),\n        (XLMConfig, XLMModel),\n        (CTRLConfig, CTRLModel),\n        (ElectraConfig, ElectraModel),\n        (ReformerConfig, ReformerModel),\n    ]\n)\n\nMODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForPreTraining),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForPreTraining),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForPreTraining),\n    ]\n)\n\nMODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForMaskedLM),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (MarianConfig, MarianMTModel),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForMaskedLM),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForMaskedLM),\n        (EncoderDecoderConfig, EncoderDecoderModel),\n        (ReformerConfig, ReformerModelWithLMHead),\n    ]\n)\n\nMODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForSequenceClassification),\n        (AlbertConfig, AlbertForSequenceClassification),\n        (CamembertConfig, CamembertForSequenceClassification),\n        (XLMRobertaConfig, 
XLMRobertaForSequenceClassification),\n        (BartConfig, BartForSequenceClassification),\n        (LongformerConfig, LongformerForSequenceClassification),\n        (RobertaConfig, RobertaForSequenceClassification),\n        (BertConfig, BertForSequenceClassification),\n        (XLNetConfig, XLNetForSequenceClassification),\n        (FlaubertConfig, FlaubertForSequenceClassification),\n        (XLMConfig, XLMForSequenceClassification),\n        (ElectraConfig, ElectraForSequenceClassification),\n    ]\n)\n\nMODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForQuestionAnswering),\n        (AlbertConfig, AlbertForQuestionAnswering),\n        (LongformerConfig, LongformerForQuestionAnswering),\n        (RobertaConfig, RobertaForQuestionAnswering),\n        (BertConfig, BertForQuestionAnswering),\n        (XLNetConfig, XLNetForQuestionAnsweringSimple),\n        (FlaubertConfig, FlaubertForQuestionAnsweringSimple),\n        (XLMConfig, XLMForQuestionAnsweringSimple),\n    ]\n)\n\nMODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForTokenClassification),\n        (CamembertConfig, CamembertForTokenClassification),\n        (XLMConfig, XLMForTokenClassification),\n        (XLMRobertaConfig, XLMRobertaForTokenClassification),\n        (LongformerConfig, LongformerForTokenClassification),\n        (RobertaConfig, RobertaForTokenClassification),\n        (BertConfig, BertForTokenClassification),\n        (XLNetConfig, XLNetForTokenClassification),\n        (AlbertConfig, AlbertForTokenClassification),\n        (ElectraConfig, ElectraForTokenClassification),\n    ]\n)\n\n\nMODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [\n        (CamembertConfig, CamembertForMultipleChoice),\n        (XLMRobertaConfig, XLMRobertaForMultipleChoice),\n        (LongformerConfig, LongformerForMultipleChoice),\n        (RobertaConfig, RobertaForMultipleChoice),\n        (BertConfig, BertForMultipleChoice),\n        (XLNetConfig, XLNetForMultipleChoice),\n    ]\n)\n\n\nclass AutoModel:\n    r\"\"\"\n        :class:`~transformers1.AutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        or the `AutoModel.from_config(config)` class methods.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModel is designed to be instantiated \"\n            \"using the `AutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerModel` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaModel` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModel` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraModel` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Model` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertModel` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertModel` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaModel` (XLM-RoBERTa model)\n            - `longformer` :class:`~transformers1.LongformerModel` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaModel` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertModel` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n            - `flaubert`: :class:`~transformers1.FlaubertModel` (Flaubert  model)\n            - `electra`: :class:`~transformers1.ElectraModel` (Electra  model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForPreTraining:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForPreTraining is designed to be instantiated \"\n            \"using the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelWithLMHead:\n    r\"\"\"\n        :class:`~transformers1.AutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelWithLMHead is designed to be instantiated \"\n            \"using the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForMaskedLM` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForMaskedLM` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelWithLMHead.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForSequenceClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForSequenceClassification` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n            - `roberta`: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n            - `xlnet`: 
:class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n            - `flaubert`: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForSequenceClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForQuestionAnswering:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForQuestionAnswering` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForQuestionAnswering` (Flaubert model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n            - `bert`: :class:`~transformers1.BertForQuestionAnswering` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n            - `flaubert`: :class:`~transformers1.FlaubertForQuestionAnswering` (Flaubert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a 
`directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForQuestionAnswering.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForTokenClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForTokenClassification` is a generic model class\n        that will be instantiated as one of the token classification model classes of the library\n        when created with the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForTokenClassification` (DistilBERT model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaForTokenClassification` (XLMRoberta model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForTokenClassification` (Bert model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForTokenClassification` (ALBERT model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForTokenClassification` (XLNet model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertForTokenClassification` (Camembert model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForTokenClassification` (Roberta model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the token classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForTokenClassification` (DistilBERT model)\n            - `xlm`: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForTokenClassification` (XLM-RoBERTa model)\n            - `camembert`: :class:`~transformers1.CamembertForTokenClassification` (Camembert model)\n            - `bert`: :class:`~transformers1.BertForTokenClassification` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForTokenClassification` (XLNet model)\n            - `roberta`: :class:`~transformers1.RobertaForTokenClassification` (Roberta model)\n      
      - `electra`: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n      
      model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: 
{}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BART model, ported from the fairseq repo.\"\"\"\nimport logging\nimport math\nimport random\nfrom typing import Dict, List, Optional, Tuple\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor, nn\n\nfrom .activations import ACT2FN\nfrom .configuration_bart import BartConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\n\nBART_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n    \"facebook/mbart-large-en-ro\",\n    # See all BART models at https://huggingface.co/models?filter=bart\n]\n\n\nBART_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matters related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BartConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\n\"\"\"\nBART_GENERATION_EXAMPLE = r\"\"\"\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForConditionalGeneration, BartConfig\n        # see ``examples/summarization/bart/evaluate_cnn.py`` for a longer example\n        model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')\n        tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')\n        ARTICLE_TO_SUMMARIZE = \"My friends are cool but they eat too many carbs.\"\n        inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')\n        # Generate Summary\n        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)\n        print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])\n\n\"\"\"\n\nBART_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n               Indices of input sequence tokens in the vocabulary. 
Use BartTokenizer.encode to produce them.\n            Padding will be ignored by default should you provide it.\n            Indices can be obtained using :class:`transformers1.BartTokenizer.encode(text)`.\n        attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices in input_ids.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            If you want to change padding behavior, you should read :func:`~transformers1.modeling_bart._prepare_decoder_inputs` and modify.\n            See diagram 1 in the paper for more info on the default strategy\n\"\"\"\n\n\ndef invert_mask(attention_mask):\n    assert attention_mask.dim() == 2\n    return attention_mask.eq(0)\n\n\ndef _prepare_bart_decoder_inputs(\n    config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32\n):\n    \"\"\"Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if\n    none are provided. This mimics the default behavior in fairseq. 
To override it pass in masks.\n    Note: this is not called during generation\n    \"\"\"\n    pad_token_id = config.pad_token_id\n    if decoder_input_ids is None:\n        decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)\n    bsz, tgt_len = decoder_input_ids.size()\n    if decoder_padding_mask is None:\n        decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)\n    else:\n        decoder_padding_mask = invert_mask(decoder_padding_mask)\n    causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(\n        dtype=causal_mask_dtype, device=decoder_input_ids.device\n    )\n    return decoder_input_ids, decoder_padding_mask, causal_mask\n\n\nclass PretrainedBartModel(PreTrainedModel):\n    config_class = BartConfig\n    base_model_prefix = \"model\"\n\n    def _init_weights(self, module):\n        std = self.config.init_std\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, SinusoidalPositionalEmbedding):\n            pass\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n\n    @property\n    def dummy_inputs(self):\n        pad_token = self.config.pad_token_id\n        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)\n        dummy_inputs = {\n            \"attention_mask\": input_ids.ne(pad_token),\n            \"input_ids\": input_ids,\n        }\n        return dummy_inputs\n\n\ndef _make_linear_from_emb(emb):\n    vocab_size, emb_size = emb.weight.shape\n    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)\n    lin_layer.weight.data = emb.weight.data\n    return lin_layer\n\n\n# Helper Functions, mostly for making masks\ndef _check_shapes(shape_1, shape2):\n    if shape_1 != shape2:\n        raise AssertionError(\"shape mismatch: {} != {}\".format(shape_1, shape2))\n\n\ndef shift_tokens_right(input_ids, pad_token_id):\n    \"\"\"Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).\"\"\"\n    prev_output_tokens = input_ids.clone()\n    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)\n    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()\n    prev_output_tokens[:, 1:] = input_ids[:, :-1]\n    return prev_output_tokens\n\n\ndef make_padding_mask(input_ids, padding_idx=1):\n    \"\"\"True for pad tokens\"\"\"\n    padding_mask = input_ids.eq(padding_idx)\n    if not padding_mask.any():\n        padding_mask = None\n    return padding_mask\n\n\n# Helper Modules\n\n\nclass EncoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.normalize_before = config.normalize_before\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)\n        
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(self, x, encoder_padding_mask):\n        \"\"\"\n        Args:\n            x (Tensor): input to the layer of shape `(seq_len, batch, embed_dim)`\n            encoder_padding_mask (ByteTensor): binary ByteTensor of shape\n                `(batch, src_len)` where padding elements are indicated by ``1``.\n            for t_tgt, t_src is excluded (or masked out), =0 means it is\n            included in attention\n\n        Returns:\n            encoded output of shape `(seq_len, batch, embed_dim)`\n        \"\"\"\n        residual = x\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        x, attn_weights = self.self_attn(\n            query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return x, attn_weights\n\n\nclass BartEncoder(nn.Module):\n    \"\"\"\n    Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer\n    is a :class:`EncoderLayer`.\n\n    Args:\n        config: BartConfig\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens):\n        super().__init__()\n\n        self.dropout = config.dropout\n        self.layerdrop = config.encoder_layerdrop\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        embed_dim = embed_tokens.embedding_dim\n        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_source_positions = config.max_position_embeddings\n\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx,\n            )\n        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.encoder_layers)])\n        self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity()\n        # mbart has one extra layer_norm\n        self.layer_norm = LayerNorm(config.d_model) if config.normalize_before else None\n\n    def forward(\n        self, input_ids, attention_mask=None,\n    ):\n        \"\"\"\n        Args:\n            input_ids (LongTensor): tokens in the source language of shape\n                `(batch, src_len)`\n            attention_mask (torch.LongTensor): indicating which indices are padding tokens.\n        Returns:\n            Tuple comprised of:\n                - **x** (Tensor): the last encoder layer's output of\n                  shape 
`(src_len, batch, embed_dim)`\n                - **encoder_states** (List[Tensor]): all intermediate\n                  hidden states of shape `(src_len, batch, embed_dim)`.\n                  Only populated if *self.output_hidden_states:* is True.\n                - **all_attentions** (List[Tensor]): Attention weights for each layer.\n                During training might not be of length n_layers because of layer dropout.\n        \"\"\"\n        # check attention mask and invert\n        if attention_mask is not None:\n            attention_mask = invert_mask(attention_mask)\n\n        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale\n        embed_pos = self.embed_positions(input_ids)\n        x = inputs_embeds + embed_pos\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # B x T x C -> T x B x C\n        x = x.transpose(0, 1)\n\n        encoder_states, all_attentions = [], []\n        for encoder_layer in self.layers:\n            if self.output_hidden_states:\n                encoder_states.append(x)\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):  # skip the layer\n                attn = None\n            else:\n                x, attn = encoder_layer(x, attention_mask)\n\n            if self.output_attentions:\n                all_attentions.append(attn)\n\n        if self.layer_norm:\n            x = self.layer_norm(x)\n        if self.output_hidden_states:\n            encoder_states.append(x)\n\n        # T x B x C -> B x T x C\n        encoder_states = [hidden_state.transpose(0, 1) for hidden_state in encoder_states]\n        x = x.transpose(0, 1)\n\n        return x, encoder_states, all_attentions\n\n\nclass DecoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.normalize_before = config.normalize_before\n\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.encoder_attn = SelfAttention(\n            self.embed_dim,\n            config.decoder_attention_heads,\n            dropout=config.attention_dropout,\n            encoder_decoder_attention=True,\n        )\n        self.encoder_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)\n        self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(\n        self,\n        x,\n        encoder_hidden_states,\n        encoder_attn_mask=None,\n        layer_state=None,\n        causal_mask=None,\n        decoder_padding_mask=None,\n    ):\n        residual = x\n\n        if layer_state is None:\n            layer_state = {}\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        # Self Attention\n\n        x, self_attn_weights = self.self_attn(\n            query=x,\n            key=x,\n            
layer_state=layer_state,  # adds keys to layer state\n            key_padding_mask=decoder_padding_mask,\n            attn_mask=causal_mask,\n            need_weights=self.output_attentions,\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        # Cross attention\n        residual = x\n        assert self.encoder_attn.cache_key != self.self_attn.cache_key\n        if self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n        x, _ = self.encoder_attn(\n            query=x,\n            key=encoder_hidden_states,\n            key_padding_mask=encoder_attn_mask,\n            layer_state=layer_state,  # mutates layer state\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n\n        # Fully Connected\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return (\n            x,\n            self_attn_weights,\n            layer_state,\n        )  # just self_attn weights for now, following t5, layer_state = cache for decoding\n\n\nclass BartDecoder(nn.Module):\n    \"\"\"\n    Transformer decoder consisting of *config.decoder_layers* layers. Each layer\n    is a :class:`DecoderLayer`.\n    Args:\n        config: BartConfig\n        embed_tokens (torch.nn.Embedding): output embedding\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens: nn.Embedding):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.dropout = config.dropout\n        self.layerdrop = config.decoder_layerdrop\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_target_positions = config.max_position_embeddings\n        self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, config.pad_token_id\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, self.padding_idx,\n            )\n        self.layers = nn.ModuleList(\n            [DecoderLayer(config) for _ in range(config.decoder_layers)]\n        )  # type: List[DecoderLayer]\n        self.layernorm_embedding = LayerNorm(config.d_model) if config.normalize_embedding else nn.Identity()\n        self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else None\n\n    def forward(\n        self,\n        input_ids,\n        encoder_hidden_states,\n        encoder_padding_mask,\n        decoder_padding_mask,\n        decoder_causal_mask,\n        decoder_cached_states=None,\n        use_cache=False,\n        **unused\n    ):\n        \"\"\"\n        Includes several features from \"Jointly Learning to 
Align and\n        Translate with Transformer Models\" (Garg et al., EMNLP 2019).\n\n        Args:\n            input_ids (LongTensor): previous decoder outputs of shape\n                `(batch, tgt_len)`, for teacher forcing\n            encoder_hidden_states: output from the encoder, used for\n                encoder-side attention\n            encoder_padding_mask: for ignoring pad tokens\n            decoder_cached_states (dict or None): dictionary used for storing state during generation\n\n        Returns:\n            tuple:\n                - the decoder's features of shape `(batch, tgt_len, embed_dim)`\n                - hidden states\n                - attentions\n        \"\"\"\n        # check attention mask and invert\n        if encoder_padding_mask is not None:\n            encoder_padding_mask = invert_mask(encoder_padding_mask)\n\n        # embed positions\n        positions = self.embed_positions(input_ids, use_cache=use_cache)\n\n        if use_cache:\n            input_ids = input_ids[:, -1:]\n            positions = positions[:, -1:]  # happens after we embed them\n            # assert input_ids.ne(self.padding_idx).any()\n\n        x = self.embed_tokens(input_ids) * self.embed_scale\n        x += positions\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # Convert to Bart output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        # decoder layers\n        all_hidden_states = ()\n        all_self_attns = ()\n        next_decoder_cache = []\n        for idx, decoder_layer in enumerate(self.layers):\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            if self.output_hidden_states:\n                all_hidden_states += (x,)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            layer_state = decoder_cached_states[idx] if decoder_cached_states is not None else None\n\n            x, layer_self_attn, layer_past = decoder_layer(\n                x,\n                encoder_hidden_states,\n                encoder_attn_mask=encoder_padding_mask,\n                decoder_padding_mask=decoder_padding_mask,\n                layer_state=layer_state,\n                causal_mask=decoder_causal_mask,\n            )\n\n            if use_cache:\n                next_decoder_cache.append(layer_past.copy())\n\n            if self.layer_norm and (idx == len(self.layers) - 1):  # last layer of mbart\n                x = self.layer_norm(x)\n            if self.output_attentions:\n                all_self_attns += (layer_self_attn,)\n\n        # Convert to standard output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        all_hidden_states = [hidden_state.transpose(0, 1) for hidden_state in all_hidden_states]\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        if use_cache:\n            next_cache = ((encoder_hidden_states, encoder_padding_mask), next_decoder_cache)\n        else:\n            next_cache = None\n        return x, next_cache, all_hidden_states, list(all_self_attns)\n\n\ndef _reorder_buffer(attn_cache, new_order):\n    for k, input_buffer_k in attn_cache.items():\n        if input_buffer_k is not None:\n            attn_cache[k] = 
input_buffer_k.index_select(0, new_order)\n    return attn_cache\n\n\nclass SelfAttention(nn.Module):\n    \"\"\"Multi-headed attention from 'Attention Is All You Need' paper\"\"\"\n\n    def __init__(\n        self,\n        embed_dim,\n        num_heads,\n        dropout=0.0,\n        bias=True,\n        encoder_decoder_attention=False,  # otherwise self_attention\n    ):\n        super().__init__()\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.dropout = dropout\n        self.head_dim = embed_dim // num_heads\n        assert self.head_dim * num_heads == self.embed_dim, \"embed_dim must be divisible by num_heads\"\n        self.scaling = self.head_dim ** -0.5\n\n        self.encoder_decoder_attention = encoder_decoder_attention\n        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.cache_key = \"encoder_decoder\" if self.encoder_decoder_attention else \"self\"\n\n    def _shape(self, tensor, dim_0, bsz):\n        return tensor.contiguous().view(dim_0, bsz * self.num_heads, self.head_dim).transpose(0, 1)\n\n    def forward(\n        self,\n        query,\n        key: Optional[Tensor],\n        key_padding_mask: Optional[Tensor] = None,\n        layer_state: Optional[Dict[str, Optional[Tensor]]] = None,\n        attn_mask: Optional[Tensor] = None,\n        need_weights=False,\n    ) -> Tuple[Tensor, Optional[Tensor]]:\n        \"\"\"Input shape: Time(SeqLen) x Batch x Channel\"\"\"\n        static_kv: bool = self.encoder_decoder_attention\n        tgt_len, bsz, embed_dim = query.size()\n        assert embed_dim == self.embed_dim\n        assert list(query.size()) == [tgt_len, bsz, embed_dim]\n        # get here for encoder decoder cause of static_kv\n        if layer_state is not None:  # reuse k,v and encoder_padding_mask\n            saved_state = layer_state.get(self.cache_key, {})\n            if \"prev_key\" in saved_state:\n                # previous time steps are cached - no need to recompute key and value if they are static\n                if static_kv:\n                    key = None\n        else:\n            saved_state = None\n            layer_state = {}\n\n        q = self.q_proj(query) * self.scaling\n        if static_kv:\n            if key is None:\n                k = v = None\n            else:\n                k = self.k_proj(key)\n                v = self.v_proj(key)\n        else:\n            k = self.k_proj(query)\n            v = self.v_proj(query)\n\n        q = self._shape(q, tgt_len, bsz)\n        if k is not None:\n            k = self._shape(k, -1, bsz)\n        if v is not None:\n            v = self._shape(v, -1, bsz)\n\n        if saved_state is not None:\n            k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)\n\n        # Update cache\n        layer_state[self.cache_key] = {\n            \"prev_key\": k.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_value\": v.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_key_padding_mask\": key_padding_mask if not static_kv else None,\n        }\n\n        assert k is not None\n        src_len = k.size(1)\n        attn_weights = torch.bmm(q, k.transpose(1, 2))\n        assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)\n\n        if attn_mask 
is not None:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_mask\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n\n        # This is part of a workaround to get around fork/join parallelism not supporting Optional types.\n        if key_padding_mask is not None and key_padding_mask.dim() == 0:\n            key_padding_mask = None\n        assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)\n\n        if key_padding_mask is not None:  # don't attend to padding symbols\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n            reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)\n            attn_weights = attn_weights.masked_fill(reshaped, float(\"-inf\"))\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n        attn_weights = F.softmax(attn_weights, dim=-1)\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)\n\n        assert v is not None\n        attn_output = torch.bmm(attn_probs, v)\n        assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)\n        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)\n        attn_output = self.out_proj(attn_output)\n        if need_weights:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n        else:\n            attn_weights = None\n        return attn_output, attn_weights\n\n    def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):\n        # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)\n        if \"prev_key\" in saved_state:\n            _prev_key = saved_state[\"prev_key\"]\n            assert _prev_key is not None\n            prev_key = _prev_key.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                k = prev_key\n            else:\n                assert k is not None\n                k = torch.cat([prev_key, k], dim=1)\n        if \"prev_value\" in saved_state:\n            _prev_value = saved_state[\"prev_value\"]\n            assert _prev_value is not None\n            prev_value = _prev_value.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                v = prev_value\n            else:\n                assert v is not None\n                v = torch.cat([prev_value, v], dim=1)\n        assert k is not None and v is not None\n        prev_key_padding_mask: Optional[Tensor] = saved_state.get(\"prev_key_padding_mask\", None)\n        key_padding_mask = self._cat_prev_key_padding_mask(\n            key_padding_mask, prev_key_padding_mask, bsz, k.size(1), static_kv\n        )\n        return k, v, key_padding_mask\n\n    @staticmethod\n    def _cat_prev_key_padding_mask(\n        key_padding_mask: Optional[Tensor],\n        prev_key_padding_mask: Optional[Tensor],\n        batch_size: int,\n        src_len: int,\n        static_kv: bool,\n    ) -> Optional[Tensor]:\n        # saved key padding masks have shape (bsz, seq_len)\n        if prev_key_padding_mask is not None:\n            if static_kv:\n                new_key_padding_mask = prev_key_padding_mask\n            else:\n                new_key_padding_mask = torch.cat([prev_key_padding_mask, key_padding_mask], dim=1)\n\n        elif key_padding_mask is not None:\n            filler = torch.zeros(\n                batch_size,\n                src_len - 
key_padding_mask.size(1),\n                dtype=key_padding_mask.dtype,\n                device=key_padding_mask.device,\n            )\n            new_key_padding_mask = torch.cat([filler, key_padding_mask], dim=1)\n        else:\n            new_key_padding_mask = prev_key_padding_mask\n        return new_key_padding_mask\n\n\nclass BartClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    # This can trivially be shared with RobertaClassificationHead\n\n    def __init__(\n        self, input_dim, inner_dim, num_classes, pooler_dropout,\n    ):\n        super().__init__()\n        self.dense = nn.Linear(input_dim, inner_dim)\n        self.dropout = nn.Dropout(p=pooler_dropout)\n        self.out_proj = nn.Linear(inner_dim, num_classes)\n\n    def forward(self, x):\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\nclass LearnedPositionalEmbedding(nn.Embedding):\n    \"\"\"\n    This module learns positional embeddings up to a fixed maximum size.\n    Padding ids are ignored by either offsetting based on padding_idx\n    or by setting padding_idx to None and ensuring that the appropriate\n    position ids are passed to the forward function.\n    \"\"\"\n\n    def __init__(\n        self, num_embeddings: int, embedding_dim: int, padding_idx: int,\n    ):\n        # if padding_idx is specified then offset the embedding ids by\n        # this index and adjust num_embeddings appropriately\n        assert padding_idx is not None\n        num_embeddings += padding_idx + 1  # WHY?\n        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)\n\n    def forward(self, input, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        if use_cache:  # the position is our current step in the decoded sequence\n            pos = int(self.padding_idx + input.size(1))\n            positions = input.data.new(1, 1).fill_(pos)\n        else:\n            positions = create_position_ids_from_input_ids(input, self.padding_idx)\n        return super().forward(positions)\n\n\ndef LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True):\n    if torch.cuda.is_available():\n        try:\n            from apex.normalization import FusedLayerNorm\n\n            return FusedLayerNorm(normalized_shape, eps, elementwise_affine)\n        except ImportError:\n            pass\n    return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)\n\n\ndef fill_with_neg_inf(t):\n    \"\"\"FP16-compatible function that fills a input_ids with -inf.\"\"\"\n    return t.float().fill_(float(\"-inf\")).type_as(t)\n\n\ndef _filter_out_falsey_values(tup) -> Tuple:\n    \"\"\"Remove entries that are None or [] from an iterable.\"\"\"\n    return tuple(x for x in tup if isinstance(x, torch.Tensor) or x)\n\n\n# Public API\ndef _get_shape(t):\n    return getattr(t, \"shape\", None)\n\n\n@add_start_docstrings(\n    \"The bare BART Model outputting raw hidden-states without any specific head on top.\", BART_START_DOCSTRING,\n)\nclass BartModel(PretrainedBartModel):\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        padding_idx, vocab_size = config.pad_token_id, config.vocab_size\n        self.shared = nn.Embedding(vocab_size, config.d_model, 
padding_idx)\n\n        self.encoder = BartEncoder(config, self.shared)\n        self.decoder = BartDecoder(config, self.shared)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        decoder_input_ids=None,\n        encoder_outputs: Optional[Tuple] = None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        use_cache=False,\n    ):\n\n        # make masks if user doesn't supply\n        if not use_cache:\n            decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(\n                self.config,\n                input_ids,\n                decoder_input_ids=decoder_input_ids,\n                decoder_padding_mask=decoder_attention_mask,\n                causal_mask_dtype=self.shared.weight.dtype,\n            )\n        else:\n            decoder_padding_mask, causal_mask = None, None\n\n        assert decoder_input_ids is not None\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)\n        assert isinstance(encoder_outputs, tuple)\n        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            encoder_outputs[0],\n            attention_mask,\n            decoder_padding_mask,\n            decoder_causal_mask=causal_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        # Attention and hidden_states will be [] or None if they aren't needed\n        decoder_outputs: Tuple = _filter_out_falsey_values(decoder_outputs)\n        assert isinstance(decoder_outputs[0], torch.Tensor)\n        encoder_outputs: Tuple = _filter_out_falsey_values(encoder_outputs)\n        return decoder_outputs + encoder_outputs\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, value):\n        self.shared = value\n        self.encoder.embed_tokens = self.shared\n        self.decoder.embed_tokens = self.shared\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"The BART Model with a language modeling head. 
Can be used for summarization.\",\n    BART_START_DOCSTRING + BART_GENERATION_EXAMPLE,\n)\nclass BartForConditionalGeneration(PretrainedBartModel):\n    base_model_prefix = \"model\"\n\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        base_model = BartModel(config)\n        self.model = base_model\n        self.register_buffer(\"final_logits_bias\", torch.zeros((1, self.model.shared.num_embeddings)))\n\n    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:\n        old_num_tokens = self.model.shared.num_embeddings\n        new_embeddings = super().resize_token_embeddings(new_num_tokens)\n        self.model.shared = new_embeddings\n        self._resize_final_logits_bias(new_num_tokens, old_num_tokens)\n        return new_embeddings\n\n    def _resize_final_logits_bias(self, new_num_tokens: int, old_num_tokens: int) -> None:\n        if new_num_tokens <= old_num_tokens:\n            new_bias = self.final_logits_bias[:, :new_num_tokens]\n        else:\n            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)\n            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)\n        self.register_buffer(\"final_logits_bias\", new_bias)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        lm_labels=None,\n        use_cache=False,\n        **unused\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring).\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens\n            with labels\n            in ``[0, ..., config.vocab_size]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            
heads.\n\n    Examples::\n\n            # Mask filling only works for bart-large\n            from transformers1 import BartTokenizer, BartForConditionalGeneration\n            tokenizer = BartTokenizer.from_pretrained('bart-large')\n            TXT = \"My friends are <mask> but they eat too many carbs.\"\n            model = BartForConditionalGeneration.from_pretrained('bart-large')\n            input_ids = tokenizer.batch_encode_plus([TXT], return_tensors='pt')['input_ids']\n            logits = model(input_ids)[0]\n            masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()\n            probs = logits[0, masked_index].softmax(dim=0)\n            values, predictions = probs.topk(5)\n            tokenizer.decode(predictions).split()\n            # ['good', 'great', 'all', 'really', 'very']\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            encoder_outputs=encoder_outputs,\n            decoder_attention_mask=decoder_attention_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        lm_logits = F.linear(outputs[0], self.model.shared.weight, bias=self.final_logits_bias)\n        outputs = (lm_logits,) + outputs[1:]  # Add cache, hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # TODO(SS): do we need to ignore pad tokens in lm_labels?\n            masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n    def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step, decoder_cached_states are empty\n        if not past[1]:\n            encoder_outputs, decoder_cached_states = past, None\n        else:\n            encoder_outputs, decoder_cached_states = past\n        return {\n            \"input_ids\": None,  # encoder_outputs is defined. 
input_ids not needed\n            \"encoder_outputs\": encoder_outputs,\n            \"decoder_cached_states\": decoder_cached_states,\n            \"decoder_input_ids\": decoder_input_ids,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,  # change this to avoid caching (presumably for debugging)\n        }\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        if cur_len == 1:\n            self._force_token_ids_generation(logits, self.config.bos_token_id)\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n\n    def _force_token_ids_generation(self, scores, token_ids) -> None:\n        \"\"\"force one of token_ids to be generated by setting prob of all other tokens to 0\"\"\"\n        if isinstance(token_ids, int):\n            token_ids = [token_ids]\n        all_but_token_ids_mask = torch.tensor(\n            [x for x in range(self.config.vocab_size) if x not in token_ids],\n            dtype=torch.long,\n            device=next(self.parameters()).device,\n        )\n        assert len(scores.shape) == 2, \"scores should be of rank 2 with shape: [batch_size, vocab_size]\"\n        scores[:, all_but_token_ids_mask] = -float(\"inf\")\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        ((enc_out, enc_mask), decoder_cached_states) = past\n        reordered_past = []\n        for layer_past in decoder_cached_states:\n            # get the correct batch idx from decoder layer's batch dim for cross and self-attn\n            layer_past_new = {\n                attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items()\n            }\n            reordered_past.append(layer_past_new)\n\n        new_enc_out = enc_out if enc_out is None else enc_out.index_select(0, beam_idx)\n        new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select(0, beam_idx)\n\n        past = ((new_enc_out, new_enc_mask), reordered_past)\n        return past\n\n    def get_encoder(self):\n        return self.model.encoder\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.model.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"\"\"Bart model with a sequence classification/head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BART_START_DOCSTRING,\n)\nclass BartForSequenceClassification(PretrainedBartModel):\n    def __init__(self, config: BartConfig, **kwargs):\n        super().__init__(config, **kwargs)\n        self.model = BartModel(config)\n        self.classification_head = BartClassificationHead(\n            config.d_model, config.d_model, config.num_labels, config.classif_dropout,\n        )\n        self.model._init_weights(self.classification_head.dense)\n        self.model._init_weights(self.classification_head.out_proj)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BartConfig`) and inputs:\n            loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n                Classification loss (cross entropy)\n            logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n                Attentions weights after the attention softmax, used to compute the weighted average in the\n                self-attention\n                heads.\n\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForSequenceClassification\n        import torch\n\n        tokenizer = BartTokenizer.from_pretrained('bart-large')\n        model = BartForSequenceClassification.from_pretrained('bart-large')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\",\n        add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            decoder_attention_mask=decoder_attention_mask,\n            encoder_outputs=encoder_outputs,\n        )\n        x = outputs[0]  # last hidden state\n        eos_mask = input_ids.eq(self.config.eos_token_id)\n        if 
len(torch.unique(eos_mask.sum(1))) > 1:\n            raise ValueError(\"All examples must have the same number of <eos> tokens.\")\n        sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]\n        logits = self.classification_head(sentence_representation)\n        # Prepend logits\n        outputs = (logits,) + outputs[1:]  # Add hidden states and attention if they are here\n        if labels is not None:  # prepend loss to output,\n            loss = F.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\nclass SinusoidalPositionalEmbedding(nn.Embedding):\n    \"\"\"This module produces sinusoidal positional embeddings of any length.\"\"\"\n\n    def __init__(self, num_positions, embedding_dim, padding_idx=None):\n        super().__init__(num_positions, embedding_dim)\n        if embedding_dim % 2 != 0:\n            raise NotImplementedError(f\"odd embedding_dim {embedding_dim} not supported\")\n        self.weight = self._init_weight(self.weight)\n\n    @staticmethod\n    def _init_weight(out: nn.Parameter):\n        \"\"\"Identical to the XLM create_sinusoidal_embeddings except features are not interleaved.\n            The cos features are in the 2nd half of the vector. [dim // 2:]\n        \"\"\"\n        n_pos, dim = out.shape\n        position_enc = np.array(\n            [[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)]\n        )\n        out[:, 0 : dim // 2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # This line breaks for odd n_pos\n        out[:, dim // 2 :] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n        out.detach_()\n        out.requires_grad = False\n        return out\n\n    @torch.no_grad()\n    def forward(self, input_ids, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        bsz, seq_len = input_ids.shape[:2]\n        if use_cache:\n            positions = input_ids.data.new(1, 1).fill_(seq_len - 1)  # called before slicing\n        else:\n            # starts at 0, ends at 1-seq_len\n            positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)\n        return super().forward(positions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_beam_search.py",
    "content": "# coding=utf-8\n# Copyright (c) 2019 Yang Liu\n\n# Permission is hereby granted, free of charge, to any person obtaining a copy\n# of this software and associated documentation files (the \"Software\"), to deal\n# in the Software without restriction, including without limitation the rights\n# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n# copies of the Software, and to permit persons to whom the Software is\n# furnished to do so, subject to the following conditions:\n\n# The above copyright notice and this permission notice shall be included in all\n# copies or substantial portions of the Software.\n\n# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n# SOFTWARE.\n\"\"\"\nA general wrapper around models with LM heads to generate sequences\nusing beam search.\n\"\"\"\nimport torch\nfrom torch import nn\n\n\nclass TransformerBeamSearch(nn.Module):\n    def __init__(\n        self,\n        model,\n        tokenizer,\n        batch_size,\n        beam_size,\n        min_length,\n        max_length,\n        alpha=0,\n        block_repeating_trigram=True,\n    ):\n        \"\"\"\n        Attributes:\n            mask_word_id: token id that corresponds to the mask\n        \"\"\"\n        super(TransformerBeamSearch, self).__init__()\n        self.model = model\n        self.tokenizer = tokenizer\n\n        self.start_token_id = tokenizer.start_token_id\n        self.end_token_id = tokenizer.end_token_id\n        self.pad_token_id = tokenizer.pad_token_id\n\n        self.beam_size = beam_size\n        self.min_length = min_length\n        self.max_length = max_length\n\n        self.block_repeating_trigram = block_repeating_trigram\n        self.apply_length_penalty = False if alpha == 0 else True\n        self.alpha = alpha\n\n        # State of the beam\n        self.hypotheses = [[] for _ in range(batch_size)]\n        self.batch_offset = torch.arange(batch_size, dtype=torch.long)\n        self.beam_offset = torch.arange(\n            0, batch_size * self.beam_size, step=self.beam_size, dtype=torch.long\n        )\n        self.growing_beam = torch.full(\n            (batch_size * self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        self.topk_log_probabilities = torch.tensor(\n            [0.0] + [float(\"-inf\")] * (self.beam_size - 1), dtype=torch.float\n        ).repeat(batch_size)\n        self.results = {\n            \"prediction\": [[] for _ in batch_size],\n            \"scores\": [[] for _ in batch_size],\n        }\n        self._step = 0\n        self.is_done = False\n\n    def step(self, log_probabilities):\n        \"\"\" Grows the beam by one step. 
      \"\"\"\n        self._step += 1\n\n        # The batch size changes as some beams finish so we define _B\n        vocab_size = log_probabilities.size(-1)\n        _B = log_probabilities.size(0) // self.beam_size\n\n        # Multiply each beam probability with the probability of the\n        # next token (conditioned on the words in the beam).\n        log_probabilities += self.topk_log_probabilities.view(-1, 1)\n\n        self.enforce_min_length(log_probabilities)\n        if self.block_repeating_trigram:\n            self.remove_repeating_trigrams(log_probabilities, _B)\n\n        # Find the `beam_size` (previous_beam + token) combinations with\n        # the highest score\n        topk_log_probabilities, topk_ids = log_probabilities.view(\n            _B, self.beam_size * vocab_size\n        ).topk(self.beam_size, dim=1)\n\n        # Apply the length penalty. The +1 accounts for the [EOS] token\n        # that will be added if the beam ends.\n        topk_scores = topk_log_probabilities / self.length_penalty()\n\n        # Retrieve the corresponding respective beam and token id\n        # topk_token_ids[i] will be added to topk_beam_ids[i]\n        topk_beam_ids = topk_ids.div(vocab_size)\n        topk_token_ids = topk_ids.fmod(vocab_size)\n\n        # Retrieve the row index of the surviving beams in the original\n        # view of the log_probabilities tensor\n        surviving_beams_rows = (topk_beam_ids + self.beam_offset[:_B].view(-1, 1)).view(\n            -1\n        )\n\n        # Append the last predictions\n        self.growing_beam = torch.cat(\n            [\n                self.growing_beam.index_select(0, surviving_beams_rows),\n                topk_token_ids.view(-1, 1),\n            ],\n            1,\n        )\n\n        # Check if any of the beam searches has ended during this\n        # growth step. 
Also if top beam (most probable) has ended\n        # for one element of the batch.\n        is_finished = topk_token_ids.eq(self.end_token_id)\n        self.enforce_max_length()\n        is_top_beam_finished = is_finished[:, 0].eq(1)\n\n        # Save the finished searches\n        if is_finished.any():\n            predictions = self.growing_beam.view(\n                -1, self.beam_size, self.growing_beam.size(1)\n            )\n            for i in range(is_finished.size(0)):\n                if is_top_beam_finished[i]:\n                    is_finished[i].fill_(1)\n                finished_hyp = is_finished[i].nonzero().view(-1)\n\n                # Store finished hypotheses for this batch.\n                b = self.batch_offset[i]\n                for j in finished_hyp:\n                    self.hypotheses[b].append((topk_scores[i, j], predictions[i, j, :]))\n\n                # If the batch reached the end, save the best hypotheses\n                # in terms of length-penalized score.\n                if is_top_beam_finished[i]:\n                    best_hyp = sorted(\n                        self.hypotheses[b], key=lambda x: x[0], reverse=True\n                    )\n                    best_score, best_prediction = best_hyp[0]\n                    self.results[\"scores\"][b].append(best_score)\n                    self.results[\"predictions\"][b].append(best_prediction)\n\n            non_finished = is_top_beam_finished.eq(0).nonzero().view(-1)\n            if len(non_finished) == 0:\n                self.is_done = True\n\n            # Remove finished batches for the next step.\n            topk_log_probabilities = topk_log_probabilities.index_select(\n                0, non_finished\n            )\n            self.batch_offset = self.batch_offset.index_select(0, non_finished)\n            self.growing_beam = predictions.index_select(0, non_finished).view(\n                -1, self.growing_beam.size(-1)\n            )\n\n            surviving_beams_rows = surviving_beams_rows.index_select(0, non_finished)\n\n        return surviving_beams_rows\n\n    def forward(self, encoder_input_ids, **kwargs):\n        # keyword arguments come in 3 flavors: encoder-specific (prefixed by\n        # `encoder_`), decoder-specific (prefixed by `decoder_`) and those\n        # that apply to the model as whole.\n        # We let the specific kwargs override the common ones in case of conflict.\n        kwargs_encoder = {\n            argument[len(\"encoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"encoder_\")\n        }\n        kwargs_decoder = {\n            argument[len(\"decoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"decoder_\")\n        }\n        kwargs_common = {\n            argument: value\n            for argument, value in kwargs.items()\n            if not (argument.startswith(\"encoder_\") or argument.startswith(\"decoder_\"))\n        }\n        kwargs_decoder = dict(kwargs_common, **kwargs_decoder)\n        kwargs_encoder = dict(kwargs_common, **kwargs_encoder)\n\n        # forward pass on the encoder\n        encoder_outputs = self.model.encoder.forward(encoder_input_ids, kwargs_encoder)\n        kwargs_decoder[\"encoder_hidden_states\"] = tile(\n            encoder_outputs, self.beam_size, dim=0\n        )\n\n        # grow the beam by generating sequences in an autoregressive way\n        self.growing_beam = torch.full(\n            (self.batch_size * 
self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        for step in range(self.max_length):\n            decoder_input = self.growing_beam[:, -1]\n            outputs = self.model.decoder(decoder_input, **kwargs_decoder)\n            log_probabilities = torch.nn.functional.log_softmax(outputs[1], dim=-1)\n            surviving_beams_rows = self.step(log_probabilities)\n            if self.is_done:\n                break\n\n            kwargs_decoder[\"encoder_hidden_states\"] = kwargs_decoder[\n                \"encoder_hidden_states\"\n            ].index_select(0, surviving_beams_rows)\n\n        return self.results\n\n    def remove_repeating_trigrams(self, log_probabilities, _B):\n        if self._step + 1 > 3:\n            for i in range(_B * self.beam_size):\n                tokens = [t for t in self.growing_beam[i]]\n                trigrams = [(tokens[j - 1], tokens[j], tokens[j + 1]) for j in range(1, len(tokens) - 1)]\n                last_trigram = tuple(trigrams[-1])\n                # Penalize beams whose newest trigram already appeared earlier in the sequence\n                if last_trigram in trigrams[:-1]:\n                    log_probabilities[i] = -1e20\n\n    def enforce_min_length(self, log_probabilities):\n        # Forbid the [EOS] token until the minimum length has been generated\n        if self._step < self.min_length:\n            log_probabilities[:, self.end_token_id] = -1e20\n\n    def enforce_max_length(self, is_finished):\n        # Mark every beam as finished once the maximum length has been reached\n        if self._step + 1 == self.max_length:\n            is_finished.fill_(1)\n\n    def length_penalty(self):\n        return ((5.0 + (self._step + 1)) / 6.0) ** self.alpha\n\n\ndef tile(x, count, dim=0):\n    \"\"\"\n    Tiles `x` along dimension `dim` `count` times.\n\n    Example:\n        >> ex = torch.tensor([[1, 2], [3, 4]])\n        >> tile(ex, 2, 0)\n        torch.Tensor([[1,2],[1,2],[3,4],[3,4]])\n    \"\"\"\n    perm = list(range(len(x.size())))\n    if dim != 0:\n        perm[0], perm[dim] = perm[dim], perm[0]\n        x = x.permute(perm).contiguous()\n    out_size = list(x.size())\n    out_size[0] *= count\n    batch = x.size(0)\n    x = (\n        x.view(batch, -1)\n        .transpose(0, 1)\n        .repeat(count, 1)\n        .transpose(0, 1)\n        .contiguous()\n        .view(*out_size)\n    )\n    if dim != 0:\n        x = x.permute(perm).contiguous()\n    return x\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BERT model. \"\"\"\n\n\nimport logging\nimport math\nimport os\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import gelu, gelu_new, swish\nfrom .configuration_bert import BertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"bert-base-german-dbmdz-cased\",\n    \"bert-base-german-dbmdz-uncased\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\"gelu\": gelu, \"relu\": torch.nn.functional.relu, \"swish\": swish, \"gelu_new\": gelu_new, \"mish\": mish}\n\n\nBertLayerNorm = torch.nn.LayerNorm\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any 
TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n\n        seq_length = input_shape[1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = 
self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        
head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass BertEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        
all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        
seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, BertLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass BertModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = BertEncoder(config)\n        self.pooler = BertPooler(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n 
       # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # 
(loss), prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass BertForMaskedLM(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n     
       from transformers1 import BertTokenizer, BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        if lm_labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            lm_labels = lm_labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            ltr_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (ltr_lm_loss,) + outputs\n\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    
\"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\", BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')\n\n        loss, logits = model(**encoding, next_sentence_label=torch.LongTensor([1]))\n        assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            
head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_score = self.cls(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import 
torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        labels = torch.tensor(0) # choice0 is correct (according to Wikipedia ;))\n\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='pt', pad_to_max_length=True)\n        outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1\n\n        # the linear classifier still needs to be trained\n        loss, logits = outputs[:2]\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # 
Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2019 Inria, Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch CamemBERT model. \"\"\"\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"camembert-base\",\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the\n            configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD\n    (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits` \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForQuestionAnswering(RobertaForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size, dtype):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(\n        torch.arange(position, dtype=dtype).unsqueeze(1),\n        torch.arange(d_model_size, dtype=dtype).unsqueeze(0),\n        d_model_size,\n    )\n\n    sines = torch.sin(angle_rads[:, 0::2])\n    cosines = torch.cos(angle_rads[:, 1::2])\n\n    pos_encoding = torch.cat([sines, cosines], dim=-1)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = torch.matmul(q, k.permute(0, 1, 3, 2))\n\n    dk = k.shape[-1]\n    scaled_attention_logits = matmul_qk / np.sqrt(dk)\n\n    if mask is not None:\n        nd, ns = scaled_attention_logits.size(-2), scaled_attention_logits.size(-1)\n        scaled_attention_logits += mask[ns - nd : ns, :ns] * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = torch.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass MultiHeadAttention(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, output_attentions=False):\n        super().__init__()\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wk = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wv = torch.nn.Linear(d_model_size, d_model_size)\n\n        self.dense = torch.nn.Linear(d_model_size, d_model_size)\n\n    def split_into_heads(self, x, batch_size):\n        x = x.reshape(batch_size, -1, self.num_heads, self.depth)\n        return x.permute([0, 2, 1, 3])\n\n    def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        batch_size = q.shape[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0], layer_past[1]\n            k = torch.cat((past_key, k), dim=-2)\n            v = torch.cat((past_value, v), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((k, v))\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = output[0].permute([0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff):\n    return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff), torch.nn.ReLU(), torch.nn.Linear(dff, d_model_size))\n\n\nclass EncoderLayer(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):\n        super().__init__()\n\n        self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff)\n\n        self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n        self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n\n        self.dropout1 = torch.nn.Dropout(rate)\n        self.dropout2 = torch.nn.Dropout(rate)\n\n    def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            normed,\n            normed,\n            normed,\n            mask,\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\nclass CTRLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            
module.weight.data.fill_(1.0)\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should be passed as input_ids.\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The input_ids which have their past given to this model should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)\n\n        self.w = nn.Embedding(config.vocab_size, config.n_embd)\n\n        self.dropout = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList(\n            [\n                EncoderLayer(config.n_embd, config.n_head, config.dff, config.resid_pdrop, config.output_attentions)\n                for _ in range(config.n_layer)\n            ]\n        )\n        self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def set_input_embeddings(self, new_embeddings):\n        self.w = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to 
speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import CTRLTokenizer, CTRLModel\n        import torch\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLModel.from_pretrained('ctrl')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these 
entirely.\n            attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n            token_type_embeds = self.w(token_type_ids)\n            token_type_embeds *= np.sqrt(self.d_model_size)\n        else:\n            token_type_embeds = 0\n        position_ids = position_ids.view(-1, input_shape[-1])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids)\n        # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded\n        seq_len = input_shape[-1]\n        mask = torch.triu(torch.ones(seq_len + past_length, seq_len + past_length), 1).to(inputs_embeds.device)\n\n        inputs_embeds *= np.sqrt(self.d_model_size)\n\n        pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states)\n\n        output_shape = input_shape + (inputs_embeds.size(-1),)\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n            outputs = h(\n                hidden_states,\n                mask,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = hidden_states.view(*output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLLMHeadModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = CTRLModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import CTRLTokenizer, CTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch DistilBERT model\n    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)\n    and in part from HuggingFace PyTorch version of Google AI Bert model (https://github.com/google-research/bert)\n\"\"\"\n\n\nimport copy\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n\nDISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-german-cased\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\nclass Embeddings(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)\n        if config.sinusoidal_pos_embds:\n            create_sinusoidal_embeddings(\n                n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight\n            )\n\n        self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(self, input_ids):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: torch.tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: torch.tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        seq_length = input_ids.size(1)\n        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)  # (max_seq_length)\n        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (bs, max_seq_length)\n\n        word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, 
dim)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings)  # (bs, max_seq_length, dim)\n        return embeddings\n\n\nclass MultiHeadSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = nn.Dropout(p=config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, query, key, value, mask, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        query: torch.tensor(bs, seq_length, dim)\n        key: torch.tensor(bs, seq_length, dim)\n        value: torch.tensor(bs, seq_length, dim)\n        mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: torch.tensor(bs, seq_length, dim)\n            Contextualized layer. 
Optional: only if `output_attentions=True`\n        \"\"\"\n        bs, q_length, dim = query.size()\n        k_length = key.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshp = (bs, 1, 1, k_length)\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)\n        mask = (mask == 0).view(mask_reshp).expand_as(scores)  # (bs, n_heads, q_length, k_length)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, q_length, k_length)\n\n        weights = nn.Softmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)\n        weights = self.dropout(weights)  # (bs, n_heads, q_length, k_length)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, weights)\n        else:\n            return (context,)\n\n\nclass FFN(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)\n        self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = gelu if config.activation == \"gelu\" else nn.ReLU()\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x)\n        return x\n\n\nclass TransformerBlock(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = MultiHeadSelfAttention(config)\n        self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n        self.ffn = FFN(config)\n        self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n        attn_mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: torch.tensor(bs, seq_length, dim)\n            The output of the transformer block 
contextualization.\n        \"\"\"\n        # Self-Attention\n        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask, head_mask=head_mask)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass Transformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        layer = TransformerBlock(config)\n        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: torch.tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: torch.tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module(x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i])\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass DistilBertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained 
models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    load_tf_weights = None\n    base_model_prefix = \"distilbert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, nn.Embedding):\n            if module.weight.requires_grad:\n                module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.DistilBertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputting raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertModel(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = Embeddings(config)  # Embeddings\n        self.transformer = Transformer(config)  # Encoder\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings.word_embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.transformer.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertModel\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertModel.from_pretrained('distilbert-base-cased')\n\n        
input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)  # (bs, seq_length)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)\n        hidden_state = tfmr_output[0]\n        output = (hidden_state,) + tfmr_output[1:]\n\n        return output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForMaskedLM(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.distilbert = DistilBertModel(config)\n        self.vocab_transform = nn.Linear(config.dim, config.dim)\n        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)\n\n        self.init_weights()\n\n        self.mlm_loss_fct = nn.CrossEntropyLoss()\n\n    def get_output_embeddings(self):\n        return self.vocab_projector\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, masked_lm_labels=None):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForMaskedLM\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        dlbrt_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = dlbrt_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = gelu(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)\n\n        outputs = (prediction_logits,) + dlbrt_output[1:]\n        if masked_lm_labels is not None:\n            mlm_loss = self.mlm_loss_fct(\n                prediction_logits.view(-1, prediction_logits.size(-1)), masked_lm_labels.view(-1)\n            )\n            outputs = (mlm_loss,) + outputs\n\n        return outputs  # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForSequenceClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.pre_classifier = nn.Linear(config.dim, config.dim)\n        self.classifier = nn.Linear(config.dim, config.num_labels)\n        self.dropout = nn.Dropout(config.seq_classif_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForSequenceClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = 
nn.ReLU()(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output)  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        if labels is not None:\n            if self.num_labels == 1:\n                loss_fct = nn.MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = nn.CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForQuestionAnswering(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.distilbert = DistilBertModel(config)\n        self.qa_outputs = nn.Linear(config.dim, config.num_labels)\n        assert config.num_labels == 2\n        self.dropout = nn.Dropout(config.qa_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForQuestionAnswering\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss, start_scores, end_scores = outputs[:3]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n\n        hidden_states = self.dropout(hidden_states)  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)  # (bs, max_query_len)\n        end_logits = end_logits.squeeze(-1)  # (bs, max_query_len)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForTokenClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForTokenClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.distilbert(\n            input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = 
torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_electra.py",
    "content": "import logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import get_activation\nfrom .configuration_electra import ElectraConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertEncoder, BertLayerNorm, BertPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\ndef load_tf_weights_in_electra(model, config, tf_checkpoint_path, discriminator_or_generator=\"discriminator\"):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n    for name, array in zip(names, arrays):\n        original_name: str = name\n\n        try:\n            if isinstance(model, ElectraForMaskedLM):\n                name = name.replace(\"electra/embeddings/\", \"generator/embeddings/\")\n\n            if discriminator_or_generator == \"generator\":\n                name = name.replace(\"electra/\", \"discriminator/\")\n                name = name.replace(\"generator/\", \"electra/\")\n\n            name = name.replace(\"dense_1\", \"dense_prediction\")\n            name = name.replace(\"generator_predictions/output_bias\", \"generator_lm_head/bias\")\n\n            name = name.split(\"/\")\n            # print(original_name, name)\n            # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n            # which are not required for using pretrained model\n            if any(n in [\"global_step\", \"temperature\"] for n in name):\n                logger.info(\"Skipping {}\".format(original_name))\n                continue\n            pointer = model\n            for m_name in name:\n                if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                    scope_names = re.split(r\"_(\\d+)\", m_name)\n                else:\n                    scope_names = [m_name]\n                if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                    pointer = getattr(pointer, \"weight\")\n                elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                    pointer = getattr(pointer, \"bias\")\n                elif scope_names[0] == \"output_weights\":\n                    pointer = getattr(pointer, \"weight\")\n                elif 
scope_names[0] == \"squad\":\n                    pointer = getattr(pointer, \"classifier\")\n                else:\n                    pointer = getattr(pointer, scope_names[0])\n                if len(scope_names) >= 2:\n                    num = int(scope_names[1])\n                    pointer = pointer[num]\n            if m_name.endswith(\"_embeddings\"):\n                pointer = getattr(pointer, \"weight\")\n            elif m_name == \"kernel\":\n                array = np.transpose(array)\n            try:\n                assert pointer.shape == array.shape, original_name\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            print(\"Initialize PyTorch weight {}\".format(name), original_name)\n            pointer.data = torch.from_numpy(array)\n        except AttributeError as e:\n            print(\"Skipping {}\".format(original_name), name, e)\n            continue\n    return model\n\n\nclass ElectraEmbeddings(BertEmbeddings):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass ElectraDiscriminatorPredictions(nn.Module):\n    \"\"\"Prediction module for the discriminator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dense_prediction = nn.Linear(config.hidden_size, 1)\n        self.config = config\n\n    def forward(self, discriminator_hidden_states, attention_mask):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = get_activation(self.config.hidden_act)(hidden_states)\n        logits = self.dense_prediction(hidden_states).squeeze()\n\n        return logits\n\n\nclass ElectraGeneratorPredictions(nn.Module):\n    \"\"\"Prediction module for the generator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = BertLayerNorm(config.embedding_size)\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n\n    def forward(self, generator_hidden_states):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = get_activation(\"gelu\")(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass ElectraPreTrainedModel(BertPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ElectraConfig\n    load_tf_weights = load_tf_weights_in_electra\n    base_model_prefix = \"electra\"\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular 
PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. 
Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraModel(ElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.embeddings = ElectraEmbeddings(config)\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = nn.Linear(config.embedding_size, config.hidden_size)\n\n        self.encoder = BertEncoder(config)\n        self.config = config\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import ElectraModel, ElectraTokenizer\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraModel.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        hidden_states = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states)\n\n        hidden_states = self.encoder(hidden_states, attention_mask=extended_attention_mask, head_mask=head_mask)\n\n        return hidden_states\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"ELECTRA Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForSequenceClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.electra = ElectraModel(config)\n        self.classifier = ElectraClassificationHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n\n        sequence_output = discriminator_hidden_states[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + discriminator_hidden_states[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n         
       #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a binary classification head on top as used during pre-training for identifying generated\n    tokens.\n\n    It is recommended to load the discriminator checkpoint into that model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForPreTraining(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.discriminator_predictions = ElectraDiscriminatorPredictions(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates the token is an original token,\n            ``1`` indicates the token was replaced.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss of the ELECTRA objective.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`)\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForPreTraining\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = 
outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.BCEWithLogitsLoss()\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1, discriminator_sequence_output.shape[1]) == 1\n                active_logits = logits.view(-1, discriminator_sequence_output.shape[1])[active_loss]\n                active_labels = labels[active_loss]\n                loss = loss_fct(active_logits, active_labels.float())\n            else:\n                loss = loss_fct(logits.view(-1, discriminator_sequence_output.shape[1]), labels.float())\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a language modeling head on top.\n\n    Even though both the discriminator and generator may be loaded into this model, the generator is\n    the only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForMaskedLM(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.generator_predictions = ElectraGeneratorPredictions(config)\n\n        self.generator_lm_head = nn.Linear(config.embedding_size, config.vocab_size)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape 
:obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import ElectraTokenizer, ElectraForMaskedLM\n            import torch\n\n            tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n            model = ElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        generator_sequence_output = generator_hidden_states[0]\n\n        prediction_scores = self.generator_predictions(generator_sequence_output)\n        prediction_scores = self.generator_lm_head(prediction_scores)\n\n        output = (prediction_scores,)\n\n        # Masked language modeling softmax layer\n        if masked_lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()  # -100 index = padding token\n            loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            output = (loss,) + output\n\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a token classification head on top.\n\n    Both the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForTokenClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape 
:obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForTokenClassification\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.config.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\nimport logging\nfrom typing import Optional\n\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderModel(PreTrainedModel):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoder` is a generic model class that will be\n        instantiated as a transformer architecture with one of the base model\n        classes of the library as encoder and another one as\n        decoder when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method for the encoder and `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` class method for the decoder.\n    \"\"\"\n    config_class = EncoderDecoderConfig\n    base_model_prefix = \"encoder_decoder\"\n\n    def __init__(\n        self,\n        config: Optional[PretrainedConfig] = None,\n        encoder: Optional[PreTrainedModel] = None,\n        decoder: Optional[PreTrainedModel] = None,\n    ):\n        assert config is not None or (\n            encoder is not None and decoder is not None\n        ), \"Either a configuration or an Encoder and a decoder has to be provided\"\n        if config is None:\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config)\n        else:\n            assert isinstance(config, self.config_class), \"config: {} has to be of type {}\".format(\n                config, self.config_class\n            )\n        # initialize with config\n        super().__init__(config)\n\n        if encoder is None:\n            from transformers import AutoModel\n\n            encoder = AutoModel.from_config(config.encoder)\n\n        if decoder is None:\n            from transformers import AutoModelWithLMHead\n\n            decoder = AutoModelWithLMHead.from_config(config.decoder)\n\n        self.encoder = encoder\n        self.decoder = decoder\n        assert (\n            self.encoder.get_output_embeddings() is None\n        ), \"The encoder {} should not have a LM Head. 
Please use a model without LM Head\"\n\n    def tie_weights(self):\n        # for now no weights tying in encoder-decoder\n        pass\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def get_input_embeddings(self):\n        return self.encoder.get_input_embeddings()\n\n    def get_output_embeddings(self):\n        return self.decoder.get_output_embeddings()\n\n    @classmethod\n    def from_encoder_decoder_pretrained(\n        cls,\n        encoder_pretrained_model_name_or_path: str = None,\n        decoder_pretrained_model_name_or_path: str = None,\n        *model_args,\n        **kwargs\n    ) -> PreTrainedModel:\n        r\"\"\" Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints.\n\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated).\n        To train the model, you need to first set it back in training mode with `model.train()`.\n\n        Params:\n            encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the encoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/encoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the decoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/decoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments.\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n        Examples::\n\n            from transformers1 import EncoderDecoder\n\n            model = EncoderDecoder.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n        \"\"\"\n\n        kwargs_encoder = {\n            argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")\n        }\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        # Load and initialize the encoder and decoder\n        # The distinction between encoder and decoder at the model level is made\n        # by the value of the flag `is_decoder` that we need to set correctly.\n        encoder = kwargs_encoder.pop(\"model\", None)\n        if encoder is None:\n            assert (\n                encoder_pretrained_model_name_or_path is not None\n            ), \"If `model` is not defined as an argument, a `encoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModel\n\n            encoder = AutoModel.from_pretrained(encoder_pretrained_model_name_or_path, *model_args, **kwargs_encoder)\n        encoder.config.is_decoder = False\n\n        decoder = kwargs_decoder.pop(\"model\", None)\n        if decoder is None:\n            assert (\n                decoder_pretrained_model_name_or_path is not None\n            ), \"If `decoder_model` is not defined as an argument, a `decoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModelWithLMHead\n\n            if \"config\" not in kwargs_decoder:\n                from transformers import AutoConfig\n\n                decoder_config = AutoConfig.from_pretrained(decoder_pretrained_model_name_or_path)\n                if decoder_config.is_decoder is False:\n                    logger.info(\n                        f\"Initializing {decoder_pretrained_model_name_or_path} as a decoder model. Cross attention layers are added to {decoder_pretrained_model_name_or_path} and randomly initialized if {decoder_pretrained_model_name_or_path}'s architecture allows for cross attention layers.\"\n                    )\n                    decoder_config.is_decoder = True\n\n                kwargs_decoder[\"config\"] = decoder_config\n\n            if kwargs_decoder[\"config\"].is_decoder is False:\n                logger.warning(\n                    f\"Decoder model {decoder_pretrained_model_name_or_path} is not initialized as a decoder. 
In order to initialize {decoder_pretrained_model_name_or_path} as a decoder, make sure that the attribute `is_decoder` of `decoder_config` passed to `.from_encoder_decoder_pretrained(...)` is set to `True` or do not pass a `decoder_config` to `.from_encoder_decoder_pretrained(...)`\"\n                )\n\n            decoder = AutoModelWithLMHead.from_pretrained(decoder_pretrained_model_name_or_path, **kwargs_decoder)\n\n        return cls(encoder=encoder, decoder=decoder)\n\n    def forward(\n        self,\n        input_ids=None,\n        inputs_embeds=None,\n        attention_mask=None,\n        head_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_head_mask=None,\n        decoder_inputs_embeds=None,\n        masked_lm_labels=None,\n        lm_labels=None,\n        **kwargs,\n    ):\n\n        \"\"\"\n        Args:\n            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n                Indices of input sequence tokens in the vocabulary for the encoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Mask to avoid performing attention on padding token indices for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n                Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n                `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n                Used in the cross-attention of the decoder.\n            decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n                Provide for sequence to sequence training to the decoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for 
details.\n            decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n                Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            decoder_head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the decoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the masked language modeling loss for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the left-to-right language modeling loss (next word prediction) for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            kwargs: (`optional`) Remaining dictionary of keyword arguments. 
Keyword arguments come in two flavors:\n                - Without a prefix which will be input as `**encoder_kwargs` for the encoder forward function.\n                - With a `decoder_` prefix which will be input as `**decoder_kwargs` for the decoder forward function.\n\n        Examples::\n\n            from transformers1 import EncoderDecoderModel, BertTokenizer\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n\n            # forward\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n\n            # training\n            loss, outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)[:2]\n\n            # generation\n            generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)\n\n        \"\"\"\n\n        kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith(\"decoder_\")}\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids,\n                attention_mask=attention_mask,\n                inputs_embeds=inputs_embeds,\n                head_mask=head_mask,\n                **kwargs_encoder,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            inputs_embeds=decoder_inputs_embeds,\n            attention_mask=decoder_attention_mask,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=decoder_head_mask,\n            lm_labels=lm_labels,\n            masked_lm_labels=masked_lm_labels,\n            **kwargs_decoder,\n        )\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if type(past) is tuple:\n            encoder_outputs = past\n        else:\n            encoder_outputs = (past,)\n\n        decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids)\n\n        return {\n            \"attention_mask\": attention_mask,\n            \"decoder_attention_mask\": decoder_inputs[\"attention_mask\"],\n            \"decoder_input_ids\": decoder_inputs[\"input_ids\"],\n            \"encoder_outputs\": encoder_outputs,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # as a default encoder-decoder models do not re-order the past.\n        # TODO(PVP): might have to be updated, e.g. if GPT2 is to be used as a decoder\n        return past\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Flaubert model, based on XLM. \"\"\"\n\n\nimport logging\nimport random\n\nimport torch\nfrom torch.nn import functional as F\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_xlm import (\n    XLMForQuestionAnswering,\n    XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n    get_masks,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"flaubert/flaubert_small_cased\",\n    \"flaubert/flaubert_base_uncased\",\n    \"flaubert/flaubert_base_cased\",\n    \"flaubert/flaubert_large_cased\",\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertModel(XLMModel):\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    @add_start_docstrings_to_callable(FLAUBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            
of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import FlaubertTokenizer, FlaubertModel\n        import torch\n\n        tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')\n        model = FlaubertModel.from_pretrained('flaubert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Le chat mange une pomme.\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # removed: src_enc=None, src_len=None\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + 
self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.config.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](tensor_normalized, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertWithLMHeadModel(XLMWithLMHeadModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMWithLMHeadModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForSequenceClassification(XLMForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnsweringSimple`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnswering(XLMForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT-2 model.\"\"\"\n\n\nimport logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import ACT2FN\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import re\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(gpt2_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array.squeeze())\n\n    for name, array in zip(names, arrays):\n        name = name[6:]  # skip \"model/\"\n        name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"w\" or scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"wpe\" or scope_names[0] == \"wte\":\n                pointer = getattr(pointer, scope_names[0])\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        
pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\n            \"bias\", torch.tril(torch.ones((n_ctx, n_ctx), dtype=torch.uint8)).view(1, 1, n_ctx, n_ctx)\n        )\n        self.register_buffer(\"masked_bias\", torch.tensor(-1e4))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / (float(v.size(-1)) ** 0.5)\n        nd, ns = w.size(-2), w.size(-1)\n        mask = self.bias[:, :, ns - nd : ns, :ns]\n        w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)  # (batch, head, head_features, seq_length)\n        else:\n            return x.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)\n\n    def forward(self, x, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0].transpose(-2, -1), layer_past[1]  # transpose back cf below\n            key = torch.cat((past_key, key), dim=-1)\n            value = torch.cat((past_value, value), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((key.transpose(-2, -1), value))  # transpose to have same shapes for stacking\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT2FN[config.activation_function]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n\n    def forward(self, x, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        output_attn = self.attn(\n            self.ln_1(x),\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n\n        x = x + a\n        m = self.mlp(self.ln_2(x))\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\nclass GPT2PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    load_tf_weights = load_tf_weights_in_gpt2\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nGPT2_START_DOCSTRING = 
r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The `input_ids` which have their past given to this model should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`, defaults to :obj:`None`):\n            `input_ids_length` = `sequence_length if `past` is None else 1\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2Model(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.wte = nn.Embedding(config.vocab_size, config.n_embd)\n        self.wpe = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n        self.ln_f = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def set_input_embeddings(self, new_embeddings):\n        self.wte = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            If `past` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when 
``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import GPT2Tokenizer, GPT2Model\n        import torch\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2Model.from_pretrained('gpt2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n        if position_ids is not None:\n            position_ids = position_ids.view(-1, input_shape[-1])\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the 
raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids)\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_embeds = self.wte(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n            outputs = block(\n                hidden_states,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = hidden_states.view(*output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (presents), (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2LMHeadModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2DoubleHeadsModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        config.num_labels = 1\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. 
you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = 
[tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2\n        mc_token_ids = torch.tensor([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch Longformer model. \"\"\"\n\nimport logging\nimport math\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .configuration_longformer import LongformerConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertPreTrainedModel\nfrom .modeling_roberta import RobertaLMHead, RobertaModel\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n    # See all Longformer models at https://huggingface.co/models?filter=longformer\n]\n\n\ndef _get_question_end_index(input_ids, sep_token_id):\n    \"\"\"\n        Computes the index of the first occurance of `sep_token_id`.\n    \"\"\"\n\n    sep_token_indices = (input_ids == sep_token_id).nonzero()\n    batch_size = input_ids.shape[0]\n\n    assert sep_token_indices.shape[1] == 2, \"`input_ids` should have two dimensions\"\n    assert (\n        sep_token_indices.shape[0] == 3 * batch_size\n    ), f\"There should be exactly three separator tokens: {sep_token_id} in every sample for questions answering. 
You might also consider to set `global_attention_mask` manually in the forward function to avoid this error.\"\n\n    return sep_token_indices.view(batch_size, 3, 2)[:, 0, 1]\n\n\ndef _compute_global_attention_mask(input_ids, sep_token_id, before_sep_token=True):\n    \"\"\"\n        Computes global attention mask by putting attention on all tokens\n        before `sep_token_id` if `before_sep_token is True` else after\n        `sep_token_id`.\n    \"\"\"\n\n    question_end_index = _get_question_end_index(input_ids, sep_token_id)\n    question_end_index = question_end_index.unsqueeze(dim=1)  # size: batch_size x 1\n    # bool attention mask with True in locations of global attention\n    attention_mask = torch.arange(input_ids.shape[1], device=input_ids.device)\n    if before_sep_token is True:\n        attention_mask = (attention_mask.expand_as(input_ids) < question_end_index).to(torch.uint8)\n    else:\n        # last token is separation token and should not be counted and in the middle are two separation tokens\n        attention_mask = (attention_mask.expand_as(input_ids) > (question_end_index + 1)).to(torch.uint8) * (\n            attention_mask.expand_as(input_ids) < input_ids.shape[-1]\n        ).to(torch.uint8)\n\n    return attention_mask\n\n\nclass LongformerSelfAttention(nn.Module):\n    def __init__(self, config, layer_id):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n        self.num_heads = config.num_attention_heads\n        self.head_dim = int(config.hidden_size / config.num_attention_heads)\n        self.embed_dim = config.hidden_size\n\n        self.query = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value = nn.Linear(config.hidden_size, self.embed_dim)\n\n        # separate projection layers for tokens with global attention\n        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)\n\n        self.dropout = config.attention_probs_dropout_prob\n\n        self.layer_id = layer_id\n        attention_window = config.attention_window[self.layer_id]\n        assert (\n            attention_window % 2 == 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}\"\n        assert (\n            attention_window > 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be positive. 
Given {attention_window}\"\n\n        self.one_sided_attention_window_size = attention_window // 2\n\n    @staticmethod\n    def _skew(x, direction):\n        \"\"\"Convert diagonals into columns (or columns into diagonals depending on `direction`\"\"\"\n        x_padded = F.pad(x, direction)  # padding value is not important because it will be overwritten\n        x_padded = x_padded.view(*x_padded.size()[:-2], x_padded.size(-1), x_padded.size(-2))\n        return x_padded\n\n    @staticmethod\n    def _skew2(x):\n        \"\"\"shift every row 1 step to right converting columns into diagonals\"\"\"\n        # X = B x C x M x L\n        B, C, M, L = x.size()\n        x = F.pad(x, (0, M + 1))  # B x C x M x (L+M+1). Padding value is not important because it'll be overwritten\n        x = x.view(B, C, -1)  # B x C x ML+MM+M\n        x = x[:, :, :-M]  # B x C x ML+MM\n        x = x.view(B, C, M, M + L)  # B x C, M x L+M\n        x = x[:, :, :, :-1]\n        return x\n\n    @staticmethod\n    def _chunk(x, w):\n        \"\"\"convert into overlapping chunkings. Chunk size = 2w, overlap size = w\"\"\"\n\n        # non-overlapping chunks of size = 2w\n        x = x.view(x.size(0), x.size(1) // (w * 2), w * 2, x.size(2))\n\n        # use `as_strided` to make the chunks overlap with an overlap size = w\n        chunk_size = list(x.size())\n        chunk_size[1] = chunk_size[1] * 2 - 1\n\n        chunk_stride = list(x.stride())\n        chunk_stride[1] = chunk_stride[1] // 2\n        return x.as_strided(size=chunk_size, stride=chunk_stride)\n\n    def _mask_invalid_locations(self, input_tensor, w) -> torch.Tensor:\n        affected_seqlen = w\n        beginning_mask_2d = input_tensor.new_ones(w, w + 1).tril().flip(dims=[0])\n        beginning_mask = beginning_mask_2d[None, :, None, :]\n        ending_mask = beginning_mask.flip(dims=(1, 3))\n        seqlen = input_tensor.size(1)\n        beginning_input = input_tensor[:, :affected_seqlen, :, : w + 1]\n        beginning_mask = beginning_mask[:, :seqlen].expand(beginning_input.size())\n        beginning_input.masked_fill_(beginning_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n        ending_input = input_tensor[:, -affected_seqlen:, :, -(w + 1) :]\n        ending_mask = ending_mask[:, -seqlen:].expand(ending_input.size())\n        ending_input.masked_fill_(ending_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n\n    def _sliding_chunks_matmul_qk(self, q: torch.Tensor, k: torch.Tensor, w: int):\n        \"\"\"Matrix multiplicatio of query x key tensors using with a sliding window attention pattern.\n        This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)\n        with an overlap of size w\"\"\"\n        batch_size, seqlen, num_heads, head_dim = q.size()\n        assert seqlen % (w * 2) == 0, f\"Sequence length should be multiple of {w * 2}. 
Given {seqlen}\"\n        assert q.size() == k.size()\n\n        chunks_count = seqlen // w - 1\n\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2\n        q = q.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n        k = k.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        chunk_q = self._chunk(q, w)\n        chunk_k = self._chunk(k, w)\n\n        # matrix multipication\n        # bcxd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcyd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcxy: batch_size * num_heads x chunks x 2w x 2w\n        chunk_attn = torch.einsum(\"bcxd,bcyd->bcxy\", (chunk_q, chunk_k))  # multiply\n\n        # convert diagonals into columns\n        diagonal_chunk_attn = self._skew(chunk_attn, direction=(0, 0, 0, 1))\n\n        # allocate space for the overall attention matrix where the chunks are compined. The last dimension\n        # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to\n        # w previous words). The following column is attention score from each word to itself, then\n        # followed by w columns for the upper triangle.\n\n        diagonal_attn = diagonal_chunk_attn.new_empty((batch_size * num_heads, chunks_count + 1, w, w * 2 + 1))\n\n        # copy parts from diagonal_chunk_attn into the compined matrix of attentions\n        # - copying the main diagonal and the upper triangle\n        diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, : w + 1]\n        diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, : w + 1]\n        # - copying the lower triangle\n        diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, -(w + 1) : -1, w + 1 :]\n        diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, : w - 1, 1 - w :]\n\n        # separate batch_size and num_heads dimensions again\n        diagonal_attn = diagonal_attn.view(batch_size, num_heads, seqlen, 2 * w + 1).transpose(2, 1)\n\n        self._mask_invalid_locations(diagonal_attn, w)\n        return diagonal_attn\n\n    def _sliding_chunks_matmul_pv(self, prob: torch.Tensor, v: torch.Tensor, w: int):\n        \"\"\"Same as _sliding_chunks_matmul_qk but for prob and value tensors. 
It is expecting the same output\n        format from _sliding_chunks_matmul_qk\"\"\"\n        batch_size, seqlen, num_heads, head_dim = v.size()\n        assert seqlen % (w * 2) == 0\n        assert prob.size()[:3] == v.size()[:3]\n        assert prob.size(3) == 2 * w + 1\n        chunks_count = seqlen // w - 1\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size 2w\n        chunk_prob = prob.transpose(1, 2).reshape(batch_size * num_heads, seqlen // w, w, 2 * w + 1)\n\n        # group batch_size and num_heads dimensions into one\n        v = v.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        # pad seqlen with w at the beginning of the sequence and another w at the end\n        padded_v = F.pad(v, (0, 0, w, w), value=-1)\n\n        # chunk padded_v into chunks of size 3w and an overlap of size w\n        chunk_v_size = (batch_size * num_heads, chunks_count + 1, 3 * w, head_dim)\n        chunk_v_stride = padded_v.stride()\n        chunk_v_stride = chunk_v_stride[0], w * chunk_v_stride[1], chunk_v_stride[1], chunk_v_stride[2]\n        chunk_v = padded_v.as_strided(size=chunk_v_size, stride=chunk_v_stride)\n\n        skewed_prob = self._skew2(chunk_prob)\n\n        context = torch.einsum(\"bcwd,bcdh->bcwh\", (skewed_prob, chunk_v))\n        return context.view(batch_size, num_heads, seqlen, head_dim).transpose(1, 2)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        \"\"\"\n        LongformerSelfAttention expects `len(hidden_states)` to be multiple of `attention_window`.\n        Padding to `attention_window` happens in LongformerModel.forward to avoid redoing the padding on each layer.\n\n        The `attention_mask` is changed in `BertModel.forward` from 0, 1, 2 to\n            -ve: no attention\n              0: local attention\n            +ve: global attention\n\n        `encoder_hidden_states` and `encoder_attention_mask` are not supported and should be None\n        \"\"\"\n        # TODO: add support for `encoder_hidden_states` and `encoder_attention_mask`\n        assert encoder_hidden_states is None, \"`encoder_hidden_states` is not supported and should be None\"\n        assert encoder_attention_mask is None, \"`encoder_attention_mask` is not supported and shiould be None\"\n\n        if attention_mask is not None:\n            attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)\n            key_padding_mask = attention_mask < 0\n            extra_attention_mask = attention_mask > 0\n            remove_from_windowed_attention_mask = attention_mask != 0\n\n            num_extra_indices_per_batch = extra_attention_mask.long().sum(dim=1)\n            max_num_extra_indices_per_batch = num_extra_indices_per_batch.max()\n            if max_num_extra_indices_per_batch <= 0:\n                extra_attention_mask = None\n            else:\n                # To support the case of variable number of global attention in the rows of a batch,\n                # we use the following three selection masks to select global attention embeddings\n                # in a 3d tensor and pad it to `max_num_extra_indices_per_batch`\n                # 1) selecting embeddings that correspond to global attention\n                extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)\n                zero_to_max_range = torch.arange(\n                   
 0, max_num_extra_indices_per_batch, device=num_extra_indices_per_batch.device\n                )\n                # mask indicating which values are actually going to be padding\n                selection_padding_mask = zero_to_max_range < num_extra_indices_per_batch.unsqueeze(dim=-1)\n                # 2) location of the non-padding values in the selected global attention\n                selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)\n                # 3) location of the padding values in the selected global attention\n                selection_padding_mask_zeros = (selection_padding_mask == 0).nonzero(as_tuple=True)\n        else:\n            remove_from_windowed_attention_mask = None\n            extra_attention_mask = None\n            key_padding_mask = None\n\n        hidden_states = hidden_states.transpose(0, 1)\n        seqlen, batch_size, embed_dim = hidden_states.size()\n        assert embed_dim == self.embed_dim\n        q = self.query(hidden_states)\n        k = self.key(hidden_states)\n        v = self.value(hidden_states)\n        q /= math.sqrt(self.head_dim)\n\n        q = q.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        k = k.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        # attn_weights = (batch_size, seqlen, num_heads, window*2+1)\n        attn_weights = self._sliding_chunks_matmul_qk(q, k, self.one_sided_attention_window_size)\n        self._mask_invalid_locations(attn_weights, self.one_sided_attention_window_size)\n        if remove_from_windowed_attention_mask is not None:\n            # This implementation is fast and takes very little memory because num_heads x hidden_size = 1\n            # from (batch_size x seqlen) to (batch_size x seqlen x num_heads x hidden_size)\n            remove_from_windowed_attention_mask = remove_from_windowed_attention_mask.unsqueeze(dim=-1).unsqueeze(\n                dim=-1\n            )\n            # cast to fp32/fp16 then replace 1's with -inf\n            float_mask = remove_from_windowed_attention_mask.type_as(q).masked_fill(\n                remove_from_windowed_attention_mask, -10000.0\n            )\n            ones = float_mask.new_ones(size=float_mask.size())  # tensor of ones\n            # diagonal mask with zeros everywhere and -inf inplace of padding\n            d_mask = self._sliding_chunks_matmul_qk(ones, float_mask, self.one_sided_attention_window_size)\n            attn_weights += d_mask\n        assert list(attn_weights.size()) == [\n            batch_size,\n            seqlen,\n            self.num_heads,\n            self.one_sided_attention_window_size * 2 + 1,\n        ]\n\n        # the extra attention\n        if extra_attention_mask is not None:\n            selected_k = k.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_k[selection_padding_mask_nonzeros] = k[extra_attention_mask_nonzeros]\n            # (batch_size, seqlen, num_heads, max_num_extra_indices_per_batch)\n            selected_attn_weights = torch.einsum(\"blhd,bshd->blhs\", (q, selected_k))\n            selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000\n            # concat to attn_weights\n            # (batch_size, seqlen, num_heads, extra attention count + 2*window+1)\n            attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)\n\n        attn_weights_fp32 = F.softmax(attn_weights, dim=-1, 
dtype=torch.float32)  # use fp32 for numerical stability\n        attn_weights = attn_weights_fp32.type_as(attn_weights)\n\n        if key_padding_mask is not None:\n            # softmax sometimes inserts NaN if all positions are masked, replace them with 0\n            attn_weights = torch.masked_fill(attn_weights, key_padding_mask.unsqueeze(-1).unsqueeze(-1), 0.0)\n\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)\n        v = v.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        attn = None\n        if extra_attention_mask is not None:\n            selected_attn_probs = attn_probs.narrow(-1, 0, max_num_extra_indices_per_batch)\n            selected_v = v.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_v[selection_padding_mask_nonzeros] = v[extra_attention_mask_nonzeros]\n            # use `matmul` because `einsum` crashes sometimes with fp16\n            # attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v))\n            attn = torch.matmul(selected_attn_probs.transpose(1, 2), selected_v.transpose(1, 2)).transpose(1, 2)\n            attn_probs = attn_probs.narrow(\n                -1, max_num_extra_indices_per_batch, attn_probs.size(-1) - max_num_extra_indices_per_batch\n            ).contiguous()\n        if attn is None:\n            attn = self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n        else:\n            attn += self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n\n        assert attn.size() == (batch_size, seqlen, self.num_heads, self.head_dim), \"Unexpected size\"\n        attn = attn.transpose(0, 1).reshape(seqlen, batch_size, embed_dim).contiguous()\n\n        # For this case, we'll just recompute the attention for these indices\n        # and overwrite the attn tensor.\n        # TODO: remove the redundant computation\n        if extra_attention_mask is not None:\n            selected_hidden_states = hidden_states.new_zeros(max_num_extra_indices_per_batch, batch_size, embed_dim)\n            selected_hidden_states[selection_padding_mask_nonzeros[::-1]] = hidden_states[\n                extra_attention_mask_nonzeros[::-1]\n            ]\n\n            q = self.query_global(selected_hidden_states)\n            k = self.key_global(hidden_states)\n            v = self.value_global(hidden_states)\n            q /= math.sqrt(self.head_dim)\n\n            q = (\n                q.contiguous()\n                .view(max_num_extra_indices_per_batch, batch_size * self.num_heads, self.head_dim)\n                .transpose(0, 1)\n            )  # (batch_size * self.num_heads, max_num_extra_indices_per_batch, head_dim)\n            k = (\n                k.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            v = (\n                v.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            attn_weights = torch.bmm(q, k.transpose(1, 2))\n            assert list(attn_weights.size()) == [batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen]\n\n            attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights[selection_padding_mask_zeros[0], :, selection_padding_mask_zeros[1], :] 
= -10000.0\n            if key_padding_mask is not None:\n                attn_weights = attn_weights.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -10000.0,)\n            attn_weights = attn_weights.view(batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights_float = F.softmax(\n                attn_weights, dim=-1, dtype=torch.float32\n            )  # use fp32 for numerical stability\n            attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)\n            selected_attn = torch.bmm(attn_probs, v)\n            assert list(selected_attn.size()) == [\n                batch_size * self.num_heads,\n                max_num_extra_indices_per_batch,\n                self.head_dim,\n            ]\n\n            selected_attn_4d = selected_attn.view(\n                batch_size, self.num_heads, max_num_extra_indices_per_batch, self.head_dim\n            )\n            nonzero_selected_attn = selected_attn_4d[\n                selection_padding_mask_nonzeros[0], :, selection_padding_mask_nonzeros[1]\n            ]\n            attn[extra_attention_mask_nonzeros[::-1]] = nonzero_selected_attn.view(\n                len(selection_padding_mask_nonzeros[0]), -1\n            )\n\n        context_layer = attn.transpose(0, 1)\n        if self.output_attentions:\n            if extra_attention_mask is not None:\n                # With global attention, return global attention probabilities only\n                # batch_size x num_heads x max_num_global_attention_tokens x sequence_length\n                # which is the attention weights from tokens with global attention to all tokens\n                # It doesn't not return local attention\n                # In case of variable number of global attantion in the rows of a batch,\n                # attn_weights are padded with -10000.0 attention scores\n                attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            else:\n                # without global attention, return local attention probabilities\n                # batch_size x num_heads x sequence_length x window_size\n                # which is the attention weights of every token attending to its neighbours\n                attn_weights = attn_weights.permute(0, 2, 1, 3)\n        outputs = (context_layer, attn_weights) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nLONGFORMER_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.LongformerConfig`): Model configuration class with all the parameters of the\n            model. 
Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nLONGFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.LonmgformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n\n        global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to decide the attention given on each token, local attention or global attenion.\n            Tokens with global attention attends to all other tokens, and all other tokens attend to them. This is important for\n            task-specific finetuning because it makes the model more flexible at representing the task. For example,\n            for classification, the <s> token should be given global attention. For QA, all question tokens should also have\n            global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details.\n            Mask values selected in ``[0, 1]``:\n            ``0`` for local attention (a sliding window attention),\n            ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).\n\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Longformer Model outputting raw hidden-states without any specific head on top.\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel` to provide the ability to process\n    long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer`_by\n    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention combines a local (sliding window)\n    and global attention to extend to long documents without the O(n^2) increase in memory and compute.\n\n    The selfattention module `LongformerSelfAttention` implemented here supports the combination of local and\n    global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive\n    and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream\n    tasks. Future release will add support for autoregressive attention, but the support for dilated attention\n    requires a custom CUDA kernel to be memory and compute efficient.\n\n    .. _`Longformer: the Long-Document Transformer`:\n        https://arxiv.org/abs/2004.05150\n\n    \"\"\"\n\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if isinstance(config.attention_window, int):\n            assert config.attention_window % 2 == 0, \"`config.attention_window` has to be an even value\"\n            assert config.attention_window > 0, \"`config.attention_window` has to be positive\"\n            config.attention_window = [config.attention_window] * config.num_hidden_layers  # one value per layer\n        else:\n            assert len(config.attention_window) == config.num_hidden_layers, (\n                \"`len(config.attention_window)` should equal `config.num_hidden_layers`. \"\n                f\"Expected {config.num_hidden_layers}, given {len(config.attention_window)}\"\n            )\n\n        for i, layer in enumerate(self.encoder.layer):\n            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`\n            layer.attention.self = LongformerSelfAttention(config, layer_id=i)\n\n        self.init_weights()\n\n    def _pad_to_window_size(\n        self,\n        input_ids: torch.Tensor,\n        attention_mask: torch.Tensor,\n        token_type_ids: torch.Tensor,\n        position_ids: torch.Tensor,\n        inputs_embeds: torch.Tensor,\n        attention_window: int,\n        pad_token_id: int,\n    ):\n        \"\"\"A helper function to pad tokens and mask to work with implementation of Longformer selfattention.\"\"\"\n\n        assert attention_window % 2 == 0, f\"`attention_window` should be an even value. 
Given {attention_window}\"\n        input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape\n        batch_size, seqlen = input_shape[:2]\n\n        padding_len = (attention_window - seqlen % attention_window) % attention_window\n        if padding_len > 0:\n            logger.info(\n                \"Input ids are automatically padded from {} to {} to be a multiple of `config.attention_window`: {}\".format(\n                    seqlen, seqlen + padding_len, attention_window\n                )\n            )\n            if input_ids is not None:\n                input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)\n            if attention_mask is not None:\n                attention_mask = F.pad(\n                    attention_mask, (0, padding_len), value=False\n                )  # no attention on the padding tokens\n            if token_type_ids is not None:\n                token_type_ids = F.pad(token_type_ids, (0, padding_len), value=0)  # pad with token_type_id = 0\n            if position_ids is not None:\n                # pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings\n                position_ids = F.pad(position_ids, (0, padding_len), value=pad_token_id)\n            if inputs_embeds is not None:\n                input_ids_padding = inputs_embeds.new_full(\n                    (batch_size, padding_len), self.config.pad_token_id, dtype=torch.long,\n                )\n                inputs_embeds_padding = self.embeddings(input_ids_padding)\n                inputs_embeds = torch.cat([inputs_embeds, inputs_embeds_padding], dim=-2)\n\n        return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        
import torch\n        from transformers1 import LongformerModel, LongformerTokenizer\n\n        model = LongformerModel.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention\n        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention\n        attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,\n                                            # classification: the <s> token\n                                            # QA: question tokens\n                                            # LM: potentially on the beginning of sentences and paragraphs\n        sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)\n        \"\"\"\n\n        # padding\n        attention_window = (\n            self.config.attention_window\n            if isinstance(self.config.attention_window, int)\n            else max(self.config.attention_window)\n        )\n\n        # merge `global_attention_mask` and `attention_mask`\n        if global_attention_mask is not None:\n            # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)\n            # (global_attention_mask + 1) => 1 for local attention, 2 for global attention\n            # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention\n            if attention_mask is not None:\n                attention_mask = attention_mask * (global_attention_mask + 1)\n            else:\n                # simply use `global_attention_mask` as `attention_mask`\n                # if no `attention_mask` is given\n                attention_mask = global_attention_mask + 1\n\n        padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            attention_window=attention_window,\n            pad_token_id=self.config.pad_token_id,\n        )\n\n        # embed\n        output = super().forward(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=None,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n        )\n\n        # undo padding\n        if padding_len > 0:\n            # `output` has the following tensors: sequence_output, pooled_output, (hidden_states), (attentions)\n            # `sequence_output`: unpad because the calling function is expecting a length == input_ids.size(1)\n            # `pooled_output`: independent of the sequence length\n            # `hidden_states`: mainly used for debugging and analysis, so keep the padding\n            # `attentions`: mainly used for debugging and analysis, so keep the padding\n            output = output[0][:, :-padding_len], *output[1:]\n\n        return 
output\n\n\n@add_start_docstrings(\"\"\"Longformer Model with a `language modeling` head on top. \"\"\", LONGFORMER_START_DOCSTRING)\nclass LongformerForMaskedLM(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import LongformerForMaskedLM, LongformerTokenizer\n\n        model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! 
'] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        attention_mask = None  # default is local attention everywhere, which is a good choice for MaskedLM\n                               # check ``LongformerModel.forward`` for more details how to set `attention_mask`\n        loss, prediction_scores = model(input_ids, attention_mask=attention_mask, masked_lm_labels=input_ids)\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForSequenceClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.classifier = LongformerClassificationHead(config)\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of 
each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForSequenceClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on CLS token...\")\n            global_attention_mask = torch.zeros_like(input_ids)\n            # global attention on cls token\n            global_attention_mask[:, 0] = 1\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\nclass LongformerClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, hidden_states, **kwargs):\n        hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        hidden_states = torch.tanh(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        output = self.out_proj(hidden_states)\n        return output\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForQuestionAnswering(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForQuestionAnswering\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n        model = 
LongformerForQuestionAnswering.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text, return_tensors=\"pt\")\n        input_ids = encoding[\"input_ids\"]\n\n        # default is local attention everywhere\n        # the forward method will automatically set global attention on question tokens\n        attention_mask = encoding[\"attention_mask\"]\n\n        start_scores, end_scores = model(input_ids, attention_mask=attention_mask)\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())\n\n        answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]\n        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens)) # remove space prepending space token\n\n        \"\"\"\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on question tokens...\")\n            # put global attention on all tokens until `config.sep_token_id` is reached\n            global_attention_mask = _compute_global_attention_mask(input_ids, self.config.sep_token_id)\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForTokenClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForTokenClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForTokenClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = 
self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForMultipleChoice(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        labels=None,\n        position_ids=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForMultipleChoice\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForMultipleChoice.from_pretrained('allenai/longformer-base-4096')\n        # context = \"The dog is cute\" | choice = \"the dog\" / \"the cat\"\n        choices = [(\"The dog is cute\", \"the dog\"), (\"The dog is cute\", \"the cat\")]\n        input_ids = torch.tensor([tokenizer.encode(s[0], s[1], add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        # global attention is automatically put on \"the dog\" and \"the cat\"\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on multiple choice...\")\n            # put global attention on all tokens after `config.sep_token_id`\n            global_attention_mask = torch.stack(\n                [\n                    _compute_global_attention_mask(input_ids[:, i], self.config.sep_token_id, before_sep_token=False)\n                    for i in range(num_choices)\n                ],\n                dim=1,\n            )\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_global_attention_mask = (\n            global_attention_mask.view(-1, global_attention_mask.size(-1))\n            if global_attention_mask is not None\n            else None\n        )\n\n        outputs = self.longformer(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            global_attention_mask=flat_global_attention_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + 
outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 Marian Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MarianMTModel model, ported from the Marian C++ repo.\"\"\"\n\n\nfrom .modeling_bart import BartForConditionalGeneration\n\n\nMARIAN_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Marian models at https://huggingface.co/models?search=Helsinki-NLP\n]\n\n\nclass MarianMTModel(BartForConditionalGeneration):\n    r\"\"\"\n    Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.\n    Model API is identical to BartForConditionalGeneration.\n    Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__\n\n    Examples::\n\n        from transformers1 import MarianTokenizer, MarianMTModel\n        from typing import List\n        src = 'fr'  # source language\n        trg = 'en'  # target language\n        sample_text = \"où est l'arrêt de bus ?\"\n        mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'\n\n        model = MarianMTModel.from_pretrained(mname)\n        tok = MarianTokenizer.from_pretrained(mname)\n        batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference\n        gen = model.generate(**batch)  # for forward pass: model(**batch)\n        words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns \"Where is the the bus stop ?\"\n\n    \"\"\"\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        logits[:, self.config.pad_token_id] = float(\"-inf\")\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MMBT model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .file_utils import add_start_docstrings\nfrom .modeling_utils import ModuleUtilsMixin\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModalEmbeddings(nn.Module):\n    \"\"\"Generic Modal Embeddings which takes in an encoder, and a transformer embedding.\n    \"\"\"\n\n    def __init__(self, config, encoder, embeddings):\n        super().__init__()\n        self.config = config\n        self.encoder = encoder\n        self.proj_embeddings = nn.Linear(config.modal_hidden_size, config.hidden_size)\n        self.position_embeddings = embeddings.position_embeddings\n        self.token_type_embeddings = embeddings.token_type_embeddings\n        self.word_embeddings = embeddings.word_embeddings\n        self.LayerNorm = embeddings.LayerNorm\n        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)\n\n    def forward(self, input_modal, start_token=None, end_token=None, position_ids=None, token_type_ids=None):\n        token_embeddings = self.proj_embeddings(self.encoder(input_modal))\n        seq_length = token_embeddings.size(1)\n\n        if start_token is not None:\n            start_token_embeds = self.word_embeddings(start_token)\n            seq_length += 1\n            token_embeddings = torch.cat([start_token_embeds.unsqueeze(1), token_embeddings], dim=1)\n\n        if end_token is not None:\n            end_token_embeds = self.word_embeddings(end_token)\n            seq_length += 1\n            token_embeddings = torch.cat([token_embeddings, end_token_embeds.unsqueeze(1)], dim=1)\n\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_modal.device)\n            position_ids = position_ids.unsqueeze(0).expand(input_modal.size(0), seq_length)\n\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(\n                (input_modal.size(0), seq_length), dtype=torch.long, device=input_modal.device\n            )\n\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = token_embeddings + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nMMBT_START_DOCSTRING = r\"\"\"    MMBT model was proposed in\n    `Supervised Multimodal Bitransformers for Classifying Images and Text`_\n    by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.\n    It's a supervised multimodal bitransformer model that fuses information from text and other image encoders,\n    and obtain state-of-the-art performance on various multimodal classification benchmark 
tasks.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Supervised Multimodal Bitransformers for Classifying Images and Text`:\n        https://github.com/facebookresearch/mmbt\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.MMBTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n        transformer (:class: `~nn.Module`): A text transformer that is used by MMBT.\n            It should have embeddings, encoder, and pooler attributes.\n        encoder (:class: `~nn.Module`): Encoder for the second modality.\n            It should take in a batch of modal inputs and return k, n dimension embeddings.\n\"\"\"\n\nMMBT_INPUTS_DOCSTRING = r\"\"\"    Inputs:\n        **input_modal**: ``torch.FloatTensor`` of shape ``(batch_size, ***)``:\n            The other modality data. It will be the shape that the encoder for that type expects.\n            e.g. With an Image Encoder, the shape would be (batch_size, channels, height, width)\n        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of input sequence tokens in the vocabulary.\n            It does not expect [CLS] token to be added as it's appended to the end of other modality embeddings.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        **modal_start_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional start token to be added to Other Modality Embedding. [CLS] Most commonly used for Classification tasks.\n        **modal_end_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional end token to be added to Other Modality Embedding. 
[SEP] Most commonly used.\n        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Segment token indices to indicate different portions of the inputs.\n        **modal_token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Segment token indices to indicate different portions of the non-text modality.\n            The embeddings from these tokens will be summed with the respective token embeddings for the non-text modality.\n        **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings.\n        **modal_position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings for the non-text modality.\n        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        **inputs_embeds**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, embedding_dim)``:\n            Optionally, instead of passing ``input_ids`` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        **encoder_hidden_states**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``:\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model\n            is configured as a decoder.\n        **encoder_attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on the padding token indices of the encoder input. 
This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare MMBT Model outputting raw hidden-states without any specific head on top.\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\"\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``\n                Sequence of hidden-states at the output of the last layer of the model.\n            **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Bert pretraining. This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            mmbt = MMBTModel(config, transformer, encoder)\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.config = config\n        self.transformer = transformer\n        self.modal_encoder = ModalEmbeddings(config, encoder, transformer.embeddings)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_txt_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_txt_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        modal_embeddings = self.modal_encoder(\n            input_modal,\n            start_token=modal_start_tokens,\n            end_token=modal_end_tokens,\n            position_ids=modal_position_ids,\n            token_type_ids=modal_token_type_ids,\n        )\n\n        input_modal_shape = modal_embeddings.size()[:-1]\n\n        if token_type_ids is None:\n            token_type_ids = torch.ones(input_txt_shape, dtype=torch.long, device=device)\n\n        txt_embeddings = self.transformer.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        embedding_output = torch.cat([modal_embeddings, txt_embeddings], 1)\n\n        input_shape = embedding_output.size()[:-1]\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        else:\n            attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device, dtype=torch.long), attention_mask], dim=1\n            )\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(input_shape, device=device)\n        else:\n            encoder_attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device), encoder_attention_mask], dim=1\n            )\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, self.device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        encoder_outputs = self.transformer.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.transformer.pooler(sequence_output)\n\n        outputs = (sequence_output, 
pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\n    \"\"\"MMBT Model with a sequence classification/regression head on top (a linear layer on top of\n                      the pooled output)\"\"\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTForClassification(nn.Module):\n    r\"\"\"\n            **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in ``[0, ..., config.num_labels - 1]``.\n                If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n                If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n                Classification (or regression if config.num_labels==1) loss.\n            **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            model = MMBTForClassification(config, transformer, encoder)\n            outputs = model(input_modal, input_ids, labels=labels)\n            loss, logits = outputs[:2]\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.num_labels = config.num_labels\n\n        self.mmbt = MMBTModel(config, transformer, encoder)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n\n        outputs = self.mmbt(\n            input_modal=input_modal,\n            input_ids=input_ids,\n            modal_start_tokens=modal_start_tokens,\n            modal_end_tokens=modal_end_tokens,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            modal_token_type_ids=modal_token_type_ids,\n            position_ids=position_ids,\n            modal_position_ids=modal_position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT model.\"\"\"\n\n\nimport json\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu_new, swish\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path):\n    \"\"\" Load tf pre-trained weights in a pytorch model (from NumPy arrays here)\n    \"\"\"\n    import re\n    import numpy as np\n\n    if \".ckpt\" in openai_checkpoint_folder_path:\n        openai_checkpoint_folder_path = os.path.dirname(openai_checkpoint_folder_path)\n\n    logger.info(\"Loading weights from {}\".format(openai_checkpoint_folder_path))\n\n    with open(openai_checkpoint_folder_path + \"/parameters_names.json\", \"r\", encoding=\"utf-8\") as names_handle:\n        names = json.load(names_handle)\n    with open(openai_checkpoint_folder_path + \"/params_shapes.json\", \"r\", encoding=\"utf-8\") as shapes_handle:\n        shapes = json.load(shapes_handle)\n    offsets = np.cumsum([np.prod(shape) for shape in shapes])\n    init_params = [np.load(openai_checkpoint_folder_path + \"/params_{}.npy\".format(n)) for n in range(10)]\n    init_params = np.split(np.concatenate(init_params, 0), offsets)[:-1]\n    init_params = [param.reshape(shape) for param, shape in zip(init_params, shapes)]\n\n    # This was used when we had a single embedding matrix for positions and tokens\n    # init_params[0] = np.concatenate([init_params[1], init_params[0]], 0)\n    # model init_params[1]\n    init_params = [arr.squeeze() for arr in init_params]\n\n    try:\n        assert model.tokens_embed.weight.shape == init_params[1].shape\n        assert model.positions_embed.weight.shape == init_params[0].shape\n    except AssertionError as e:\n        e.args += (model.tokens_embed.weight.shape, init_params[1].shape)\n        e.args += (model.positions_embed.weight.shape, init_params[0].shape)\n        raise\n\n    model.tokens_embed.weight.data = torch.from_numpy(init_params[1])\n    model.positions_embed.weight.data = torch.from_numpy(init_params[0])\n    names.pop(0)\n    # Pop position and token embedding arrays\n    init_params.pop(0)\n    init_params.pop(0)\n\n    for name, array in zip(names, init_params):  # names[1:n_transfer], init_params[1:n_transfer]):\n        name = name[6:]  # skip \"model/\"\n        assert name[-2:] == \":0\"\n        name = name[:-2]\n        
name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"w\":\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nACT_FNS = {\"relu\": nn.ReLU, \"swish\": swish, \"gelu\": gelu_new}\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\"bias\", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.output_attentions = config.output_attentions\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / math.sqrt(v.size(-1))\n        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implem method: mask_attn_weights\n        # XD: self.b may be larger than w, so we need to crop it\n        b = self.bias[:, :, : w.size(-2), : w.size(-1)]\n        w = w * b + -1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the 
attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)\n        else:\n            return x.permute(0, 2, 1, 3)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT_FNS[config.afn]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        attn_outputs = self.attn(x, attention_mask=attention_mask, head_mask=head_mask)\n        a = attn_outputs[0]\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n)\n        h = self.ln_2(n + m)\n\n        outputs = [h] + attn_outputs[1:]\n        return outputs\n\n\nclass OpenAIGPTPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    load_tf_weights = load_tf_weights_in_openai_gpt\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif 
isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.OpenAIGPTTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputting raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd)\n        self.positions_embed = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def set_input_embeddings(self, new_embeddings):\n        self.tokens_embed = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the 
self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            # Code is different from when we had a single embedding matrice from position and token embeddings\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(input_shape[-1], dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids)\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))\n            token_type_embeds = self.tokens_embed(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        all_attentions = ()\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(hidden_states.view(*output_shape),)\n\n            outputs = block(hidden_states, attention_mask, head_mask[i])\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n        outputs = (hidden_states.view(*output_shape),)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = transformer_outputs[0]\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        config.num_labels = 1\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. 
(see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTDoubleHeadsModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})  # Add a [CLS] to the vocabulary (we should train it also!)\n        model.resize_token_embeddings(len(tokenizer))\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        mc_token_ids = torch.tensor([input_ids.size(-1)-1, input_ids.size(-1)-1]).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = 
transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch REFORMER model. \"\"\"\n\nimport logging\nimport sys\nfrom collections import namedtuple\nfrom functools import reduce\nfrom operator import mul\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.autograd.function import Function\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu, gelu_fast, gelu_new, swish\nfrom .configuration_reformer import ReformerConfig\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, apply_chunking_to_forward\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/reformer-crime-and-punishment\",\n    \"google/reformer-enwik8\",\n    # See all Reformer models at https://huggingface.co/models?filter=reformer\n]\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\n    \"gelu\": gelu,\n    \"relu\": torch.nn.functional.relu,\n    \"swish\": swish,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n    \"mish\": mish,\n}\n\n\n# Define named tuples for nn.Modules here\nLSHSelfAttentionOutput = namedtuple(\"LSHSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nLocalSelfAttentionOutput = namedtuple(\"LocalSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\"])\nAttentionOutput = namedtuple(\"AttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nReformerOutput = namedtuple(\"ReformerOutput\", [\"hidden_states\", \"attn_output\", \"attention_probs\", \"buckets\"])\nReformerBackwardOutput = namedtuple(\n    \"ReformerBackwardOutput\", [\"attn_output\", \"hidden_states\", \"grad_attn_output\", \"grad_hidden_states\"]\n)\nReformerEncoderOutput = namedtuple(\"ReformerEncoderOutput\", [\"hidden_states\", \"all_hidden_states\", \"all_attentions\"])\n\n\ndef _get_least_common_mult_chunk_len(config):\n    attn_types = config.attn_layers\n    attn_types_set = set(attn_types)\n    if len(attn_types_set) == 1 and attn_types[0] == \"lsh\":\n        return config.lsh_attn_chunk_length\n    elif len(attn_types_set) == 1 and attn_types[0] == \"local\":\n        return config.local_attn_chunk_length\n    elif len(attn_types_set) == 2 and attn_types_set == set([\"lsh\", \"local\"]):\n        return np.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length)\n    else:\n        raise NotImplementedError(\n            \"Only attn layer types 'lsh' and 'local' exist, but `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                config.attn_layers\n            )\n        )\n\n\nclass AxialPositionEmbeddings(nn.Module):\n    \"\"\"Constructs axial position embeddings. 
Useful for very long input\n    sequences to save memory and time.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.axial_pos_shape = config.axial_pos_shape\n        self.axial_pos_embds_dim = config.axial_pos_embds_dim\n        self.dropout = config.hidden_dropout_prob\n\n        self.least_common_mult_chunk_length = _get_least_common_mult_chunk_len(config)\n        self.weights = nn.ParameterList()\n\n        assert (\n            sum(self.axial_pos_embds_dim) == config.hidden_size\n        ), \"Make sure that config.axial_pos_embds factors: {} sum to config.hidden_size: {}\".format(\n            self.axial_pos_embds_dim, config.hidden_size\n        )\n\n        # create weights\n        for axis, axial_pos_embd_dim in enumerate(self.axial_pos_embds_dim):\n            # create expanded shapes\n            ax_shape = [1] * len(self.axial_pos_shape)\n            ax_shape[axis] = self.axial_pos_shape[axis]\n            ax_shape = tuple(ax_shape) + (axial_pos_embd_dim,)\n\n            # create tensor and init\n            self.weights.append(nn.Parameter(torch.ones(ax_shape, dtype=torch.float32)))\n\n    def forward(self, position_ids):\n        # broadcast weights to correct shape\n        batch_size = position_ids.shape[0]\n        sequence_length = position_ids.shape[1]\n\n        broadcasted_weights = [\n            weight.expand((batch_size,) + self.axial_pos_shape + weight.shape[-1:]) for weight in self.weights\n        ]\n\n        if self.training is True:\n            assert (\n                reduce(mul, self.axial_pos_shape) == sequence_length\n            ), \"If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.\".format(\n                self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)\n            )\n            if self.dropout > 0:\n                weights = torch.cat(broadcasted_weights, dim=-1)\n                # permute weights so that 2D correctly drops dims 1 and 2\n                transposed_weights = weights.transpose(2, 1)\n                # drop entire matrix of last two dims (prev dims 1 and 2)\n                dropped_transposed_weights = nn.functional.dropout2d(\n                    transposed_weights, p=self.dropout, training=self.training\n                )\n                dropped_weights = dropped_transposed_weights.transpose(2, 1)\n\n                position_encodings = torch.reshape(dropped_weights, (batch_size, sequence_length, -1))\n\n            else:\n                position_encodings = torch.cat(\n                    [torch.reshape(weight, (batch_size, sequence_length, -1)) for weight in broadcasted_weights],\n                    dim=-1,\n                )\n\n        else:\n            assert (\n                reduce(mul, self.axial_pos_shape) >= sequence_length\n            ), \"Make sure that config.axial_pos_shape factors: {} multiply at least to max(sequence_length, least_common_mult_chunk_length): max({}, {})\".format(\n                self.axial_pos_shape, sequence_length, self.least_common_mult_chunk_length,\n            )\n\n            # reshape axial encodings and use only until sequence_length\n            position_encodings = torch.cat(broadcasted_weights, dim=-1)\n            position_encodings = position_encodings.view(batch_size, -1, position_encodings.shape[-1])[\n                
:, :sequence_length\n            ]\n\n        return position_encodings\n\n\nclass PositionEmbeddings(nn.Module):\n    \"\"\"Constructs conventional position embeddings of shape `[max_pos_embeddings, hidden_size]`.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n        self.embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n\n    def forward(self, position_ids):\n        position_embeddings = self.embedding(position_ids)\n        position_embeddings = nn.functional.dropout(position_embeddings, p=self.dropout, training=self.training)\n        return position_embeddings\n\n\nclass ReformerEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.max_position_embeddings = config.max_position_embeddings\n        self.dropout = config.hidden_dropout_prob\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)\n        self.position_embeddings = (\n            AxialPositionEmbeddings(config) if config.axial_pos_embds else PositionEmbeddings(config)\n        )\n\n    def forward(self, input_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n            device = input_ids.device\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n            device = inputs_embeds.device\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n\n        assert (\n            position_ids.shape[-1] <= self.max_position_embeddings\n        ), \"Sequence Length: {} has to be larger equal than config.max_position_embeddings: {}\".format(\n            position_ids.shape[-1], self.max_position_embeddings\n        )\n\n        # dropout\n        embeddings = nn.functional.dropout(inputs_embeds, p=self.dropout, training=self.training)\n\n        # add positional embeddings\n        position_embeddings = self.position_embeddings(position_ids)\n        embeddings = embeddings + position_embeddings\n        return embeddings\n\n\nclass EfficientAttentionMixin:\n    \"\"\"\n    A few utilities for nn.Modules in Reformer, to be used as a mixin.\n    \"\"\"\n\n    def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after):\n        \"\"\" Used to implement attention between consecutive chunks.\n\n            Args:\n                vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...]\n                num_chunks_before: chunks before current chunk to include in attention\n                num_chunks_after: chunks after current chunk to include in attention\n\n            Returns:\n                tensor of shape [num_chunks, N * chunk_length, ...], where\n                N = (1 + num_chunks_before + num_chunks_after).\n        \"\"\"\n        if num_chunks_before == 0 and num_chunks_after == 0:\n            return vectors\n\n        slices = []\n        for i in range(-num_chunks_before, num_chunks_after + 1):\n            if i == 0:\n                slices.append(vectors)\n            else:\n                slices.append(torch.cat([vectors[:, :, i:, ...], 
vectors[:, :, :i, ...]], dim=2))\n        return torch.cat(slices, dim=3)\n\n    def _split_hidden_size_dim(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            splits hidden_size dim into attn_head_size and num_attn_heads\n        \"\"\"\n        new_x_shape = x.size()[:-1] + (num_attn_heads, attn_head_size)\n        x = x.view(*new_x_shape)\n        return x.transpose(2, 1)\n\n    def _merge_hidden_size_dims(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            merges attn_head_size dim and num_attn_heads dim into hidden_size\n        \"\"\"\n        x = x.permute(0, 2, 1, 3)\n        return torch.reshape(x, (x.size()[0], -1, num_attn_heads * attn_head_size))\n\n    def _split_seq_length_dim_to(self, vectors, dim_factor_1, dim_factor_2, num_attn_heads, attn_head_size=None):\n        \"\"\"\n            splits sequence length dim of vectors into `dim_factor_1` and `dim_factor_2` dims\n        \"\"\"\n        batch_size = vectors.shape[0]\n        split_dim_shape = (batch_size, num_attn_heads, dim_factor_1, dim_factor_2)\n\n        if len(vectors.shape) == 4:\n            return torch.reshape(vectors, split_dim_shape + (attn_head_size,))\n        elif len(vectors.shape) == 3:\n            return torch.reshape(vectors, split_dim_shape)\n        else:\n            raise ValueError(\"Input vector rank should be one of [3, 4], but is: {}\".format(len(vectors.shape)))\n\n\nclass LSHSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n\n        self.chunk_length = config.lsh_attn_chunk_length\n        self.num_hashes = config.num_hashes\n        self.num_buckets = config.num_buckets\n        self.num_chunks_before = config.lsh_num_chunks_before\n        self.num_chunks_after = config.lsh_num_chunks_after\n        self.hash_seed = config.hash_seed\n        self.is_decoder = config.is_decoder\n        self.max_position_embeddings = config.max_position_embeddings\n\n        self.dropout = config.lsh_attention_probs_dropout_prob\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        # projection matrices\n        self.query_key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        # save mask value here. 
Need fp32 and fp16 mask values\n        self.register_buffer(\"self_mask_value_float16\", torch.tensor(-1e3))\n        self.register_buffer(\"self_mask_value_float32\", torch.tensor(-1e5))\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n        **kwargs\n    ):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # num hashes can optionally be overwritten by user\n        num_hashes = num_hashes if num_hashes is not None else self.num_hashes\n\n        # project hidden_states to query_key and value\n        query_key_vectors = self.query_key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # free memory\n        del hidden_states\n\n        query_key_vectors = self._split_hidden_size_dim(\n            query_key_vectors, self.num_attention_heads, self.attention_head_size\n        )\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of value_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        # set `num_buckets` on the fly, recommended way to do it\n        if self.num_buckets is None:\n            self._set_num_buckets(sequence_length)\n\n        # use cached buckets for backprop only\n        if buckets is None:\n            # hash query key vectors into buckets\n            buckets = self._hash_vectors(query_key_vectors, num_hashes)\n\n        assert (\n            int(buckets.shape[-1]) == num_hashes * sequence_length\n        ), \"last dim of buckets is {}, but should be {}\".format(buckets.shape[-1], num_hashes * sequence_length)\n\n        sorted_bucket_idx, undo_sorted_bucket_idx = self._get_sorted_bucket_idx_and_undo_sorted_bucket_idx(\n            sequence_length, buckets, num_hashes\n        )\n\n        # make sure bucket idx is not longer then sequence length\n        sorted_bucket_idx = sorted_bucket_idx % sequence_length\n\n        # cluster query key value vectors according to hashed buckets\n        query_key_vectors = self._gather_by_expansion(query_key_vectors, sorted_bucket_idx, num_hashes)\n        value_vectors = self._gather_by_expansion(value_vectors, sorted_bucket_idx, num_hashes)\n\n        query_key_vectors = self._split_seq_length_dim_to(\n            query_key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # 
scale key vectors\n        key_vectors = self._len_and_dim_norm(query_key_vectors)\n\n        # get attention probs\n        out_vectors, logits, attention_probs = self._attend(\n            query_vectors=query_key_vectors,\n            key_vectors=key_vectors,\n            value_vectors=value_vectors,\n            sorted_bucket_idx=sorted_bucket_idx,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n        )\n        # free memory\n        del query_key_vectors, key_vectors, value_vectors\n\n        # sort clusters back to correct ordering\n        out_vectors, logits = ReverseSort.apply(\n            out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, self.num_hashes\n        )\n\n        # sum up all hash rounds\n        if num_hashes > 1:\n            out_vectors = self._split_seq_length_dim_to(\n                out_vectors, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            )\n            logits = self._split_seq_length_dim_to(\n                logits, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            ).unsqueeze(-1)\n\n            probs_vectors = torch.exp(logits - torch.logsumexp(logits, dim=2, keepdim=True))\n            out_vectors = torch.sum(out_vectors * probs_vectors, dim=2)\n            # free memory\n            del probs_vectors\n\n        # free memory\n        del logits\n\n        assert out_vectors.shape == (\n            batch_size,\n            self.num_attention_heads,\n            sequence_length,\n            self.attention_head_size,\n        ), \"out_vectors have be of shape `[batch_size, config.num_attention_heads, sequence_length, config.attention_head_size]`.\"\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LSHSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs, buckets=buckets)\n\n    def _hash_vectors(self, vectors, num_hashes):\n        batch_size = vectors.shape[0]\n\n        # See https://arxiv.org/pdf/1509.02897.pdf\n        # We sample a different random rotation for each round of hashing to\n        # decrease the probability of hash misses.\n        if isinstance(self.num_buckets, int):\n            assert (\n                self.num_buckets % 2 == 0\n            ), \"There should be an even number of bucktes, but `self.num_bucktes`: {}\".format(self.num_buckets)\n            rotation_size = self.num_buckets\n            num_buckets = self.num_buckets\n        else:\n            # Factorize the hash if self.num_buckets is a list or tuple\n            rotation_size, num_buckets = 0, 1\n            for bucket_factor in self.num_buckets:\n                assert bucket_factor % 2 == 0, \"The number of buckets should be even, but `num_bucket`: {}\".format(\n                    bucket_factor\n                )\n                rotation_size = rotation_size + bucket_factor\n                num_buckets = num_buckets * bucket_factor\n\n        # remove gradient\n        vectors = vectors.detach()\n\n        if self.hash_seed is not None:\n            # for determinism\n            torch.manual_seed(self.hash_seed)\n\n        rotations_shape = (self.num_attention_heads, vectors.shape[-1], num_hashes, rotation_size // 2)\n        # create a random self.attention_head_size x num_hashes x num_buckets/2\n        random_rotations = 
torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)\n\n        # Output dim: Batch_Size x Num_Attn_Heads x Num_Hashes x Seq_Len x Num_Buckets/2\n        rotated_vectors = torch.einsum(\"bmtd,mdhr->bmhtr\", vectors, random_rotations)\n\n        if isinstance(self.num_buckets, int) or len(self.num_buckets) == 1:\n            rotated_vectors = torch.cat([rotated_vectors, -rotated_vectors], dim=-1)\n            buckets = torch.argmax(rotated_vectors, dim=-1)\n        else:\n            # Get the buckets for them and combine.\n            buckets, cur_sum, cur_product = None, 0, 1\n            for bucket_factor in self.num_buckets:\n                rotated_vectors_factor = rotated_vectors[..., cur_sum : cur_sum + (bucket_factor // 2)]\n                cur_sum = cur_sum + bucket_factor // 2\n                rotated_vectors_factor = torch.cat([rotated_vectors_factor, -rotated_vectors_factor], dim=-1)\n\n                if buckets is None:\n                    buckets = torch.argmax(rotated_vectors_factor, dim=-1)\n                else:\n                    buckets = buckets + (cur_product * torch.argmax(rotated_vectors_factor, dim=-1))\n\n                cur_product = cur_product * bucket_factor\n\n        # buckets is now (Batch_size x Num_Attn_Heads x Num_Hashes x Seq_Len).\n        # Next we add offsets so that bucket numbers from different hashing rounds don't overlap.\n        offsets = torch.arange(num_hashes, device=vectors.device)\n        offsets = (offsets * num_buckets).view((1, 1, -1, 1))\n\n        # expand to batch size and num attention heads\n        offsets = offsets.expand((batch_size, self.num_attention_heads) + offsets.shape[-2:])\n        offset_buckets = (buckets + offsets).flatten(start_dim=2, end_dim=3)\n\n        return offset_buckets\n\n    def _get_sorted_bucket_idx_and_undo_sorted_bucket_idx(self, sequence_length, buckets, num_hashes):\n        # no gradients are needed\n        with torch.no_grad():\n            batch_size = buckets.shape[0]\n\n            # arange and expand\n            orig_indices = torch.arange(num_hashes * sequence_length, device=buckets.device).view(1, 1, -1)\n            orig_indices = orig_indices.expand(batch_size, self.num_attention_heads, orig_indices.shape[-1])\n\n            # scale buckets\n            scaled_buckets = sequence_length * buckets + (orig_indices % sequence_length)\n\n            # remove gradient\n            scaled_buckets = scaled_buckets.detach()\n\n            # Hash-based sort\n            sorted_bucket_idx = torch.argsort(scaled_buckets, dim=-1)\n\n            # create simple indices to scatter to, to have undo sort\n            indices = (\n                torch.arange(sorted_bucket_idx.shape[-1], device=buckets.device)\n                .view(1, 1, -1)\n                .expand(sorted_bucket_idx.shape)\n            )\n\n            # get undo sort\n            undo_sorted_bucket_idx = sorted_bucket_idx.new(*sorted_bucket_idx.size())\n            undo_sorted_bucket_idx.scatter_(-1, sorted_bucket_idx, indices)\n\n        return sorted_bucket_idx, undo_sorted_bucket_idx\n\n    def _set_num_buckets(self, sequence_length):\n        # `num_buckets` should be set to 2 * sequence_length // chunk_length as recommended in paper\n        num_buckets_pow_2 = (2 * (sequence_length // self.chunk_length)).bit_length() - 1\n        # make sure buckets are power of 2\n        num_buckets = 2 ** num_buckets_pow_2\n\n        # factorize `num_buckets` if `num_buckets` becomes too large\n        num_buckets_limit = 
2 * max(\n            int((self.max_position_embeddings // self.chunk_length) ** (0.5)), self.chunk_length,\n        )\n        if num_buckets > num_buckets_limit:\n            num_buckets = [2 ** (num_buckets_pow_2 // 2), 2 ** (num_buckets_pow_2 - num_buckets_pow_2 // 2)]\n\n        logger.warning(\"config.num_buckets is not set. Setting config.num_buckets to {}...\".format(num_buckets))\n\n        # set num buckets in config to be properly saved\n        self.config.num_buckets = num_buckets\n        self.num_buckets = num_buckets\n\n    def _attend(\n        self, query_vectors, key_vectors, value_vectors, sorted_bucket_idx, attention_mask, head_mask,\n    ):\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n\n        # get logits and dots\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        query_bucket_idx = self._split_seq_length_dim_to(\n            sorted_bucket_idx, -1, self.chunk_length, self.num_attention_heads\n        )\n        key_value_bucket_idx = self._look_adjacent(query_bucket_idx, self.num_chunks_before, self.num_chunks_after)\n\n        # get correct mask values depending on precision\n        if query_key_dots.dtype == torch.float16:\n            self_mask_value = self.self_mask_value_float16.half()\n            mask_value = self.mask_value_float16.half()\n        else:\n            self_mask_value = self.self_mask_value_float32\n            mask_value = self.mask_value_float32\n\n        mask = self._compute_attn_mask(query_bucket_idx, key_value_bucket_idx, attention_mask)\n\n        if mask is not None:\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # Self mask is ALWAYS applied.\n        # From the reformer paper (https://arxiv.org/pdf/2001.04451.pdf):\n        # \" While attention to the future is not allowed, typical implementations of the\n        # Transformer do allow a position to attend to itself.\n        # Such behavior is undesirable in a shared-QK formulation because the dot-product\n        # of a query vector with itself will almost always be greater than the dot product of a\n        # query vector with a vector at another position. We therefore modify the masking\n        # to forbid a token from attending to itself, except in situations\n        # where a token has no other valid attention targets (e.g. 
the first token in a sequence) \"\n\n        self_mask = torch.ne(query_bucket_idx.unsqueeze(-1), key_value_bucket_idx.unsqueeze(-2)).to(\n            query_bucket_idx.device\n        )\n\n        # apply self_mask\n        query_key_dots = torch.where(self_mask, query_key_dots, self_mask_value)\n\n        # free memory\n        del self_mask\n\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        # dots shape is `[batch_size, num_attn_heads, num_hashes * seq_len // chunk_length, chunk_length, chunk_length * (1 + num_chunks_before + num_chunks_after)]`\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del query_key_dots\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        logits = logits.flatten(start_dim=2, end_dim=3).squeeze(-1)\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        return out_vectors, logits, attention_probs\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask):\n        mask = None\n\n        # Causal mask\n        if self.is_decoder:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask: chunk, look up correct mask value from key_value_bucket_idx\n        # IMPORTANT: official trax code does not use a mask for LSH Atttention. Not sure why.\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, None, :]\n            # expand attn_mask to fit with key_value_bucket_idx shape\n            attention_mask = attention_mask.expand(query_indices.shape[:-1] + (-1,))\n            key_attn_mask = torch.gather(attention_mask, -1, key_indices)\n            query_attn_mask = torch.gather(attention_mask, -1, query_indices)\n            # expand to query_key_dots shape: duplicate along query axis since key sorting is the same for each query position in chunk\n            attn_mask = query_attn_mask.unsqueeze(-1) * key_attn_mask.unsqueeze(-2)\n            # free memory\n            del query_attn_mask, key_attn_mask, attention_mask\n\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n\n        return mask\n\n    def _len_and_dim_norm(self, vectors):\n        \"\"\"\n            length and attention head size dim normalization\n        \"\"\"\n        vectors = self._len_norm(vectors)\n        vectors = vectors * torch.rsqrt(\n            torch.tensor(self.attention_head_size, device=vectors.device, dtype=vectors.dtype)\n        )\n        return vectors\n\n    def _len_norm(self, x, epsilon=1e-6):\n        \"\"\"\n            length normalization\n        \"\"\"\n        variance = torch.mean(x ** 2, -1, keepdim=True)\n        norm_x = x * torch.rsqrt(variance + epsilon)\n        return norm_x\n\n    def _gather_by_expansion(self, vectors, idxs, num_hashes):\n        \"\"\"\n            expand dims of idxs and vectors for all hashes and gather\n        \"\"\"\n        expanded_idxs = idxs.unsqueeze(-1).expand(-1, 
-1, -1, self.attention_head_size)\n        vectors = vectors.repeat(1, 1, num_hashes, 1)\n        return torch.gather(vectors, 2, expanded_idxs)\n\n\nclass ReverseSort(Function):\n    \"\"\"\n        After chunked attention is applied which sorted clusters,\n        original ordering has to be restored.\n        Since customized backward function is used for Reformer,\n        the gradients of the output vectors have to be explicitely\n        sorted here.\n    \"\"\"\n\n    @staticmethod\n    def forward(ctx, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, num_hashes):\n        # save sorted_bucket_idx for backprop\n        with torch.no_grad():\n            ctx.sorted_bucket_idx = sorted_bucket_idx\n            ctx.num_hashes = num_hashes\n\n            # undo sort to have correct order for next layer\n            expanded_undo_sort_indices = undo_sorted_bucket_idx.unsqueeze(-1).expand(out_vectors.shape)\n            out_vectors = torch.gather(out_vectors, 2, expanded_undo_sort_indices)\n            logits = torch.gather(logits, 2, undo_sorted_bucket_idx)\n        return out_vectors, logits\n\n    @staticmethod\n    def backward(ctx, grad_out_vectors, grad_logits):\n        # get parameters saved in ctx\n        sorted_bucket_idx = ctx.sorted_bucket_idx\n        num_hashes = ctx.num_hashes\n\n        # get real gradient shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes\n        grad_logits_shape = grad_logits.shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes x ChunkLen\n        grad_out_vectors_shape = grad_out_vectors.shape\n\n        # split gradient vectors and sorted bucket idxs by concatenated chunk dimension to gather correct indices\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen\n        grad_logits = grad_logits.view((grad_logits_shape[:2] + (num_hashes, -1)))\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen x ChunkLen\n        grad_out_vectors = grad_out_vectors.view(\n            (grad_out_vectors_shape[:2] + (num_hashes, -1) + grad_out_vectors_shape[-1:])\n        )\n\n        # reshape and expand\n        sorted_bucket_idx = torch.reshape(sorted_bucket_idx, (sorted_bucket_idx.shape[:2] + (num_hashes, -1)))\n        expanded_sort_indices = sorted_bucket_idx.unsqueeze(-1).expand(grad_out_vectors.shape)\n        # reverse sort of forward\n        grad_out_vectors = torch.gather(grad_out_vectors, 3, expanded_sort_indices)\n        grad_logits = torch.gather(grad_logits, 3, sorted_bucket_idx)\n\n        # reshape into correct shape\n        grad_logits = torch.reshape(grad_logits, grad_logits_shape)\n        grad_out_vectors = torch.reshape(grad_out_vectors, grad_out_vectors_shape)\n\n        # return grad and `None` fillers for last 3 forward args\n        return grad_out_vectors, grad_logits, None, None, None\n\n\nclass LocalSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n\n        self.num_attention_heads = config.num_attention_heads\n        self.chunk_length = config.local_attn_chunk_length\n        self.num_chunks_before = config.local_num_chunks_before\n        self.num_chunks_after = config.local_num_chunks_after\n        self.is_decoder = config.is_decoder\n        self.pad_token_id = config.pad_token_id\n\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        
# projection matrices\n        self.query = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        self.dropout = config.local_attention_probs_dropout_prob\n\n        # save mask value here\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None, do_output_attentions=False, **kwargs):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # project hidden_states to query, key and value\n        query_vectors = self.query(hidden_states)\n        key_vectors = self.key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # split last dim into `config.num_attention_heads` and `config.attention_head_size`\n        query_vectors = self._split_hidden_size_dim(query_vectors, self.num_attention_heads, self.attention_head_size)\n        key_vectors = self._split_hidden_size_dim(key_vectors, self.num_attention_heads, self.attention_head_size)\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # normalize key vectors\n        key_vectors = key_vectors / torch.sqrt(\n            torch.tensor(self.attention_head_size, device=key_vectors.device, dtype=key_vectors.dtype)\n        )\n\n        # chunk vectors\n        # B x Num_Attn_Head x Seq_Len // chunk_len x chunk_len  x  attn_head_size\n        query_vectors = self._split_seq_length_dim_to(\n            query_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        key_vectors = self._split_seq_length_dim_to(\n            key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        # chunk indices\n        indices = torch.arange(sequence_length, device=query_vectors.device).repeat(\n            batch_size, self.num_attention_heads, 1\n        )\n        query_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, self.num_attention_heads)\n        key_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, 
self.num_attention_heads)\n\n        # append chunks before and after\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n        key_indices = self._look_adjacent(key_indices, self.num_chunks_before, self.num_chunks_after)\n\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        mask = self._compute_attn_mask(query_indices, key_indices, attention_mask, query_key_dots.shape)\n\n        if mask is not None:\n            # get mask tensor depending on half precision or not\n            if query_key_dots.dtype == torch.float16:\n                mask_value = self.mask_value_float16.half()\n            else:\n                mask_value = self.mask_value_float32\n\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # softmax\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del logits\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        assert out_vectors.shape == (batch_size, self.num_attention_heads, sequence_length, self.attention_head_size,)\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LocalSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs)\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask, query_key_dots_shape):\n        mask = None\n\n        # chunk attention mask and look before and after\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, :]\n            attention_mask = self._split_seq_length_dim_to(attention_mask, -1, self.chunk_length, 1)\n            attention_mask_key = self._look_adjacent(attention_mask, self.num_chunks_before, self.num_chunks_after)\n\n        # Causal mask\n        if self.is_decoder is True:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask\n        if attention_mask is not None:\n            # create attn_mask\n            attn_mask = (attention_mask.unsqueeze(-1) * attention_mask_key.unsqueeze(-2)).expand(query_key_dots_shape)\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n        return mask\n\n\nclass ReformerSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        all_head_size = config.num_attention_heads * config.attention_head_size\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = 
nn.Linear(all_head_size, config.hidden_size, bias=False)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ReformerAttention(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.layer_id = layer_id\n        self.attn_layers = config.attn_layers\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        if len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"lsh\":\n            self.self_attention = LSHSelfAttention(config)\n        elif len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"local\":\n            self.self_attention = LocalSelfAttention(config)\n        elif len(set(self.attn_layers)) == 2 and set(self.attn_layers) == set([\"lsh\", \"local\"]):\n            # get correct attn layers\n            if self.attn_layers[self.layer_id] == \"lsh\":\n                self.self_attention = LSHSelfAttention(config)\n            else:\n                self.self_attention = LocalSelfAttention(config)\n        else:\n            raise NotImplementedError(\n                \"Only attn layer types 'lsh' and 'local' exist, but got `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                    self.attn_layers\n                )\n            )\n        self.output = ReformerSelfOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n    ):\n        hidden_states = self.layer_norm(hidden_states)\n\n        # use cached buckets for backprob if buckets not None for LSHSelfAttention\n        self_attention_outputs = self.self_attention(\n            hidden_states=hidden_states,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_attentions=do_output_attentions,\n            buckets=buckets,\n        )\n        attention_output = self.output(self_attention_outputs.hidden_states)\n\n        # add buckets if necessary\n        if hasattr(self_attention_outputs, \"buckets\"):\n            buckets = self_attention_outputs.buckets\n        else:\n            buckets = None\n\n        return AttentionOutput(\n            hidden_states=attention_output, attention_probs=self_attention_outputs.attention_probs, buckets=buckets,\n        )\n\n\nclass ReformerFeedForwardDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        if isinstance(config.hidden_act, str):\n            self.act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.act_fn = config.hidden_act\n\n        self.dense = nn.Linear(config.hidden_size, config.feed_forward_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        hidden_states = self.act_fn(hidden_states)\n        return hidden_states\n\n\nclass ReformerFeedForwardOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = nn.Linear(config.feed_forward_size, 
config.hidden_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ChunkReformerFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense = ReformerFeedForwardDense(config)\n        self.output = ReformerFeedForwardOutput(config)\n\n    def forward(self, attention_output):\n        return apply_chunking_to_forward(\n            self.chunk_size_feed_forward, self.seq_len_dim, self.forward_chunk, attention_output,\n        )\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.layer_norm(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        return self.output(hidden_states)\n\n\nclass ReformerLayer(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.attention = ReformerAttention(config, layer_id)\n        # dropout requires to have the same\n        # seed for forward and backward pass\n        self.attention_seed = None\n        self.feed_forward_seed = None\n\n        self.feed_forward = ChunkReformerFeedForward(config)\n\n    def _init_attention_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            attention layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.attention_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.attention_seed)\n        else:\n            # CPU\n            self.attention_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.attention_seed)\n\n    def _init_feed_forward_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            feed forward layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.feed_forward_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.feed_forward_seed)\n        else:\n            # CPU\n            self.feed_forward_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.feed_forward_seed)\n\n    def forward(\n        self,\n        prev_attn_output,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n    ):\n        with torch.no_grad():\n            # every forward pass we sample a different seed\n            # for dropout and save for forward fn in backward pass\n            # to have correct dropout\n            self._init_attention_seed()\n            attn_outputs = self.attention(\n                
hidden_states=hidden_states,\n                head_mask=head_mask,\n                attention_mask=attention_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = attn_outputs.hidden_states\n\n            # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n            # Y_1 = X_1 + f(X_2)\n            attn_output = prev_attn_output + attn_output\n\n            # free memory\n            del prev_attn_output\n\n            # every forward pass we sample a different seed\n            # for dropout and save seed for forward fn in backward\n            # to have correct dropout\n            self._init_feed_forward_seed()\n            # Y_2 = X_2 + g(Y_1)\n            hidden_states = hidden_states + self.feed_forward(attn_output)\n\n        return ReformerOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            attention_probs=attn_outputs.attention_probs,\n            buckets=attn_outputs.buckets,\n        )\n\n    def backward_pass(\n        self,\n        next_attn_output,\n        hidden_states,\n        grad_attn_output,\n        grad_hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        buckets=None,\n    ):\n        # Implements the backward pass for reversible ResNets.\n        # A good blog post on how this works can be found here:\n        # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n        # This code is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n\n        with torch.enable_grad():\n            next_attn_output.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.feed_forward_seed)\n            # g(Y_1)\n            res_hidden_states = self.feed_forward(next_attn_output)\n            res_hidden_states.backward(grad_hidden_states, retain_graph=True)\n\n        with torch.no_grad():\n            # X_2 = Y_2 - g(Y_1)\n            hidden_states = hidden_states - res_hidden_states\n            del res_hidden_states\n\n            grad_attn_output = grad_attn_output + next_attn_output.grad\n            next_attn_output.grad = None\n\n        with torch.enable_grad():\n            hidden_states.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.attention_seed)\n            # f(X_2)\n            # use cached buckets for backprob if buckets not None for LSHSelfAttention\n            output = self.attention(\n                hidden_states=hidden_states, head_mask=head_mask, attention_mask=attention_mask, buckets=buckets,\n            ).hidden_states\n            output.backward(grad_attn_output, retain_graph=True)\n\n        with torch.no_grad():\n            # X_1 = Y_1 - f(X_2)\n            attn_output = next_attn_output - output\n            del output, next_attn_output\n\n            grad_hidden_states = grad_hidden_states + hidden_states.grad\n            hidden_states.grad = None\n            hidden_states = hidden_states.detach()\n\n        return ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n\nclass _ReversibleFunction(Function):\n    \"\"\"\n    To 
prevent PyTorch from performing the usual backpropagation,\n    a customized backward function is implemented here. This way\n    it is made sure that no memory expensive activations are\n    saved during the forward pass.\n    This function is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n    \"\"\"\n\n    @staticmethod\n    def forward(\n        ctx,\n        hidden_states,\n        layers,\n        attention_mask,\n        head_mask,\n        num_hashes,\n        all_hidden_states,\n        all_attentions,\n        do_output_hidden_states,\n        do_output_attentions,\n    ):\n        all_buckets = ()\n\n        # split duplicated tensor\n        hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)\n\n        for layer, layer_head_mask in zip(layers, head_mask):\n            if do_output_hidden_states is True:\n                all_hidden_states.append(hidden_states)\n\n            layer_outputs = layer(\n                prev_attn_output=attn_output,\n                hidden_states=hidden_states,\n                attention_mask=attention_mask,\n                head_mask=layer_head_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = layer_outputs.attn_output\n            hidden_states = layer_outputs.hidden_states\n            all_buckets = all_buckets + (layer_outputs.buckets,)\n\n            if do_output_attentions:\n                all_attentions.append(layer_outputs.attention_probs)\n\n        # Add last layer\n        if do_output_hidden_states is True:\n            all_hidden_states.append(hidden_states)\n\n        # attach params to ctx for backward\n        ctx.save_for_backward(attn_output.detach(), hidden_states.detach())\n        ctx.layers = layers\n        ctx.all_buckets = all_buckets\n        ctx.head_mask = head_mask\n        ctx.attention_mask = attention_mask\n\n        # Concatenate 2 RevNet outputs\n        return torch.cat([attn_output, hidden_states], dim=-1)\n\n    @staticmethod\n    def backward(ctx, grad_hidden_states):\n        grad_attn_output, grad_hidden_states = torch.chunk(grad_hidden_states, 2, dim=-1)\n\n        # retrieve params from ctx for backward\n        attn_output, hidden_states = ctx.saved_tensors\n\n        # create tuple\n        output = ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n        # free memory\n        del grad_attn_output, grad_hidden_states, attn_output, hidden_states\n\n        layers = ctx.layers\n        all_buckets = ctx.all_buckets\n        head_mask = ctx.head_mask\n        attention_mask = ctx.attention_mask\n\n        for idx, layer in enumerate(layers[::-1]):\n            # pop last buckets from stack\n            buckets = all_buckets[-1]\n            all_buckets = all_buckets[:-1]\n\n            # backprop\n            output = layer.backward_pass(\n                next_attn_output=output.attn_output,\n                hidden_states=output.hidden_states,\n                grad_attn_output=output.grad_attn_output,\n                grad_hidden_states=output.grad_hidden_states,\n                head_mask=head_mask[len(layers) - idx - 1],\n                attention_mask=attention_mask,\n                buckets=buckets,\n            )\n\n        assert all_buckets == (), \"buckets have 
to be empty after backpropagation\"\n        grad_hidden_states = torch.cat([output.grad_attn_output, output.grad_hidden_states], dim=-1)\n\n        # num of return vars has to match num of forward() args\n        # return gradient for hidden_states arg and None for other args\n        return grad_hidden_states, None, None, None, None, None, None, None, None\n\n\nclass ReformerEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.layers = nn.ModuleList([ReformerLayer(config, i) for i in range(config.num_hidden_layers)])\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.layer_norm = nn.LayerNorm(2 * config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        # hidden_states and attention lists to be filled if wished\n        all_hidden_states = []\n        all_attentions = []\n\n        # concat same tensor for reversible ResNet\n        hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)\n        hidden_states = _ReversibleFunction.apply(\n            hidden_states,\n            self.layers,\n            attention_mask,\n            head_mask,\n            num_hashes,\n            all_hidden_states,\n            all_attentions,\n            do_output_hidden_states,\n            do_output_attentions,\n        )\n\n        # Apply layer norm to concatenated hidden states\n        hidden_states = self.layer_norm(hidden_states)\n\n        # Apply dropout\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n\n        return ReformerEncoderOutput(\n            hidden_states=hidden_states, all_hidden_states=all_hidden_states, all_attentions=all_attentions\n        )\n\n\nclass ReformerOnlyLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.seq_len_dim = 1\n        self.chunk_size_lm_head = config.chunk_size_lm_head\n        self.decoder = nn.Linear(2 * config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass ReformerPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ReformerConfig\n    base_model_prefix = \"reformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"input_ids\": input_ids,\n            \"attention_mask\": input_mask,\n        }\n        
return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, AxialPositionEmbeddings):\n            for weight in module.weights:\n                torch.nn.init.normal_(weight, std=self.config.axial_norm_std)\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nREFORMER_START_DOCSTRING = r\"\"\"\n    Reformer was proposed in\n    `Reformer: The Efficient Transformer`_\n    by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.\n\n    .. _`Reformer: The Efficient Transformer`:\n        https://arxiv.org/abs/2001.04451\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ReformerConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nREFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            During training the input_ids sequence_length has to be a multiple of the relevant model's\n            chunk lengths (lsh's, local's or both). During evaluation, the indices are automatically\n            padded to be a multiple of the chunk length.\n\n            Indices can be obtained using :class:`transformers1.ReformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        num_hashes (:obj:`int`, `optional`, defaults to :obj:`None`):\n            `num_hashes` is the number of hashing rounds that should be performed during\n            bucketing. Setting `num_hashes` overwrites the default `num_hashes` defined\n            in `config.num_hashes`.\n            For more information, see `num_hashes` in :class:`transformers1.ReformerConfig`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Reformer Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    REFORMER_START_DOCSTRING,\n)\nclass ReformerModel(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        assert (\n            self.config.num_hidden_layers > 0\n        ), \"`config.attn_layers` is empty. Select at least one attn layer form ['lsh', 'local']\"\n\n        self.embeddings = ReformerEmbeddings(config)\n        self.encoder = ReformerEncoder(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding 
outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModel, ReformerTokenizer\n        import torch\n\n        tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModel.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n\n        # TODO(PVP): delete when PR to change output_attentions is made\n        do_output_attentions = self.config.output_attentions\n        do_output_hidden_states = self.config.output_hidden_states\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()  # noqa: F841\n            device = input_ids.device\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]  # noqa: F841\n            device = inputs_embeds.device\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        assert (\n            len(input_shape) == 2\n        ), \"`input_ids` have be of shape `[batch_size, sequence_length]`, but got shape: {}\".format(input_shape)\n\n        # prepare head mask\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers, is_attention_chunked=True)\n\n        # original sequence length for padding\n        orig_sequence_length = input_shape[-1]\n\n        # if needs padding\n        least_common_mult_chunk_length = _get_least_common_mult_chunk_len(self.config)\n        must_pad_to_match_chunk_length = input_shape[-1] % least_common_mult_chunk_length != 0\n\n        if must_pad_to_match_chunk_length:\n            padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length\n\n            if self.training is True:\n                raise ValueError(\n                    \"If training, sequence Length {} has to be a multiple of least common multiple chunk_length {}. 
Please consider padding the input to a length of {}.\".format(\n                        input_shape[-1], least_common_mult_chunk_length, input_shape[-1] + padding_length\n                    )\n                )\n\n            # pad input\n            input_ids, inputs_embeds, attention_mask, position_ids, input_shape = self._pad_to_mult_of_chunk_length(\n                input_ids,\n                inputs_embeds=inputs_embeds,\n                attention_mask=attention_mask,\n                position_ids=position_ids,\n                input_shape=input_shape,\n                padding_length=padding_length,\n                padded_seq_length=least_common_mult_chunk_length,\n                device=device,\n            )\n\n        embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, inputs_embeds=inputs_embeds)\n\n        encoder_outputs = self.encoder(\n            hidden_states=embedding_output,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n        sequence_output = encoder_outputs.hidden_states\n\n        # if padding was applied\n        if must_pad_to_match_chunk_length:\n            sequence_output = sequence_output[:, :orig_sequence_length]\n\n        outputs = (sequence_output,)\n        # TODO(PVP): Replace by named tuple after namedtuples are introduced in the library.\n        if do_output_hidden_states is True:\n            outputs = outputs + (encoder_outputs.all_hidden_states,)\n        if do_output_attentions is True:\n            outputs = outputs + (encoder_outputs.all_attentions,)\n        return outputs\n\n    def _pad_to_mult_of_chunk_length(\n        self,\n        input_ids,\n        inputs_embeds=None,\n        attention_mask=None,\n        position_ids=None,\n        input_shape=None,\n        padding_length=None,\n        padded_seq_length=None,\n        device=None,\n    ):\n        logger.info(\n            \"Input ids are automatically padded from {} to {} to be a multiple of `config.chunk_length`: {}\".format(\n                input_shape[-1], input_shape[-1] + padding_length, padded_seq_length\n            )\n        )\n\n        padded_input_ids = torch.full(\n            (input_shape[0], padding_length), self.config.pad_token_id, device=device, dtype=torch.long,\n        )\n\n        # Extend `attention_mask`\n        if attention_mask is not None:\n            attention_mask = torch.cat(\n                [\n                    attention_mask,\n                    torch.zeros(input_shape[0], padding_length, device=device, dtype=attention_mask.dtype,),\n                ],\n                dim=-1,\n            )\n        else:\n            attention_mask = torch.cat(\n                [\n                    torch.ones(input_shape, device=device, dtype=torch.uint8),\n                    torch.zeros((input_shape[0], padding_length), device=device, dtype=torch.uint8),\n                ],\n                dim=-1,\n            )\n\n        # Extend `input_ids` with padding to match least common multiple chunk_length\n        if input_ids is not None:\n            input_ids = torch.cat([input_ids, padded_input_ids], dim=-1)\n            input_shape = input_ids.size()\n\n            # Pad position ids if given\n            if position_ids is not None:\n                padded_position_ids = torch.arange(input_shape[-1], padded_seq_length, 
dtype=torch.long, device=device)\n                padded_position_ids = position_ids.unsqueeze(0).expand(input_shape[0], padding_length)\n                position_ids = torch.cat([position_ids, padded_position_ids], dim=-1)\n\n        # Extend `inputs_embeds` with padding to match least common multiple chunk_length\n        if inputs_embeds is not None:\n            padded_inputs_embeds = self.embeddings(padded_input_ids, position_ids)\n            inputs_embeds = torch.cat([inputs_embeds, padded_inputs_embeds], dim=-2)\n            input_shape = inputs_embeds.size()\n        return input_ids, inputs_embeds, attention_mask, position_ids, input_shape\n\n\n@add_start_docstrings(\"\"\"Reformer Model with a `language modeling` head on top. \"\"\", REFORMER_START_DOCSTRING)\nclass ReformerModelWithLMHead(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.reformer = ReformerModel(config)\n        self.lm_head = ReformerOnlyLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def tie_weights(self):\n        # word embeddings are not tied in Reformer\n        pass\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        position_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        labels=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModelWithLMHead, ReformerTokenizer\n        import torch\n\n   
     tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModelWithLMHead.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n\n        loss, prediction_scores = outputs[:2]\n        \"\"\"\n\n        reformer_outputs = self.reformer(\n            input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n\n        sequence_output = reformer_outputs[0]\n        logits = self.lm_head(sequence_output)\n        outputs = (logits,) + reformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n        return outputs  # (lm_loss), lm_logits, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # TODO(PVP): Add smart caching\n        inputs_dict = {\"input_ids\": input_ids}\n\n        if \"num_hashes\" in kwargs:\n            inputs_dict[\"num_hashes\"] = kwargs[\"num_hashes\"]\n\n        return inputs_dict\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertLayerNorm, BertModel, BertPreTrainedModel, gelu\nfrom .modeling_utils import create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    \"roberta-base-openai-detector\",\n    \"roberta-large-openai-detector\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass RobertaEmbeddings(BertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.padding_idx = config.pad_token_id\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=self.padding_idx)\n        self.position_embeddings = nn.Embedding(\n            config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx\n        )\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. Any padded tokens remain padded.\n                position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx).to(input_ids.device)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super().forward(\n            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds\n        )\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. 
We cannot infer which are padded so just generate\n        sequential position ids.\n\n        :param torch.Tensor inputs_embeds:\n        :return torch.Tensor:\n        \"\"\"\n        input_shape = inputs_embeds.size()[:-1]\n        sequence_length = input_shape[1]\n\n        position_ids = torch.arange(\n            self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device\n        )\n        return position_ids.unsqueeze(0).expand(input_shape)\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaModel(BertModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.BertModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = RobertaEmbeddings(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. \"\"\", ROBERTA_START_DOCSTRING)\nclass RobertaForMaskedLM(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary 
token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMaskedLM\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\nclass RobertaLMHead(nn.Module):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, features, **kwargs):\n        x = self.dense(features)\n        x = gelu(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x)\n\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForSequenceClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.classifier = RobertaClassificationHead(config)\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForSequenceClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = 
(logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForMultipleChoice(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        labels=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMultipleChoice\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMultipleChoice.from_pretrained('roberta-base')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        outputs = self.roberta(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            head_mask=head_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForTokenClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForTokenClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are 
here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\nclass RobertaClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForQuestionAnswering(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape 
:obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint roberta-large is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import RobertaTokenizer, RobertaForQuestionAnswering\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForQuestionAnswering.from_pretrained('roberta-base')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_ids = tokenizer.encode(question, text)\n        start_scores, end_scores = model(torch.tensor([input_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, 
end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 Mesh TensorFlow authors, T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch T5 model. \"\"\"\n\n\nimport copy\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n####################################################\n# This dict contrains shortcut names and associated url\n# for the pretrained weights provided with the models\n####################################################\nT5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n\n####################################################\n# This is a conversion method from TF 1.0 to PyTorch\n# More details: https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28\n####################################################\ndef load_tf_weights_in_t5(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        tf_weights[name] = array\n\n    for txt_name in names:\n        name = txt_name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        if \"_slot_\" in name[-1]:\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        pointer = model\n        array = tf_weights[txt_name]\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] in [\"kernel\", \"scale\", \"embedding\"]:\n                pointer = getattr(pointer, \"weight\")\n            # elif scope_names[0] == 'scale':\n            #     pointer = getattr(pointer, 'weight')\n            # elif scope_names[0] == 'output_bias' or scope_names[0] == 'beta':\n            #     pointer = getattr(pointer, 'bias')\n            # elif scope_names[0] == 'squad':\n            #     pointer = getattr(pointer, 'classifier')\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if scope_names[0] not in [\"kernel\", \"scale\", \"embedding\"]:\n            pointer = getattr(pointer, \"weight\")\n        if scope_names[0] != \"embedding\":\n            logger.info(\"Transposing numpy weight of shape {} for {}\".format(array.shape, name))\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array.astype(np.float32))\n        tf_weights.pop(txt_name, None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    # logger.info(\"Weights not copied to PyTorch model: {}\".format(', '.join(tf_weights.keys())))\n    return model\n\n\n####################################################\n# PyTorch Models are constructed by sub-classing\n# - torch.nn.Module for the layers and\n# - PreTrainedModel for the models (it-self a sub-class of 
torch.nn.Module)\n####################################################\n\n\nclass T5LayerNorm(nn.Module):\n    def __init__(self, hidden_size, eps=1e-6):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__()\n        self.weight = nn.Parameter(torch.ones(hidden_size))\n        self.variance_epsilon = eps\n\n    def forward(self, x):\n        # layer norm should always be calculated in float32\n        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)\n        x = x / torch.sqrt(variance + self.variance_epsilon)\n\n        if self.weight.dtype == torch.float16:\n            x = x.to(torch.float16)\n        return self.weight * x\n\n\nclass T5DenseReluDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)\n        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        h = self.wi(hidden_states)\n        h = F.relu(h)\n        h = self.dropout(h)\n        h = self.wo(h)\n        return h\n\n\nclass T5LayerFF(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.DenseReluDense = T5DenseReluDense(config)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x)\n        layer_output = hidden_states + self.dropout(y)\n        return layer_output\n\n\nclass T5Attention(nn.Module):\n    def __init__(self, config: T5Config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.dropout = config.dropout_rate\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, self.d_kv)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q = prune_linear_layer(self.q, index)\n        self.k = prune_linear_layer(self.k, index)\n        self.v = prune_linear_layer(self.v, index)\n        self.o = prune_linear_layer(self.o, index, dim=1)\n      
  # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.inner_dim = self.d_kv * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += (n < 0).to(torch.long) * num_buckets  # mtf.to_int32(mtf.less(n, 0)) * num_buckets\n            n = torch.abs(n)\n        else:\n            n = torch.max(n, torch.zeros_like(n))\n        # now n is in the range [0, inf)\n\n        # half of the buckets are for exact increments in positions\n        max_exact = num_buckets // 2\n        is_small = n < max_exact\n\n        # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance\n        val_if_large = max_exact + (\n            torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)\n        ).to(torch.long)\n        val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))\n\n        ret += torch.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = torch.arange(qlen, dtype=torch.long)[:, None]\n        memory_position = torch.arange(klen, dtype=torch.long)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position,  # shape (qlen, klen)\n            bidirectional=not self.is_decoder,\n            num_buckets=self.relative_attention_num_buckets,\n        )\n        rp_bucket = rp_bucket.to(self.relative_attention_bias.weight.device)\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def forward(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        
position_bias=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = input.size()\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + past_key_value_state[0].shape[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = kv.size(1)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, self.d_kv).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.inner_dim)\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = torch.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  
# (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass T5LayerSelfAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.SelfAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5LayerCrossAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.EncDecAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        query_length=None,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            query_length=query_length,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5Block(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.layer = nn.ModuleList()\n        self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n        if self.is_decoder:\n            self.layer.append(T5LayerCrossAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n\n        self.layer.append(T5LayerFF(config))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n\n  
      if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. Need to inject it here\n            if present_key_value_state is not None:\n                query_length = present_key_value_state[0].shape[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass T5PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    load_tf_weights = load_tf_weights_in_t5\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = 
torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"decoder_input_ids\": input_ids,\n            \"input_ids\": input_ids,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        factor = self.config.initializer_factor  # Used for testing weights initialization\n        if isinstance(module, T5LayerNorm):\n            module.weight.data.fill_(factor * 1.0)\n        elif isinstance(module, (T5Model, T5ForConditionalGeneration)):\n            # Mesh TensorFlow embeddings initialization\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624\n            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)\n        elif isinstance(module, T5DenseReluDense):\n            # Mesh TensorFlow FF initialization\n            # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56\n            # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89\n            module.wi.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))\n            if hasattr(module.wi, \"bias\") and module.wi.bias is not None:\n                module.wi.bias.data.zero_()\n            module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))\n            if hasattr(module.wo, \"bias\") and module.wo.bias is not None:\n                module.wo.bias.data.zero_()\n        elif isinstance(module, T5Attention):\n            # Mesh TensorFlow attention initialization to avoid scaling before softmax\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136\n            d_model = self.config.d_model\n            d_kv = self.config.d_kv\n            n_heads = self.config.num_heads\n            module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * d_kv) ** -0.5))\n            module.k.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.v.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * d_kv) ** -0.5))\n            if module.has_relative_attention_bias:\n                module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))\n\n    def _shift_right(self, input_ids):\n        decoder_start_token_id = self.config.decoder_start_token_id\n        pad_token_id = self.config.pad_token_id\n\n        assert (\n            decoder_start_token_id is not None\n        ), \"self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. 
See T5 docs for more information\"\n\n        # shift inputs to the right\n        shifted_input_ids = input_ids.new_zeros(input_ids.shape)\n        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()\n        shifted_input_ids[..., 0] = decoder_start_token_id\n\n        assert pad_token_id is not None, \"self.model.config.pad_token_id has to be defined.\"\n        # replace possible -100 values in lm_labels by `pad_token_id`\n        shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)\n\n        assert torch.all(shifted_input_ids >= 0).item(), \"Verify that `lm_labels` has only positive values and -100\"\n\n        return shifted_input_ids\n\n\nclass T5Stack(T5PreTrainedModel):\n    def __init__(self, config, embed_tokens=None):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.block = nn.ModuleList(\n            [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]\n        )\n        self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embed_tokens = new_embeddings\n\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            if self.is_decoder:\n                raise ValueError(\"You have to specify either decoder_input_ids or decoder_inputs_embeds\")\n            else:\n                raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to intialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(input_ids)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_sates\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = past_key_value_states[0][0].shape[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is 
not None:\n            encoder_seq_length = encoder_hidden_states.shape[1]\n            encoder_attention_mask = torch.ones(\n                batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long\n            )\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_layers)\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)  # We keep only self-attention weights for now\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as 
a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (presents,) (all hidden states), (all attentions)\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            To know more on how to prepare :obj:`input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all `decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `decoder_past_key_value_states`).\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass T5Model(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = 
copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n            from transformers1 import T5Tokenizer, T5Model\n\n            tokenizer = T5Tokenizer.from_pretrained('t5-small')\n            model = T5Model.from_pretrained('t5-small')\n        
    input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass T5ForConditionalGeneration(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.model_dim = config.d_model\n\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        lm_labels=None,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            If `past_key_value_states` is used only the last prediction_scores of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output 
of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, T5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model.generate(input_ids)\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:\n            # get decoder inputs from shifting lm labels to the right\n            decoder_input_ids = self._shift_right(lm_labels)\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            assert lm_labels is None, \"Decoder should not use cached key value states when training.\"\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0]\n        # Rescale output before projecting on vocab\n        # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586\n        sequence_output = sequence_output * (self.model_dim ** 
-0.5)\n        lm_logits = self.lm_head(sequence_output)\n\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]  # Add hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = CrossEntropyLoss(ignore_index=-100)\n            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))\n            # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666\n            decoder_outputs = (loss,) + decoder_outputs\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"decoder_input_ids\": input_ids,\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (\n                    layer_past_state.index_select(0, beam_idx),\n                )\n\n            assert reordered_layer_past_states[0].shape == layer_past_states[0].shape\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 ALBERT model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertSelfAttention\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\nclass TFAlbertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.config.vocab_size, self.config.embedding_size],\n                initializer=get_initializer(self.config.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, embedding_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n        x = tf.reshape(inputs, [-1, self.config.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n        return tf.reshape(logits, [batch_size, length, self.config.vocab_size])\n\n\nclass TFAlbertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n           
     \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFAlbertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFAlbertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, 
kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFAlbertAttention(TFBertSelfAttention):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.hidden_size = config.hidden_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(input_tensor)[0]\n        mixed_query_layer = self.query(input_tensor)\n        mixed_key_layer = self.key(input_tensor)\n        mixed_value_layer = self.value(input_tensor)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        self_outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n\n        hidden_states = self_outputs[0]\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        attention_output = self.LayerNorm(hidden_states + input_tensor)\n\n        # add attentions if we output them\n        outputs = (attention_output,) 
+ self_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFAlbertAttention(config, name=\"attention\")\n\n        self.ffn = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn\"\n        )\n\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.ffn_output = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn_output\"\n        )\n        self.full_layer_layer_norm = tf.keras.layers.LayerNormalization(\n            epsilon=config.layer_norm_eps, name=\"full_layer_layer_norm\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        ffn_output = self.ffn(attention_outputs[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_outputs[0])\n\n        # add attentions if we output them\n        outputs = (hidden_states,) + attention_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayerGroup(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = [\n            TFAlbertLayer(config, name=\"albert_layers_._{}\".format(i)) for i in range(config.inner_group_num)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer([hidden_states, attention_mask, head_mask[layer_index]], training=training)\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        # last-layer hidden state, (layer hidden states), (layer attentions)\n        return outputs\n\n\nclass TFAlbertTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            
name=\"embedding_hidden_mapping_in\",\n        )\n        self.albert_layer_groups = [\n            TFAlbertLayerGroup(config, name=\"albert_layer_groups_._{}\".format(i))\n            for i in range(config.num_hidden_groups)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                [\n                    hidden_states,\n                    attention_mask,\n                    head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n                ],\n                training=training,\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n\n        # last-layer hidden state, (all hidden states), (all attentions)\n        return outputs\n\n\nclass TFAlbertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n\nclass TFAlbertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        self.dense = tf.keras.layers.Dense(\n            config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        self.decoder_bias = self.add_weight(\n            shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"decoder/bias\"\n        )\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states, mode=\"linear\") + 
self.decoder_bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFAlbertMainLayer(tf.keras.layers.Layer):\n    config_class = AlbertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFAlbertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFAlbertTransformer(config, name=\"encoder\")\n        self.pooler = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"pooler\",\n        )\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # 
Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output[:, 0])\n\n        # add hidden_states and attentions if they are here\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]\n        # sequence_output, pooled_output, (hidden_states), (attentions)\n        return outputs\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:\n        https://arxiv.org/abs/1909.11942\n\n    .. _`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Albert Model transformer outputing raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertModel(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n        Returns:\n            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n            last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n                Sequence of hidden-states at the output of the last layer of the model.\n            pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Albert pretraining. 
This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n                tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            import tensorflow as tf\n            from transformers1 import AlbertTokenizer, TFAlbertModel\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = TFAlbertModel.from_pretrained('albert-base-v2')\n            input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n            outputs = model(input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top for pre-training:\n    a `masked language modeling` head and a `sentence order prediction` (classification) head. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForPreTraining(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n        self.sop_classifier = TFAlbertSOPHead(config, name=\"sop_classifier\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n    Examples::\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForPreTraining\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, sop_scores = outputs[:2]\n        \"\"\"\n\n        outputs = self.albert(inputs, **kwargs)\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output, training=kwargs.get(\"training\", False))\n        outputs = (prediction_scores, sop_scores) + outputs[2:]\n        return outputs\n\n\nclass TFAlbertSOPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\",\n        )\n\n    def call(self, pooled_output, training: bool):\n        dropout_pooled_output = self.dropout(pooled_output, training=training)\n        logits = 
self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\"\"\"Albert Model with a `language modeling` head on top. \"\"\", ALBERT_START_DOCSTRING)\nclass TFAlbertForMaskedLM(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMaskedLM\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.predictions(sequence_output, training=kwargs.get(\"training\", False))\n\n        # Add hidden states and attention if they are here\n        outputs = (prediction_scores,) + outputs[2:]\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForSequenceClassification(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`)\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForSequenceClassification\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForQuestionAnswering\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForMultipleChoice(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMultipleChoice\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMultipleChoice.from_pretrained('albert-base-v2')\n\n        example1 = [\"This is a context\", \"Is it a context? Yes\"]\n        example2 = [\"This is a context\", \"Is it a context? 
No\"]\n        encoding = tokenizer.batch_encode_plus([example1, example2], return_tensors='tf', truncation_strategy=\"only_first\", pad_to_max_length=True, max_length=128)\n        outputs = model(encoding[\"input_ids\"][None, :])\n        logits = outputs[0]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            print(\"isdict(1)\")\n            input_ids = inputs.get(\"input_ids\")\n            print(input_ids)\n\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.albert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLNetConfig,\n)\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_tf_albert import (\n    TFAlbertForMaskedLM,\n    TFAlbertForMultipleChoice,\n    TFAlbertForPreTraining,\n    TFAlbertForQuestionAnswering,\n    TFAlbertForSequenceClassification,\n    TFAlbertModel,\n)\nfrom .modeling_tf_bert import (\n    TFBertForMaskedLM,\n    TFBertForMultipleChoice,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFBertForTokenClassification,\n    TFBertModel,\n)\nfrom .modeling_tf_ctrl import TFCTRLLMHeadModel, TFCTRLModel\nfrom .modeling_tf_distilbert import (\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFDistilBertForSequenceClassification,\n    TFDistilBertForTokenClassification,\n    TFDistilBertModel,\n)\nfrom .modeling_tf_gpt2 import TFGPT2LMHeadModel, TFGPT2Model\nfrom .modeling_tf_openai import TFOpenAIGPTLMHeadModel, TFOpenAIGPTModel\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForQuestionAnswering,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\nfrom .modeling_tf_t5 import TFT5ForConditionalGeneration, TFT5Model\nfrom .modeling_tf_transfo_xl import TFTransfoXLLMHeadModel, TFTransfoXLModel\nfrom .modeling_tf_xlm import (\n    TFXLMForQuestionAnsweringSimple,\n    TFXLMForSequenceClassification,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n)\nfrom .modeling_tf_xlnet import (\n    TFXLNetForQuestionAnsweringSimple,\n    TFXLNetForSequenceClassification,\n    TFXLNetForTokenClassification,\n    TFXLNetLMHeadModel,\n    TFXLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_MODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5Model),\n        (DistilBertConfig, TFDistilBertModel),\n        (AlbertConfig, TFAlbertModel),\n        (RobertaConfig, TFRobertaModel),\n        (BertConfig, TFBertModel),\n        (OpenAIGPTConfig, TFOpenAIGPTModel),\n        (GPT2Config, TFGPT2Model),\n        (TransfoXLConfig, TFTransfoXLModel),\n        (XLNetConfig, TFXLNetModel),\n        (XLMConfig, TFXLMModel),\n        (CTRLConfig, TFCTRLModel),\n    ]\n)\n\nTF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForPreTraining),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForPreTraining),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n     
   (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForMaskedLM),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForMaskedLM),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n        (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForSequenceClassification),\n        (AlbertConfig, TFAlbertForSequenceClassification),\n        (RobertaConfig, TFRobertaForSequenceClassification),\n        (BertConfig, TFBertForSequenceClassification),\n        (XLNetConfig, TFXLNetForSequenceClassification),\n        (XLMConfig, TFXLMForSequenceClassification),\n    ]\n)\n\nTF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [(BertConfig, TFBertForMultipleChoice), (AlbertConfig, TFAlbertForMultipleChoice)]\n)\n\nTF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForQuestionAnswering),\n        (AlbertConfig, TFAlbertForQuestionAnswering),\n        (RobertaConfig, TFRobertaForQuestionAnswering),\n        (BertConfig, TFBertForQuestionAnswering),\n        (XLNetConfig, TFXLNetForQuestionAnsweringSimple),\n        (XLMConfig, TFXLMForQuestionAnsweringSimple),\n    ]\n)\n\nTF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForTokenClassification),\n        (RobertaConfig, TFRobertaForTokenClassification),\n        (BertConfig, TFBertForTokenClassification),\n        (XLNetConfig, TFXLNetForTokenClassification),\n    ]\n)\n\n\nclass TFAutoModel(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModel is designed to be instantiated \"\n            \"using the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: TFDistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: TFRobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: TFBertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: TFOpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: TFGPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: TFCTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TFTransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: TFXLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: TFXLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. 
Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModel.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForPreTraining(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForPreTraining is designed to be instantiated \"\n            \"using the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers1.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.TFDistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.TFRobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.TFGPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.TFCTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.TFT5ForConditionalGeneration` (T5 model)\n            - `distilbert`: :class:`~transformers1.TFDistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.TFAlbertForPreTraining` (ALBERT model)\n            - `roberta`: :class:`~transformers1.TFRobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.TFGPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet 
model)\n            - `xlm`: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.TFCTRLLMHeadModel` (Salesforce CTRL model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model.\n                (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or\n                automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the\n                  underlying model's ``__init__`` method (we assume all relevant updates to the configuration have\n                  already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class\n                  initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of\n                  ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute\n                  with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration\n                  attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelWithLMHead(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelWithLMHead is designed to be instantiated \"\n            \"using the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: OpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: GPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: CTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights 
saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelWithLMHead.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFAlbertForMultipleChoice (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `albert` configuration class: AlbertModel (Albert model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForMultipleChoice.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the multiple choice model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFAlbertForMultipleChoice (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). 
In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForMultipleChoice.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForMultipleChoice.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForSequenceClassification(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `albert`: TFAlbertForSequenceClassification (ALBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the 
base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a 
`PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForSequenceClassification.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForQuestionAnswering(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def 
from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `albert` configuration class: AlbertModel (ALBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path 
to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForTokenClassification:\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBert model)\n                    - isInstance of `roberta` configuration class: RobteraModel (Roberta model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `bert`: BertForTokenClassification (Bert model)\n            - `xlnet`: XLNetForTokenClassification (XLNet model)\n            - `distilbert`: DistilBertForTokenClassification (DistilBert model)\n            - `roberta`: RobertaForTokenClassification (Roberta model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. 
Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 BERT model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_bert import BertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n    \"gelu_new\": tf.keras.layers.Activation(gelu_new),\n}\n\n\nclass TFBertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.hidden_size = config.hidden_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.hidden_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = 
self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFBertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = tf.matmul(\n            query_layer, key_layer, transpose_b=True\n        )  # (batch size, num_heads, seq_len_q, seq_len_k)\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)  # scale attention_scores\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() 
function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFBertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.self_attention = TFBertSelfAttention(config, name=\"self\")\n        self.dense_output = TFBertSelfOutput(config, name=\"output\")\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        self_outputs = self.self_attention([input_tensor, attention_mask, head_mask], training=training)\n        attention_output = self.dense_output([self_outputs[0], input_tensor], training=training)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertIntermediate(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass TFBertOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n      
  self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFBertAttention(config, name=\"attention\")\n        self.intermediate = TFBertIntermediate(config, name=\"intermediate\")\n        self.bert_output = TFBertOutput(config, name=\"output\")\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        attention_output = attention_outputs[0]\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.bert_output([intermediate_output, attention_output], training=training)\n        outputs = (layer_output,) + attention_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertEncoder(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = [TFBertLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.num_hidden_layers)]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # outputs, (hidden states), (attentions)\n\n\nclass TFBertPooler(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n\n    def call(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        return pooled_output\n\n\nclass TFBertPredictionHeadTransform(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass TFBertLMPredictionHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.transform = TFBertPredictionHeadTransform(config, name=\"transform\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\nclass TFBertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.predictions = TFBertLMPredictionHead(config, input_embeddings, name=\"predictions\")\n\n    def call(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass TFBertNSPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.seq_relationship = tf.keras.layers.Dense(\n            2, kernel_initializer=get_initializer(config.initializer_range), name=\"seq_relationship\"\n        )\n\n    def call(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\n@keras_serializable\nclass TFBertMainLayer(tf.keras.layers.Layer):\n    config_class = BertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFBertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.pooler = TFBertPooler(config, name=\"pooler\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        
training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, 
head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\nclass TFBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    base_model_prefix = \"bert\"\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertModel(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. 
This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertModel\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertModel.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training:\n    a `masked language modeling` head and a `next sentence prediction (classification)` head. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForPreTraining(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForPreTraining\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForPreTraining.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. 
\"\"\", BERT_START_DOCSTRING)\nclass TFBertForMaskedLM(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMaskedLM\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMaskedLM.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass TFBertForNextSentencePrediction(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        seq_relationship_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`)\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForNextSentencePrediction\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='tf')\n\n        logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]\n        assert logits[0][0] < logits[0][1] # the next sentence was random\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForSequenceClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForSequenceClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForMultipleChoice(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMultipleChoice\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='tf', pad_to_max_length=True)\n\n        # linear classifier on the output is not yet trained\n        outputs = model(encoding['input_ids'][None, :])\n        logits = outputs[0]\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = 
inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.bert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForTokenClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForTokenClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForTokenClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForQuestionAnswering(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForQuestionAnswering\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :], token_type_ids=tf.constant(token_type_ids)[None, :])\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(tf.squeeze(start_scores)) : tf.math.argmax(tf.squeeze(end_scores))+1])\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CamemBERT model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model_size))\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(np.arange(position)[:, np.newaxis], np.arange(d_model_size)[np.newaxis, :], d_model_size)\n\n    sines = np.sin(angle_rads[:, 0::2])\n    cosines = np.cos(angle_rads[:, 1::2])\n\n    # pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)\n    pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = tf.matmul(q, k, transpose_b=True)\n\n    dk = tf.cast(shape_list(k)[-1], tf.float32)\n    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n\n    if mask is not None:\n        scaled_attention_logits += mask * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = tf.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n    def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = tf.keras.layers.Dense(d_model_size, name=\"Wq\")\n        self.Wk = tf.keras.layers.Dense(d_model_size, name=\"Wk\")\n        self.Wv = tf.keras.layers.Dense(d_model_size, name=\"Wv\")\n\n        self.dense = tf.keras.layers.Dense(d_model_size, name=\"dense\")\n\n    def split_into_heads(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n        return 
tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        v, k, q, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        batch_size = shape_list(q)[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            k = tf.concat((past_key, k), axis=-2)\n            v = tf.concat((past_value, v), axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack((k, v), axis=0)\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model_size))\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff, name=\"\"):\n    return tf.keras.Sequential(\n        [tf.keras.layers.Dense(dff, activation=\"relu\", name=\"0\"), tf.keras.layers.Dense(d_model_size, name=\"2\")],\n        name=\"ffn\",\n    )\n\n\nclass TFEncoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.multi_head_attention = TFMultiHeadAttention(\n            d_model_size, num_heads, output_attentions, name=\"multi_head_attention\"\n        )\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff, name=\"ffn\")\n\n        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm1\")\n        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm2\")\n\n        self.dropout1 = tf.keras.layers.Dropout(rate)\n        self.dropout2 = tf.keras.layers.Dropout(rate)\n\n    def call(self, inputs, training=False):\n        x, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            [normed, normed, normed, mask, layer_past, attention_mask, head_mask, use_cache], training=training\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output, training=training)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output, training=training)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\n@keras_serializable\nclass TFCTRLMainLayer(tf.keras.layers.Layer):\n    config_class = CTRLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        
self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)\n\n        self.w = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"w\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [\n            TFEncoderLayer(\n                config.n_embd,\n                config.n_head,\n                config.dff,\n                config.resid_pdrop,\n                config.layer_norm_epsilon,\n                config.output_attentions,\n                name=\"h_._{}\".format(i),\n            )\n            for i in range(config.n_layer)\n        ]\n        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"layernorm\")\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # If using past key value states, only the last tokens\n        # should be given as an input\n        if past is not None:\n            if input_ids is not None:\n                input_ids = input_ids[:, -1:]\n            if inputs_embeds is not None:\n                inputs_embeds = inputs_embeds[:, -1:]\n            if token_type_ids is not None:\n                token_type_ids = token_type_ids[:, -1:]\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids 
and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n            position_ids = tf.tile(position_ids, [input_shape[0], 1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_layers\n\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.w(token_type_ids, mode=\"embedding\")\n            token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n        else:\n            token_type_embeds = 0\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids, mode=\"embedding\")\n        seq_len = input_shape[-1]\n        mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)\n\n        inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n\n        pos_embeds = tf.gather(self.pos_encoding, position_ids)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(tf.reshape(hidden_states, output_shape),)\n            outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n            hidden_states, present = outputs[:2]\n\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\nclass TFCTRLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should 
be passed as input_ids (see `past`).\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). 
Defaults to `True`.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLModel.from_pretrained('ctrl')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFCTRLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def 
call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLLMHeadModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n        self.lm_head = TFCTRLLMHead(config, self.transformer.w, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = tf.constant([tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)])\n        outputs = model(input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 DistilBERT model\n\"\"\"\n\n\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFEmbeddings(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dim = config.dim\n        self.initializer_range = config.initializer_range\n        self.word_embeddings = TFSharedEmbeddings(\n            config.vocab_size, config.dim, initializer_range=config.initializer_range, name=\"word_embeddings\"\n        )  # padding_idx=0)\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.dim,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"position_embeddings\",\n        )\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create 
and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\", shape=[self.vocab_size, self.dim], initializer=get_initializer(self.initializer_range)\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, inputs_embeds=None, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, inputs_embeds=inputs_embeds, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, inputs_embeds=None, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: tf.Tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: tf.Tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        if not isinstance(inputs, (tuple, list)):\n            input_ids = inputs\n            position_ids = None\n        else:\n            input_ids, position_ids = inputs\n\n        if input_ids is not None:\n            seq_length = shape_list(input_ids)[1]\n        else:\n            seq_length = shape_list(inputs_embeds)[1]\n\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = inputs_embeds + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings, training=training)  # (bs, max_seq_length, dim)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.dim])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFMultiHeadSelfAttention(tf.keras.layers.Layer):\n    
def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"q_lin\"\n        )\n        self.k_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"k_lin\"\n        )\n        self.v_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"v_lin\"\n        )\n        self.out_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"out_lin\"\n        )\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        query: tf.Tensor(bs, seq_length, dim)\n        key: tf.Tensor(bs, seq_length, dim)\n        value: tf.Tensor(bs, seq_length, dim)\n        mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: tf.Tensor(bs, seq_length, dim)\n            Contextualized layer. Optional: only if `output_attentions=True`\n        \"\"\"\n        query, key, value, mask, head_mask = inputs\n        bs, q_length, dim = shape_list(query)\n        k_length = shape_list(key)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshape = [bs, 1, 1, k_length]\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, q_length, k_length)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))            # (bs, n_heads, q_length, k_length)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, 
weights)\n        else:\n            return (context,)\n\n\nclass TFFFN(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.lin1 = tf.keras.layers.Dense(\n            config.hidden_dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin1\"\n        )\n        self.lin2 = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin2\"\n        )\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = (\n            tf.keras.layers.Activation(gelu) if config.activation == \"gelu\" else tf.keras.activations.relu\n        )\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFTransformerBlock(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.hidden_dim = config.hidden_dim\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.activation = config.activation\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = TFMultiHeadSelfAttention(config, name=\"attention\")\n        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"sa_layer_norm\")\n\n        self.ffn = TFFFN(config, name=\"ffn\")\n        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"output_layer_norm\")\n\n    def call(self, inputs, training=False):  # removed: src_enc=None, src_len=None\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n        attn_mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: tf.Tensor(bs, seq_length, dim)\n            The output of the transformer block contextualization.\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        # Self-Attention\n        sa_output = self.attention([x, x, x, attn_mask, head_mask], training=training)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            # assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output, training=training)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass TFTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        
self.output_hidden_states = config.output_hidden_states\n\n        self.layer = [TFTransformerBlock(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layers)]\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: tf.Tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: tf.Tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[tf.Tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[tf.Tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module([hidden_state, attn_mask, head_mask[i]], training=training)\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass TFDistilBertMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFEmbeddings(config, name=\"embeddings\")  # Embeddings\n        self.transformer = TFTransformer(config, name=\"transformer\")  # Encoder\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def call(self, inputs, attention_mask=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = 
inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.ones(input_shape)  # (bs, seq_length)\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n\n        embedding_output = self.embeddings(input_ids, inputs_embeds=inputs_embeds)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer([embedding_output, attention_mask, head_mask], training=training)\n\n        return tfmr_output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass TFDistilBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    base_model_prefix = \"distilbert\"\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputing raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertModel(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")  # Embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertModel\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertModel.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n        return outputs\n\n\nclass TFDistilBertLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, 
input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.vocab_size = config.vocab_size\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.vocab_transform = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"vocab_transform\"\n        )\n        self.act = tf.keras.layers.Activation(gelu)\n        self.vocab_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"vocab_layer_norm\")\n        self.vocab_projector = TFDistilBertLMHead(config, self.distilbert.embeddings, name=\"vocab_projector\")\n\n    def get_output_embeddings(self):\n        return self.vocab_projector.input_embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForMaskedLM\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n    
    prediction_scores = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)\n\n        outputs = (prediction_logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.pre_classifier = tf.keras.layers.Dense(\n            config.dim,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"relu\",\n            name=\"pre_classifier\",\n        )\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForSequenceClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        
hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, num_labels)\n\n        outputs = (logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForTokenClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive 
question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n        assert config.num_labels == 2\n        self.dropout = tf.keras.layers.Dropout(config.qa_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForQuestionAnswering\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n        hidden_states = self.dropout(hidden_states, training=kwargs.get(\"training\", False))  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_electra.py",
    "content": "import logging\n\nimport tensorflow as tf\n\nfrom transformers import ElectraConfig\n\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel\nfrom .modeling_tf_utils import get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\nclass TFElectraEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.embedding_size = config.embedding_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.embedding_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFElectraDiscriminatorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dense = tf.keras.layers.Dense(config.hidden_size, name=\"dense\")\n        self.dense_prediction = tf.keras.layers.Dense(1, name=\"dense_prediction\")\n        self.config = config\n\n    def call(self, 
discriminator_hidden_states, training=False):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = ACT2FN[self.config.hidden_act](hidden_states)\n        logits = tf.squeeze(self.dense_prediction(hidden_states))\n\n        return logits\n\n\nclass TFElectraGeneratorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dense = tf.keras.layers.Dense(config.embedding_size, name=\"dense\")\n\n    def call(self, generator_hidden_states, training=False):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = ACT2FN[\"gelu\"](hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass TFElectraPreTrainedModel(TFBertPreTrainedModel):\n\n    config_class = ElectraConfig\n    base_model_prefix = \"electra\"\n\n    def get_extended_attention_mask(self, attention_mask, input_shape):\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask):\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.config.num_hidden_layers\n\n        return head_mask\n\n\nclass TFElectraMainLayer(TFElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFElectraEmbeddings(config, name=\"embeddings\")\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name=\"embeddings_project\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.config = config\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        
position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)\n        head_mask = self.get_head_mask(head_mask)\n\n        hidden_states = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states, training=training)\n\n        hidden_states = self.encoder([hidden_states, extended_attention_mask, head_mask], training=training)\n\n        return hidden_states\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraModel(TFElectraPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraModel\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraModel.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, 
:]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.electra(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a binary classification head on top as used during pre-training for identifying generated\ntokens.\n\nEven though both the discriminator and generator may be loaded into this model, the discriminator is\nthe only model of the two to have the correct classification head to be used for this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForPreTraining(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name=\"discriminator_predictions\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForPreTraining\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        logits = self.discriminator_predictions(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\nclass 
TFElectraMaskedLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states, training=False):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a language modeling head on top.\n\nEven though both the discriminator and generator may be loaded into this model, the generator is\nthe only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForMaskedLM(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.vocab_size = config.vocab_size\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.generator_predictions = TFElectraGeneratorPredictions(config, name=\"generator_predictions\")\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n        self.generator_lm_head = TFElectraMaskedLMHead(config, self.electra.embeddings, name=\"generator_lm_head\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForMaskedLM\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n        model = 
TFElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        generator_sequence_output = generator_hidden_states[0]\n        prediction_scores = self.generator_predictions(generator_sequence_output, training=training)\n        prediction_scores = self.generator_lm_head(prediction_scores, training=training)\n        output = (prediction_scores,)\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a token classification head on top.\n\nBoth the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForTokenClassification(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(config.num_labels, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForTokenClassification\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, 
position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Flaubert model.\n\"\"\"\n\nimport logging\nimport random\n\nimport tensorflow as tf\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_xlm import (\n    TFXLMForSequenceClassification,\n    TFXLMMainLayer,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n    get_masks,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertModel(TFXLMModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\nclass TFFlaubertMainLayer(TFXLMMainLayer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and 
inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n     
   attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](\n                    [tensor_normalized, attn_mask, None, cache, head_mask[i]], training=training\n                )\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length (use shape_list: tf.Tensor has no .size() method)\n        if cache is not None:\n            cache[\"slen\"] += shape_list(tensor)[1]\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertWithLMHeadModel(TFXLMWithLMHeadModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertForSequenceClassification(TFXLMForSequenceClassification):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT-2 model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n 
           w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            key = tf.concat([past_key, key], axis=-2)\n            value = tf.concat([past_value, value], axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack([key, value], axis=0)\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.attn = 
TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        a = self.ln_1(x)\n        output_attn = self.attn([a, layer_past, attention_mask, head_mask, use_cache], training=training)\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n        x = x + a\n\n        m = self.ln_2(x)\n        m = self.mlp(m, training=training)\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\n@keras_serializable\nclass TFGPT2MainLayer(tf.keras.layers.Layer):\n    config_class = GPT2Config\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.wte = TFSharedEmbeddings(\n            config.vocab_size, config.hidden_size, initializer_range=config.initializer_range, name=\"wte\"\n        )\n        self.wpe = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"wpe\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n        self.ln_f = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_f\")\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", 
position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids, mode=\"embedding\")\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.wte(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + 
[shape_list(hidden_states)[-1]]\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n\n            hidden_states, present = outputs[:2]\n            presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, presents, (all hidden_states), (attentions)\n\n\nclass TFGPT2PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    base_model_prefix = \"transformer\"\n\n\nGPT2_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputing raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2Model(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2Model\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2Model.from_pretrained('gpt2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n    \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2LMHeadModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2DoubleHeadsModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        use_cache=True,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = tf.constant(encoded_choices)[None, :]  # Batch size: 1, number of choices: 2\n        mc_token_ids = tf.constant([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            mc_token_ids = inputs[7] if len(inputs) > 7 else mc_token_ids\n            use_cache = inputs[8] if len(inputs) > 8 else use_cache\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            past,\n            flat_attention_mask,\n            flat_token_type_ids,\n            
flat_position_ids,\n            head_mask,\n            inputs_embeds,\n            use_cache,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.math.sigmoid(x)\n\n\nACT_FNS = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = 
tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n            w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.attn = TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        output_attn = self.attn([x, attention_mask, head_mask], training=training)\n        a = output_attn[0]  # output_attn: a, (attentions)\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n, training=training)\n        h = self.ln_2(n + m)\n\n        outputs = 
[h] + output_attn[1:]\n        return outputs  # x, (attentions)\n\n\nclass TFOpenAIGPTMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.tokens_embed = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"tokens_embed\"\n        )\n        self.positions_embed = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"positions_embed\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            position_ids = tf.range(input_shape[-1], dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor 
mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids, mode=\"embedding\")\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.tokens_embed(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n\n        all_attentions = []\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions.append(outputs[1])\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last hidden state, (all hidden_states), (attentions)\n\n\nclass TFOpenAIGPTPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    base_model_prefix = \"transformer\"\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputing raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", 
add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTLMHeadModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTLMHeadModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTDoubleHeadsModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTDoubleHeadsModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = tf.constant([tokenizer.encode(s) for s in choices])[None, :]  # Batch size 1, 2 choices\n        mc_token_ids = tf.constant([input_ids.size(-1), input_ids.size(-1)])[None, :]  # Batch size 1\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            mc_token_ids = inputs[6] if len(inputs) > 6 else mc_token_ids\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = 
self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_pytorch_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch - TF 2.0 general utilities.\"\"\"\n\n\nimport logging\nimport os\nimport re\n\nimport numpy\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=\"\"):\n    \"\"\" Convert a TF 2.0 model variable name in a pytorch model weight name.\n\n        Conventions for TF2.0 scopes -> PyTorch attribute names conversions:\n            - '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n            - '_._' is replaced by a new level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n\n        return tuple with:\n            - pytorch model weight name\n            - transpose: boolean indicating weither TF2.0 and PyTorch weights matrices are transposed with regards to each other\n    \"\"\"\n    tf_name = tf_name.replace(\":0\", \"\")  # device ids\n    tf_name = re.sub(\n        r\"/[^/]*___([^/]*)/\", r\"/\\1/\", tf_name\n    )  # '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n    tf_name = tf_name.replace(\n        \"_._\", \"/\"\n    )  # '_._' is replaced by a level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n    tf_name = re.sub(r\"//+\", \"/\", tf_name)  # Remove empty levels at the end\n    tf_name = tf_name.split(\"/\")  # Convert from TF2.0 '/' separators to PyTorch '.' separators\n    tf_name = tf_name[1:]  # Remove level zero\n\n    # When should we transpose the weights\n    transpose = bool(tf_name[-1] == \"kernel\" or \"emb_projs\" in tf_name or \"out_projs\" in tf_name)\n\n    # Convert standard TF2.0 names in PyTorch names\n    if tf_name[-1] == \"kernel\" or tf_name[-1] == \"embeddings\" or tf_name[-1] == \"gamma\":\n        tf_name[-1] = \"weight\"\n    if tf_name[-1] == \"beta\":\n        tf_name[-1] = \"bias\"\n\n    # Remove prefix if needed\n    tf_name = \".\".join(tf_name)\n    if start_prefix_to_remove:\n        tf_name = tf_name.replace(start_prefix_to_remove, \"\", 1)\n\n    return tf_name, transpose\n\n\n#####################\n# PyTorch => TF 2.0 #\n#####################\n\n\ndef load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    pt_path = os.path.abspath(pytorch_checkpoint_path)\n    logger.info(\"Loading PyTorch weights from {}\".format(pt_path))\n\n    pt_state_dict = torch.load(pt_path, map_location=\"cpu\")\n    logger.info(\"PyTorch checkpoint contains {:,} parameters\".format(sum(t.numel() for t in pt_state_dict.values())))\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    pt_state_dict = pt_model.state_dict()\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch state_dict in a TF 2.0 model.\n    \"\"\"\n    try:\n        import torch  # noqa: F401\n        import tensorflow as tf  # noqa: F401\n        from tensorflow.python.keras import backend as K\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    # Adapt state dict - TODO remove this and update the AWS weights files instead\n    # Convert old format to new format if needed from a PyTorch state_dict\n    old_keys = []\n    new_keys = []\n    for key in pt_state_dict.keys():\n        new_key = None\n        if \"gamma\" in key:\n            new_key = key.replace(\"gamma\", \"weight\")\n        if \"beta\" in key:\n            new_key = key.replace(\"beta\", \"bias\")\n        if new_key:\n            old_keys.append(key)\n            new_keys.append(new_key)\n    for old_key, new_key in zip(old_keys, new_keys):\n        pt_state_dict[new_key] = pt_state_dict.pop(old_key)\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(tf_model.base_model_prefix) for s in pt_state_dict.keys()):\n        start_prefix_to_remove = tf_model.base_model_prefix + \".\"\n\n    symbolic_weights = tf_model.trainable_weights + tf_model.non_trainable_weights\n    tf_loaded_numel = 0\n    weight_value_tuples = []\n    all_pytorch_weights = set(list(pt_state_dict.keys()))\n    for symbolic_weight in symbolic_weights:\n        sw_name = symbolic_weight.name\n        name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            sw_name, start_prefix_to_remove=start_prefix_to_remove\n        )\n\n        # Find associated numpy array in pytorch model state dict\n        if name not in pt_state_dict:\n            if allow_missing_keys:\n                continue\n\n            raise AttributeError(\"{} not found in PyTorch model\".format(name))\n\n        array = pt_state_dict[name].numpy()\n\n        if 
transpose:\n            array = numpy.transpose(array)\n\n        if len(symbolic_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(symbolic_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(symbolic_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (symbolic_weight.shape, array.shape)\n            raise e\n\n        tf_loaded_numel += array.size\n        # logger.warning(\"Initialize TF weight {}\".format(symbolic_weight.name))\n\n        weight_value_tuples.append((symbolic_weight, array))\n        all_pytorch_weights.discard(name)\n\n    K.batch_set_value(weight_value_tuples)\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure restore ops are run\n\n    logger.info(\"Loaded {:,} parameters in the TF 2.0 model.\".format(tf_loaded_numel))\n\n    logger.info(\"Weights or buffers not loaded from PyTorch model: {}\".format(all_pytorch_weights))\n\n    return tf_model\n\n\n#####################\n# TF 2.0 => PyTorch #\n#####################\n\n\ndef load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 HDF5 checkpoint in a PyTorch model\n        We use HDF5 to easily do transfer learning\n        (see https://github.com/tensorflow/tensorflow/blob/ee16fcac960ae660e0e4496658a366e2f745e1f0/tensorflow/python/keras/engine/network.py#L1352-L1357).\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    import transformers\n\n    logger.info(\"Loading TensorFlow weights from {}\".format(tf_checkpoint_path))\n\n    # Instantiate and load the associated TF 2.0 model\n    tf_model_class_name = \"TF\" + pt_model.__class__.__name__  # Add \"TF\" at the beggining\n    tf_model_class = getattr(transformers, tf_model_class_name)\n    tf_model = tf_model_class(pt_model.config)\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    tf_model.load_weights(tf_checkpoint_path, by_name=True)\n\n    return load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 model in a pytorch model\n    \"\"\"\n    weights = tf_model.weights\n\n    return load_tf2_weights_in_pytorch_model(pt_model, weights, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_weights_in_pytorch_model(pt_model, tf_weights, allow_missing_keys=False):\n    \"\"\" Load TF2.0 symbolic weights in a PyTorch model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    new_pt_params_dict = {}\n    current_pt_params_dict = dict(pt_model.named_parameters())\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(pt_model.base_model_prefix) for s in current_pt_params_dict.keys()):\n        start_prefix_to_remove = pt_model.base_model_prefix + \".\"\n\n    # Build a map from potential PyTorch weight names to TF 2.0 Variables\n    tf_weights_map = {}\n    for tf_weight in tf_weights:\n        pt_name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            tf_weight.name, start_prefix_to_remove=start_prefix_to_remove\n        )\n        tf_weights_map[pt_name] = (tf_weight.numpy(), transpose)\n\n    all_tf_weights = set(list(tf_weights_map.keys()))\n    loaded_pt_weights_data_ptr = {}\n    missing_keys_pt = []\n    for pt_weight_name, pt_weight in current_pt_params_dict.items():\n        # Handle PyTorch shared weight ()not duplicated in TF 2.0\n        if pt_weight.data_ptr() in loaded_pt_weights_data_ptr:\n            new_pt_params_dict[pt_weight_name] = loaded_pt_weights_data_ptr[pt_weight.data_ptr()]\n            continue\n\n        # Find associated numpy array in pytorch model state dict\n        if pt_weight_name not in tf_weights_map:\n            if allow_missing_keys:\n                missing_keys_pt.append(pt_weight_name)\n                continue\n\n            raise AttributeError(\"{} not found in TF 2.0 model\".format(pt_weight_name))\n\n        array, transpose = tf_weights_map[pt_weight_name]\n\n        if transpose:\n            array = numpy.transpose(array)\n\n        if len(pt_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(pt_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(pt_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (pt_weight.shape, array.shape)\n            raise e\n\n        # logger.warning(\"Initialize PyTorch weight {}\".format(pt_weight_name))\n\n        new_pt_params_dict[pt_weight_name] = torch.from_numpy(array)\n        loaded_pt_weights_data_ptr[pt_weight.data_ptr()] = torch.from_numpy(array)\n        all_tf_weights.discard(pt_weight_name)\n\n    missing_keys, unexpected_keys = pt_model.load_state_dict(new_pt_params_dict, strict=False)\n    missing_keys += missing_keys_pt\n\n    if len(missing_keys) > 0:\n        logger.info(\n            \"Weights of {} not initialized from TF 2.0 model: {}\".format(pt_model.__class__.__name__, missing_keys)\n        )\n    if len(unexpected_keys) > 0:\n        logger.info(\n            \"Weights from TF 2.0 model not used in {}: {}\".format(pt_model.__class__.__name__, unexpected_keys)\n        )\n\n    logger.info(\"Weights or buffers not loaded from TF 2.0 model: {}\".format(all_tf_weights))\n\n    return pt_model\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass TFRobertaEmbeddings(TFBertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.padding_idx = 1\n\n    def create_position_ids_from_input_ids(self, x):\n        \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n        padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n        `utils.make_positions`.\n        :param tf.Tensor x:\n        :return tf.Tensor:\n        \"\"\"\n        mask = tf.cast(tf.math.not_equal(x, self.padding_idx), dtype=tf.int32)\n        incremental_indicies = tf.math.cumsum(mask, axis=1) * mask\n        return incremental_indicies + self.padding_idx\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. We cannot infer which are padded so just generate\n        sequential position ids.\n        :param tf.Tensor inputs_embeds:\n        :return tf.Tensor:\n        \"\"\"\n        seq_length = shape_list(inputs_embeds)[1]\n\n        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]\n        return position_ids\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. 
Any padded tokens remain padded.\n                position_ids = self.create_position_ids_from_input_ids(input_ids)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super()._embedding([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n\nclass TFRobertaMainLayer(TFBertMainLayer):\n    \"\"\"\n    Same as TFBertMainLayer but uses TFRobertaEmbeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFRobertaEmbeddings(config, name=\"embeddings\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n\nclass TFRobertaPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputing raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaModel(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaModel\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaModel.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n        return outputs\n\n\nclass TFRobertaLMHead(tf.keras.layers.Layer):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.act = tf.keras.layers.Activation(gelu)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, features):\n        x = self.dense(features)\n        x = self.act(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x, mode=\"linear\") + self.bias\n\n        return x\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. 
\"\"\", ROBERTA_START_DOCSTRING)\nclass TFRobertaForMaskedLM(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.lm_head = TFRobertaLMHead(config, self.roberta.embeddings, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForMaskedLM\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\nclass TFRobertaClassificationHead(tf.keras.layers.Layer):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.out_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"out_proj\"\n        )\n\n    def call(self, features, training=False):\n        x = features[:, 0, :]  # take <s> token (equiv. 
to [CLS])\n        x = self.dropout(x, training=training)\n        x = self.dense(x)\n        x = self.dropout(x, training=training)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForSequenceClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.classifier = TFRobertaClassificationHead(config, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForSequenceClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (logits,) + outputs[2:]\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForTokenClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForTokenClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint roberta-base is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForQuestionAnswering\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForQuestionAnswering.from_pretrained('roberta-base')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 T5 model. \"\"\"\n\n\nimport copy\nimport itertools\nimport logging\nimport math\n\nimport tensorflow as tf\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_T5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n####################################################\n# TF 2.0 Models are constructed using Keras imperative API by sub-classing\n# - tf.keras.layers.Layer for the layers and\n# - TFPreTrainedModel for the models (it-self a sub-class of tf.keras.Model)\n####################################################\n\n\nclass TFT5LayerNorm(tf.keras.layers.Layer):\n    def __init__(self, epsilon=1e-6, **kwargs):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__(**kwargs)\n        self.variance_epsilon = epsilon\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        self.weight = self.add_weight(\"weight\", shape=(input_shape[-1],), initializer=\"ones\")\n        super().build(input_shape)\n\n    def call(self, x):\n        variance = tf.math.reduce_mean(tf.math.square(x), axis=-1, keepdims=True)\n        x = x * tf.math.rsqrt(variance + self.variance_epsilon)\n        return self.weight * x\n\n\nclass TFT5DenseReluDense(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.wi = tf.keras.layers.Dense(config.d_ff, use_bias=False, name=\"wi\")\n        self.wo = tf.keras.layers.Dense(config.d_model, use_bias=False, name=\"wo\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n        self.act = tf.keras.activations.relu\n\n    def call(self, hidden_states, training=False):\n        h = self.wi(hidden_states)\n        h = self.act(h)\n        h = self.dropout(h, training=training)\n        h = self.wo(h)\n        return h\n\n\nclass TFT5LayerFF(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.DenseReluDense = TFT5DenseReluDense(config, name=\"DenseReluDense\")\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(self, hidden_states, training=False):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x, training=training)\n        layer_output = 
hidden_states + self.dropout(y, training=training)\n        return layer_output\n\n\nclass TFT5Attention(tf.keras.layers.Layer):\n    NEW_ID = itertools.count()\n\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_id = next(TFT5Attention.NEW_ID)\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"q\")\n        self.k = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"k\")\n        self.v = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"v\")\n        self.o = tf.keras.layers.Dense(self.d_model, use_bias=False, name=\"o\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = tf.keras.layers.Embedding(\n                self.relative_attention_num_buckets, self.n_heads, name=\"relative_attention_bias\",\n            )\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  
This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += tf.dtypes.cast(tf.math.less(n, 0), tf.int32) * num_buckets\n            n = tf.math.abs(n)\n        else:\n            n = tf.math.maximum(n, 0)\n        # now n is in the range [0, inf)\n        max_exact = num_buckets // 2\n        is_small = tf.math.less(n, max_exact)\n        val_if_large = max_exact + tf.dtypes.cast(\n            tf.math.log(tf.dtypes.cast(n, tf.float32) / max_exact)\n            / math.log(max_distance / max_exact)\n            * (num_buckets - max_exact),\n            tf.int32,\n        )\n        val_if_large = tf.math.minimum(val_if_large, num_buckets - 1)\n        ret += tf.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = tf.range(qlen)[:, None]\n        memory_position = tf.range(klen)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position, bidirectional=not self.is_decoder, num_buckets=self.relative_attention_num_buckets,\n        )\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = tf.expand_dims(tf.transpose(values, [2, 0, 1]), axis=0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def call(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        position_bias=None,\n        cache=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = shape_list(input)\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. 
Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + shape_list(past_key_value_state[0])[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = shape_list(kv)[1]\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, self.d_kv)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.inner_dim))\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = tf.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass TFT5LayerSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, 
**kwargs):\n        super().__init__(**kwargs)\n        self.SelfAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"SelfAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5LayerCrossAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.EncDecAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"EncDecAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            query_length=query_length,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5Block(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.is_decoder = config.is_decoder\n        self.layer = []\n        self.layer.append(\n            TFT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._0\",)\n        )\n        if self.is_decoder:\n            self.layer.append(\n                TFT5LayerCrossAttention(\n                    config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._1\",\n                )\n            )\n\n        self.layer.append(TFT5LayerFF(config, name=\"layer_._{}\".format(len(self.layer))))\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        
encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. 
Need to inject it here\n            if present_key_value_state is not None:\n                query_length = shape_list(present_key_value_state[0])[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n                training=training,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states, training=training)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass _NoLayerEmbedTokens(object):\n    \"\"\"\n     this class wraps a the TFSharedEmbeddingTokens layer into a python 'no-keras-layer'\n     class to avoid problem with weight restoring. Also it makes sure that the layer is\n     called from the correct scope to avoid problem with saving/storing the correct weights\n    \"\"\"\n\n    def __init__(self, layer, abs_scope_name=None):\n        self._layer = layer\n        self._abs_scope_name = abs_scope_name\n\n    def call(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer.call(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer.call(inputs, mode)\n\n    def __call__(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer(inputs, mode)\n\n\n####################################################\n# The full model without a specific pretrained or finetuning head is\n# provided as a tf.keras.layers.Layer usually called \"TFT5MainLayer\"\n####################################################\nclass TFT5MainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, embed_tokens=None, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.config = config\n        self.num_hidden_layers = config.num_layers\n\n        self.block = [\n            TFT5Block(config, has_relative_attention_bias=bool(i == 0), name=\"block_._{}\".format(i),)\n            for i in range(config.num_layers)\n        ]\n        self.final_layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"final_layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_embed_tokens(self, embed_tokens):\n        self.embed_tokens = embed_tokens\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError  # Not implemented yet in the library for TF 2.0 models\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError  # Not implemented yet in the library for TF 2.0 models\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if inputs is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both inputs and inputs_embeds at the same time\")\n        elif inputs is not None:\n            input_shape = shape_list(inputs)\n            inputs = tf.reshape(inputs, (-1, input_shape[-1]))\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either inputs or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to initialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(inputs)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_states\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = shape_list(past_key_value_states[0][0])[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = tf.fill((batch_size, mask_seq_length), 1)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:\n            encoder_seq_length = shape_list(encoder_hidden_states)[1]\n            encoder_attention_mask = tf.fill((batch_size, encoder_seq_length), 1)\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n        num_dims_attention_mask = len(shape_list(attention_mask))\n        if 
num_dims_attention_mask == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif num_dims_attention_mask == 2:\n            # Provided a padding mask of dimensions [batch_size, mask_seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            if self.is_decoder:\n                seq_ids = tf.range(mask_seq_length)\n                causal_mask = tf.less_equal(\n                    tf.tile(seq_ids[None, None, :], (batch_size, mask_seq_length, 1)), seq_ids[None, :, None],\n                )\n                causal_mask = tf.cast(causal_mask, dtype=tf.float32)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n                if past_key_value_states[0] is not None:\n                    extended_attention_mask = extended_attention_mask[:, :, -1:, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n        # extended_attention_mask = tf.math.equal(extended_attention_mask,\n        #                                         tf.transpose(extended_attention_mask, perm=(-1, -2)))\n\n        extended_attention_mask = (1.0 - extended_attention_mask) * -1e9\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            # If a 2D or 3D attention mask is provided for the cross-attention\n            # we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]\n            encoder_attention_mask = tf.cast(encoder_attention_mask, dtype=tf.float32)\n            num_dims_encoder_attention_mask = len(shape_list(encoder_attention_mask))\n            if num_dims_encoder_attention_mask == 3:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n            if num_dims_encoder_attention_mask == 2:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n\n            # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition\n            # Cf. 
https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n            # encoder_extended_attention_mask = tf.math.equal(encoder_extended_attention_mask,\n            #                                         tf.transpose(encoder_extended_attention_mask, perm=(-1, -2)))\n\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds, training=training)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n                training=training,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if 
use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n####################################################\n# TFT5PreTrainedModel is a sub-class of tf.keras.Model\n# which take care of loading and saving pretrained weights\n# and various common utilities.\n# Here you just need to specify a few (self-explanatory)\n# pointers for your model.\n####################################################\nclass TFT5PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        inputs = tf.constant(DUMMY_INPUTS)\n        input_mask = tf.constant(DUMMY_MASK)\n        dummy_inputs = {\n            \"inputs\": inputs,\n            \"decoder_input_ids\": inputs,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a tf.keras.Model `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. 
_`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    Note on the model inputs:\n        TF 2.0 models accept two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional argument.\n\n        This second option is useful when using the `tf.keras.Model.fit()` method, which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:\n\n        - a single Tensor with inputs only and nothing else: `model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n            `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n            `model({'inputs': inputs, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        inputs are usually used as a `dict` (see T5 description above for more information) containing all the following.\n\n        inputs (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on\n            the right or the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            To know more on how to prepare :obj:`inputs` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n        attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `inputs` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is 
**masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass TFT5Model(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5Model\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5Model.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my 
dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass TFT5ForConditionalGeneration(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.model_dim = config.d_model\n\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"tf\") 
 # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        prediction_scores = outputs[0]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        model.generate(inputs)\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0] * (self.model_dim ** -0.5)\n        embed_tokens = self.get_output_embeddings()\n        lm_logits = embed_tokens(sequence_output, mode=\"linear\")\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, inputs, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"inputs\": None,  # inputs don't have to be defined, but still need to be passed to make 
Keras.layer.__call__ happy\n            \"decoder_input_ids\": inputs,  # inputs are the decoder_input_ids\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (tf.gather(layer_past_state, beam_idx),)\n\n            assert shape_list(reordered_layer_past_states[0]) == shape_list(layer_past_states[0])\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Transformer XL model.\n\"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\nclass TFPositionalEmbedding(tf.keras.layers.Layer):\n    def __init__(self, demb, **kwargs):\n        super().__init__(**kwargs)\n\n        self.inv_freq = 1 / (10000 ** (tf.range(0, demb, 2.0) / demb))\n\n    def call(self, pos_seq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,j->ij\", pos_seq, self.inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], -1)\n\n        if bsz is not None:\n            return tf.tile(pos_emb[:, None, :], [1, bsz, 1])\n        else:\n            return pos_emb[:, None, :]\n\n\nclass TFPositionwiseFF(tf.keras.layers.Layer):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5, init_std=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.layer_1 = tf.keras.layers.Dense(\n            d_inner, kernel_initializer=get_initializer(init_std), activation=tf.nn.relu, name=\"CoreNet_._0\"\n        )\n        self.drop_1 = tf.keras.layers.Dropout(dropout)\n        self.layer_2 = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(init_std), name=\"CoreNet_._3\")\n        self.drop_2 = tf.keras.layers.Dropout(dropout)\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.pre_lnorm = pre_lnorm\n\n    def call(self, inp, training=False):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.layer_norm(inp)\n            core_out = self.layer_1(core_out)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = self.drop_2(core_out, training=training)\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.layer_1(inp)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = 
self.drop_2(core_out, training=training)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass TFRelPartialLearnableMultiHeadAttn(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = tf.keras.layers.Dense(\n            3 * n_head * d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"qkv_net\"\n        )\n\n        self.drop = tf.keras.layers.Dropout(dropout)\n        self.dropatt = tf.keras.layers.Dropout(dropatt)\n        self.o_net = tf.keras.layers.Dense(\n            d_model, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"o_net\"\n        )\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is not None and r_w_bias is not None:  # Biases are shared\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n        else:\n            self.r_r_bias = None\n            self.r_w_bias = None\n\n        self.r_net = tf.keras.layers.Dense(\n            self.n_head * self.d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"r_net\"\n        )\n\n    def build(self, input_shape):\n        if self.r_r_bias is None or self.r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n        super().build(input_shape)\n\n    def _rel_shift(self, x):\n        x_size = shape_list(x)\n\n        x = tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])\n        x = tf.reshape(x, [x_size[1] + 1, x_size[0], x_size[2], x_size[3]])\n        x = tf.slice(x, [1, 0, 0, 0], [-1, -1, -1, -1])\n        x = tf.reshape(x, x_size)\n\n        return x\n\n    def call(self, inputs, training=False):\n        w, r, attn_mask, mems, head_mask = inputs\n        qlen, rlen, bsz = shape_list(w)[0], shape_list(r)[0], shape_list(w)[1]\n\n        if mems is not None:\n            cat = tf.concat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, axis=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, 
axis=-1)\n\n        klen = shape_list(w_head_k)[0]\n\n        w_head_q = tf.reshape(w_head_q, (qlen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_k = tf.reshape(w_head_k, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_v = tf.reshape(w_head_v, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n\n        r_head_k = tf.reshape(r_head_k, (rlen, self.n_head, self.d_head))  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = tf.einsum(\"ibnd,jbnd->ijbn\", rw_head_q, w_head_k)  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = tf.einsum(\"ibnd,jnd->ijbn\", rr_head_q, r_head_k)  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score = attn_score * self.scale\n\n        # compute attention probability\n        if attn_mask is not None:\n            attn_mask_t = attn_mask[:, :, None, None]\n            attn_score = attn_score * (1 - attn_mask_t) - 1e30 * attn_mask_t\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n        attn_prob = self.dropatt(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, w_head_v)\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec_sizes = shape_list(attn_vec)\n        attn_vec = tf.reshape(attn_vec, (attn_vec_sizes[0], attn_vec_sizes[1], self.n_head * self.d_head))\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out, training=training)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass TFRelPartialLearnableDecoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        d_inner,\n        dropout,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        dropatt=0.0,\n        pre_lnorm=False,\n        r_w_bias=None,\n        r_r_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.dec_attn = TFRelPartialLearnableMultiHeadAttn(\n            n_head,\n            d_model,\n            d_head,\n            dropout,\n            tgt_len=tgt_len,\n            ext_len=ext_len,\n            mem_len=mem_len,\n            dropatt=dropatt,\n            pre_lnorm=pre_lnorm,\n            r_w_bias=r_w_bias,\n            r_r_bias=r_r_bias,\n            init_std=init_std,\n            output_attentions=output_attentions,\n            layer_norm_epsilon=layer_norm_epsilon,\n            name=\"dec_attn\",\n        )\n        self.pos_ff = TFPositionwiseFF(\n            d_model,\n            d_inner,\n            dropout,\n            pre_lnorm=pre_lnorm,\n            init_std=init_std,\n            layer_norm_epsilon=layer_norm_epsilon,\n            
name=\"pos_ff\",\n        )\n\n    def call(self, inputs, training=False):\n        dec_inp, r, dec_attn_mask, mems, head_mask = inputs\n        attn_outputs = self.dec_attn([dec_inp, r, dec_attn_mask, mems, head_mask], training=training)\n        ff_output = self.pos_ff(attn_outputs[0], training=training)\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass TFAdaptiveEmbedding(tf.keras.layers.Layer):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, init_std=0.02, sample_softmax=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.init_std = init_std\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = []\n        self.emb_projs = []\n        if div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(\n                    tf.keras.layers.Embedding(\n                        r_idx - l_idx,\n                        d_emb_i,\n                        embeddings_initializer=get_initializer(init_std),\n                        name=\"emb_layers_._{}\".format(i),\n                    )\n                )\n\n    def build(self, input_shape):\n        for i in range(len(self.cutoffs)):\n            d_emb_i = self.d_embed // (self.div_val ** i)\n            self.emb_projs.append(\n                self.add_weight(\n                    shape=(d_emb_i, self.d_proj),\n                    initializer=get_initializer(self.init_std),\n                    trainable=True,\n                    name=\"emb_projs_._{}\".format(i),\n                )\n            )\n        super().build(input_shape)\n\n    def call(self, inp):\n        if self.div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            inp_flat = tf.reshape(inp, (-1,))\n            emb_flat = tf.zeros([shape_list(inp_flat)[0], self.d_proj])\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n\n                inp_i = tf.boolean_mask(inp_flat, mask_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = tf.einsum(\"id,de->ie\", emb_i, self.emb_projs[i])\n\n                mask_idx = tf.cast(tf.where(mask_i), dtype=tf.int64)\n                emb_flat += tf.scatter_nd(mask_idx, emb_i, tf.cast(shape_list(emb_flat), dtype=tf.int64))\n\n            embed_shape = shape_list(inp) + [self.d_proj]\n            embed = tf.reshape(emb_flat, embed_shape)\n\n        embed *= self.emb_scale\n\n        return embed\n\n\n@keras_serializable\nclass TFTransfoXLMainLayer(tf.keras.layers.Layer):\n    config_class = TransfoXLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.untie_r = config.untie_r\n\n        self.word_emb = TFAdaptiveEmbedding(\n            config.vocab_size,\n            config.d_embed,\n            config.d_model,\n            config.cutoffs,\n            div_val=config.div_val,\n            init_std=config.init_std,\n            name=\"word_emb\",\n        )\n\n        self.drop = tf.keras.layers.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        self.layers = []\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    TFRelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if self.untie_r else self.r_w_bias,\n                        r_r_bias=None if self.untie_r else self.r_r_bias,\n                        output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                        init_std=config.init_std,\n                        name=\"layers_._{}\".format(i),\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = TFPositionalEmbedding(self.d_model, name=\"pos_emb\")\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n    def build(self, input_shape):\n        if not self.untie_r:\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n        super().build(input_shape)\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        return self.word_emb\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        raise NotImplementedError\n\n    def init_mems(self, bsz):\n        if self.mem_len 
> 0:\n            mems = []\n            for i in range(self.n_layer):\n                empty = tf.zeros([self.mem_len, bsz, self.d_model])\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        new_mems = []\n        end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n        beg_idx = max(0, end_idx - self.mem_len)\n        for i in range(len(hids)):\n\n            cat = tf.concat([mems[i], hids[i]], axis=0)\n            tf.stop_gradient(cat)\n            new_mems.append(cat[beg_idx:end_idx])\n\n        return new_mems\n\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = shape_list(mems[0])[0] if mems is not None else 0\n        
klen = mlen + qlen\n\n        attn_mask = tf.ones([qlen, qlen])\n        mask_u = tf.linalg.band_part(attn_mask, 0, -1)\n        mask_dia = tf.linalg.band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen])\n        dec_attn_mask = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.linalg.band_part(attn_mask, -1, 0)\n            dec_attn_mask = tf.concat([dec_attn_mask[:, :qlen] + mask_l - mask_dia, dec_attn_mask[:, qlen:]], 1)\n        # ::: PyTorch masking code for reference :::\n        # if self.same_length:\n        #     all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n        #     mask_len = klen - self.mem_len\n        #     if mask_len > 0:\n        #         mask_shift_len = qlen - mask_len\n        #     else:\n        #         mask_shift_len = qlen\n        #     dec_attn_mask = (torch.triu(all_ones, 1+mlen)\n        #             + torch.tril(all_ones, -mask_shift_len))[:, :, None] # -1\n        # else:\n        #     dec_attn_mask = torch.triu(\n        #         word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1+mlen)[:,:,None]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = tf.range(klen - 1, -1, -1.0)\n            if self.clamp_len > 0:\n                pos_seq = tf.minimum(pos_seq, self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb, training=training)\n            pos_emb = self.drop(pos_emb, training=training)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer([core_out, pos_emb, dec_attn_mask, mems_i, head_mask[i]], training=training)\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out, training=training)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [tf.transpose(core_out, perm=(1, 0, 2)), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(tf.transpose(t, perm=(1, 0, 2)) for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs.append(attentions)\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\nclass TFTransfoXLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = TransfoXLConfig\n    base_model_prefix = \"transformer\"\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFTransfoXLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLLMHeadModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n        self.sample_softmax = config.sample_softmax\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = TFAdaptiveSoftmaxMask(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val, name=\"crit\"\n        )\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if len(self.crit.out_layers) > 0:\n            return self.crit.out_layers[-1]\n        return None\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, labels=None, training=False):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLLMHeadModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            labels = inputs[4] if len(inputs) > 4 else labels\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = 
inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            labels = inputs.get(\"labels\", labels)\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            bsz, tgt_len = shape_list(input_ids)[:2]\n        else:\n            bsz, tgt_len = shape_list(inputs_embeds)[:2]\n\n        transformer_outputs = self.transformer([input_ids, mems, head_mask, inputs_embeds], training=training)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit([pred_hid, labels], training=training)\n        outputs = [softmax_output] + outputs\n\n        return outputs  # logits, new_mems, (all hidden states), (all attentions)\n\n    def prepare_inputs_for_generation(self, inputs, past, **model_kwargs):\n        inputs = {\"inputs\": inputs}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" A TF 2.0 Adaptive Softmax for Transformer XL model.\n\"\"\"\n\n\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import shape_list\n\n\nclass TFAdaptiveSoftmaxMask(tf.keras.layers.Layer):\n    def __init__(self, vocab_size, d_embed, d_proj, cutoffs, div_val=1, keep_order=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [vocab_size]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n        self.keep_order = keep_order\n\n        self.out_layers = []\n        self.out_projs = []\n\n    def build(self, input_shape):\n        if self.n_clusters > 0:\n            self.cluster_weight = self.add_weight(\n                shape=(self.n_clusters, self.d_embed), initializer=\"zeros\", trainable=True, name=\"cluster_weight\"\n            )\n            self.cluster_bias = self.add_weight(\n                shape=(self.n_clusters,), initializer=\"zeros\", trainable=True, name=\"cluster_bias\"\n            )\n\n        if self.div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if self.d_proj != self.d_embed:\n                    weight = self.add_weight(\n                        shape=(self.d_embed, self.d_proj),\n                        initializer=\"zeros\",\n                        trainable=True,\n                        name=\"out_projs_._{}\".format(i),\n                    )\n                    self.out_projs.append(weight)\n                else:\n                    self.out_projs.append(None)\n                weight = self.add_weight(\n                    shape=(self.vocab_size, self.d_embed,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(self.vocab_size,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = self.d_embed // (self.div_val ** i)\n\n                weight = self.add_weight(\n                    shape=(d_emb_i, self.d_proj), initializer=\"zeros\", trainable=True, name=\"out_projs_._{}\".format(i)\n                )\n                
self.out_projs.append(weight)\n                weight = self.add_weight(\n                    shape=(r_idx - l_idx, d_emb_i,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(r_idx - l_idx,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        super().build(input_shape)\n\n    @staticmethod\n    def _logit(x, W, b, proj=None):\n        y = x\n        if proj is not None:\n            y = tf.einsum(\"ibd,ed->ibe\", y, proj)\n        return tf.einsum(\"ibd,nd->ibn\", y, W) + b\n\n    @staticmethod\n    def _gather_logprob(logprob, target):\n        lp_size = shape_list(logprob)\n        r = tf.range(lp_size[0])\n        idx = tf.stack([r, target], 1)\n        return tf.gather_nd(logprob, idx)\n\n    def call(self, inputs, return_mean=True, training=False):\n        hidden, target = inputs\n        head_logprob = 0\n        if self.n_clusters == 0:\n            output = self._logit(hidden, self.out_layers[0][0], self.out_layers[0][1], self.out_projs[0])\n            if target is not None:\n                loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=output)\n            out = tf.nn.log_softmax(output, axis=-1)\n        else:\n            hidden_sizes = shape_list(hidden)\n            out = []\n            loss = tf.zeros(hidden_sizes[:2], dtype=tf.float32)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                if target is not None:\n                    mask = (target >= l_idx) & (target < r_idx)\n                    mask_idx = tf.where(mask)\n                    cur_target = tf.boolean_mask(target, mask) - l_idx\n\n                if self.div_val == 1:\n                    cur_W = self.out_layers[0][0][l_idx:r_idx]\n                    cur_b = self.out_layers[0][1][l_idx:r_idx]\n                else:\n                    cur_W = self.out_layers[i][0]\n                    cur_b = self.out_layers[i][1]\n\n                if i == 0:\n                    cur_W = tf.concat([cur_W, self.cluster_weight], 0)\n                    cur_b = tf.concat([cur_b, self.cluster_bias], 0)\n\n                    head_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[0])\n                    head_logprob = tf.nn.log_softmax(head_logit)\n                    out.append(head_logprob[..., : self.cutoffs[0]])\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_logprob = self._gather_logprob(cur_head_logprob, cur_target)\n                else:\n                    tail_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[i])\n                    tail_logprob = tf.nn.log_softmax(tail_logit)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    logprob_i = head_logprob[..., cluster_prob_idx, None] + tail_logprob\n                    out.append(logprob_i)\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_tail_logprob = tf.boolean_mask(tail_logprob, mask)\n               
         cur_logprob = self._gather_logprob(cur_tail_logprob, cur_target)\n                        cur_logprob += cur_head_logprob[:, self.cutoff_ends[1] + i - 1]\n                if target is not None:\n                    loss += tf.scatter_nd(mask_idx, -cur_logprob, tf.cast(shape_list(loss), dtype=tf.int64))\n            out = tf.concat(out, axis=-1)\n\n        if target is not None:\n            if return_mean:\n                loss = tf.reduce_mean(loss)\n            # Add the training-time loss value to the layer using `self.add_loss()`.\n            self.add_loss(loss)\n\n            # Log the loss as a metric (we could log arbitrary metrics,\n            # including different metrics for training and inference.\n            self.add_metric(loss, name=self.name, aggregation=\"mean\" if return_mean else \"\")\n\n        return out\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"TF general model utils.\"\"\"\nimport functools\nimport logging\nimport os\n\nimport h5py\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.python.keras.saving import hdf5_format\n\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import DUMMY_INPUTS, TF2_WEIGHTS_NAME, WEIGHTS_NAME, cached_path, hf_bucket_url, is_remote_url\nfrom .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFModelUtilsMixin:\n    \"\"\"\n    A few utilities for `tf.keras.Model`s, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the model.\n        \"\"\"\n        if only_trainable:\n            return int(sum(np.prod(w.shape.as_list()) for w in self.trainable_variables))\n        else:\n            return self.count_params()\n\n\ndef keras_serializable(cls):\n    \"\"\"\n    Decorate a Keras Layer class to support Keras serialization.\n\n    This is done by:\n    1. adding a `transformers_config` dict to the Keras config dictionary in `get_config` (called by Keras at\n       serialization time\n    2. wrapping `__init__` to accept that `transformers_config` dict (passed by Keras at deserialization time) and\n       convert it to a config object for the actual layer initializer\n    3. 
registering the class as a custom object in Keras (if the Tensorflow version supports this), so that it does\n       not need to be supplied in `custom_objects` in the call to `tf.keras.models.load_model`\n\n    :param cls: a tf.keras.layers.Layers subclass that accepts a `config` argument to its initializer (typically a\n                `TF*MainLayer` class in this project)\n    :return: the same class object, with modifications for Keras deserialization.\n    \"\"\"\n    initializer = cls.__init__\n\n    config_class = getattr(cls, \"config_class\", None)\n    if config_class is None:\n        raise AttributeError(\"Must set `config_class` to use @keras_serializable\")\n\n    @functools.wraps(initializer)\n    def wrapped_init(self, *args, **kwargs):\n        transformers_config = kwargs.pop(\"transformers_config\", None)\n        config = args[0] if args and isinstance(args[0], PretrainedConfig) else kwargs.get(\"config\", None)\n        if config is not None and transformers_config is not None:\n            raise ValueError(\"Must pass either `config` or `transformers_config`, not both\")\n        elif config is not None:\n            # normal layer construction, call with unchanged args (config is already in there)\n            initializer(self, *args, **kwargs)\n        elif transformers_config is not None:\n            # Keras deserialization, convert dict to config\n            config = config_class.from_dict(transformers_config)\n            initializer(self, config, *args, **kwargs)\n        else:\n            raise ValueError(\"Must pass either `config` (PretrainedConfig) or `transformers_config` (dict)\")\n        self._transformers_config = config\n\n    cls.__init__ = wrapped_init\n\n    if not hasattr(cls, \"get_config\"):\n        raise TypeError(\"Only use @keras_serializable on tf.keras.layers.Layer subclasses\")\n    if hasattr(cls.get_config, \"_is_default\"):\n\n        def get_config(self):\n            cfg = super(cls, self).get_config()\n            cfg[\"transformers_config\"] = self._transformers_config.to_dict()\n            return cfg\n\n        cls.get_config = get_config\n\n    cls._keras_serializable = True\n    if hasattr(tf.keras.utils, \"register_keras_serializable\"):\n        cls = tf.keras.utils.register_keras_serializable()(cls)\n    return cls\n\n\nclass TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):\n    r\"\"\" Base class for all TF models.\n\n        :class:`~transformers1.TFPreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top 
of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. \"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def _get_resized_embeddings(self, old_embeddings, new_num_tokens=None):\n        \"\"\" Build a resized Embedding Variable from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``tf.Variable``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        # if new_num_tokens is None:\n        #     return old_embeddings\n\n        # old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        # if old_num_tokens == new_num_tokens:\n        #     return old_embeddings\n\n        # # Build new embeddings\n        # new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        # new_embeddings.to(old_embeddings.weight.device)\n\n        # # initialize all new embeddings (in particular added tokens)\n        # self._init_weights(new_embeddings)\n\n        # # Copy token embeddings from the previous weights\n        # num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        # new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        # return new_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens=None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights 
embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``tf.Variable`` Module of the model.\n\n        Return: ``tf.Variable``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        raise NotImplementedError\n\n    def prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n        \"\"\"\n        raise NotImplementedError\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the :func:`~transformers1.PreTrainedModel.from_pretrained` class method.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Save configuration file\n        self.config.save_pretrained(save_directory)\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, TF2_WEIGHTS_NAME)\n        self.save_weights(output_model_file)\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained TF 2.0 model from a pre-trained model configuration.\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch state_dict save file` (e.g. `./pt_model/pytorch_model.bin`). In this case, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the PyTorch checkpoint in a TensorFlow model using the provided conversion scripts and loading the TensorFlow model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                    - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                    - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            from_pt: (`optional`) boolean, default False:\n                Load the model weights from a PyTorch state_dict save file (see docstring of pretrained_model_name_or_path argument).\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete an incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it has been loaded) and to instantiate the model (e.g. ``output_attention=True``). Behaves differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_pt = kwargs.pop(\"from_pt\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_pt` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME], pretrained_model_name_or_path\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(WEIGHTS_NAME if from_pt else TF2_WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n 
                   cache_dir=cache_dir,\n                    force_download=force_download,\n                    resume_download=resume_download,\n                    proxies=proxies,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {TF2_WEIGHTS_NAME}, {WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if from_pt:\n            # Load from a PyTorch checkpoint\n            return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file, allow_missing_keys=True)\n\n        model(model.dummy_inputs, training=False)  # build the network with dummy inputs\n\n        assert os.path.isfile(resolved_archive_file), \"Error retrieving file {}\".format(resolved_archive_file)\n        # 'by_name' allow us to do transfer learning by skipping/adding layers\n        # see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357\n        try:\n            model.load_weights(resolved_archive_file, by_name=True)\n        except OSError:\n            raise OSError(\n                \"Unable to load weights from h5 file. \"\n                \"If you tried to load a TF 2.0 model from a PyTorch checkpoint, please set from_pt=True. 
\"\n            )\n\n        model(model.dummy_inputs, training=False)  # Make sure restore ops are run\n\n        # Check if the models are the same to output loading informations\n        with h5py.File(resolved_archive_file, \"r\") as f:\n            if \"layer_names\" not in f.attrs and \"model_weights\" in f:\n                f = f[\"model_weights\"]\n            hdf5_layer_names = set(hdf5_format.load_attributes_from_hdf5_group(f, \"layer_names\"))\n        model_layer_names = set(layer.name for layer in model.layers)\n        missing_keys = list(model_layer_names - hdf5_layer_names)\n        unexpected_keys = list(hdf5_layer_names - model_layer_names)\n        error_msgs = []\n\n        if len(missing_keys) > 0:\n            logger.info(\n                \"Layers of {} not initialized from pretrained model: {}\".format(model.__class__.__name__, missing_keys)\n            )\n        if len(unexpected_keys) > 0:\n            logger.info(\n                \"Layers from pretrained model not used in {}: {}\".format(model.__class__.__name__, unexpected_keys)\n            )\n        if len(error_msgs) > 0:\n            raise RuntimeError(\n                \"Error(s) in loading weights for {}:\\n\\t{}\".format(model.__class__.__name__, \"\\n\\t\".join(error_msgs))\n            )\n        if output_loading_info:\n            loading_info = {\"missing_keys\": missing_keys, \"unexpected_keys\": unexpected_keys, \"error_msgs\": error_msgs}\n            return model, loading_info\n\n        return model\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        return {\"inputs\": inputs}\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def generate(\n        self,\n        input_ids=None,\n        max_length=None,\n        min_length=None,\n        do_sample=None,\n        early_stopping=None,\n        num_beams=None,\n        temperature=None,\n        top_k=None,\n        top_p=None,\n        repetition_penalty=None,\n        bad_words_ids=None,\n        bos_token_id=None,\n        pad_token_id=None,\n        eos_token_id=None,\n        length_penalty=None,\n        no_repeat_ngram_size=None,\n        num_return_sequences=None,\n        attention_mask=None,\n        decoder_start_token_id=None,\n        use_cache=None,\n    ):\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy or penalized greedy decoding, sampling with top-k or nucleus sampling\n        and beam-search.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. _`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `tf.Tensor` of `dtype=tf.int32` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `tf.Tensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between 1 and infinity. 
Default to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.\n\n            temperature: (`optional`) float\n                The value used to module the next token probabilities. Must be strictely positive. Default to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.\n\n            bos_token_id: (`optional`) int\n                Beginning of sentence token if no prompt is provided. Default to specicic model bos_token_id or None if it does not exist.\n\n            pad_token_id: (`optional`) int\n                Pad token. Defaults to pad_token_id as defined in the models config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to eos_token_id as defined in the models config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Default to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. Default to 1.\n\n            attention_mask (`optional`) obj: `tf.Tensor` with `dtype=tf.int32` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. 
Defaults to `True`.\n\n        Return:\n\n            output: `tf.Tensor` of `dtype=tf.int32` shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be 
generated\n        \"\"\"\n\n        # We cannot generate if the model does not have a LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have a LM Head. \"\n                \"Please use another model class (e.g. `TFOpenAIGPTLMHeadModel`, `TFXLNetLMHeadModel`, `TFGPT2LMHeadModel`, `TFCTRLLMHeadModel`, `TFT5ForConditionalGeneration`, `TFTransfoXLLMHeadModel`)\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = shape_list(input_ids)[0]  # overridden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, \"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), 
\"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictely positive.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictely positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = tf.fill((batch_size, 1), bos_token_id)\n        else:\n            assert len(shape_list(input_ids)) == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids.numpy()):\n            attention_mask = tf.cast(tf.math.not_equal(input_ids, pad_token_id), dtype=tf.int32)\n        elif attention_mask is None:\n            attention_mask = tf.ones_like(input_ids)\n\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        cur_len = shape_list(input_ids)[1]\n        vocab_size = self.config.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = shape_list(input_ids)[-1]\n            input_ids = tf.broadcast_to(\n                tf.expand_dims(input_ids, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            attention_mask = tf.broadcast_to(\n                tf.expand_dims(attention_mask, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            input_ids = tf.reshape(\n                input_ids, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = tf.reshape(\n                attention_mask, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n\n            # create empty decoder_input_ids\n            input_ids = tf.ones((effective_batch_size * num_beams, 1), dtype=tf.int32,) * decoder_start_token_id\n            cur_len = 1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = tf.reshape(\n                
tf.repeat(tf.expand_dims(tf.range(batch_size), -1), repeats=num_beams * effective_batch_mult, axis=1),\n                shape=(-1,),\n            )\n            # expand encoder_outputs\n            encoder_outputs = (tf.gather(encoder_outputs[0], expanded_batch_idxs, axis=0), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = shape_list(input_ids)[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequence are generated independantly.\n        \"\"\"\n\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = tf.ones_like(input_ids[:, 0])\n        sent_lengths = tf.ones_like(input_ids[:, 0]) * max_length\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n         
   model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask (compare by value, not identity)\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token == eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [batch_size, vocab_size])\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, eos_token_indices_mask, -float(\"inf\")\n                )\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = tf_top_k_top_p_filtering(next_token_logits, 
top_k=top_k, top_p=top_p)\n                # Sample\n                next_token = tf.squeeze(\n                    tf.random.categorical(next_token_logits, dtype=tf.int32, num_samples=1), axis=1\n                )\n            else:\n                # Greedy decoding\n                next_token = tf.math.argmax(next_token_logits, axis=-1, output_type=tf.int32)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = tf.concat([input_ids, tf.expand_dims(tokens_to_add, -1)], 1)\n            cur_len = cur_len + 1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = tf.math.multiply(\n                    unfinished_sents, tf.cast(eos_in_sents, tf.int32)\n                )\n                sent_lengths = (\n                    sent_lengths * (1 - is_sents_unfinished_and_token_to_add_is_eos)\n                    + cur_len * is_sents_unfinished_and_token_to_add_is_eos\n                )\n\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents -= is_sents_unfinished_and_token_to_add_is_eos\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if tf.math.reduce_max(unfinished_sents) == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        min_sent_length = tf.math.reduce_min(sent_lengths)\n        max_sent_length = tf.math.reduce_max(sent_lengths)\n        if min_sent_length != max_sent_length:\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            padding = tf.ones([batch_size, max_sent_length.numpy()], dtype=tf.int32) * pad_token_id\n\n            # create length masks for tf.where operation\n            broad_casted_sent_lengths = tf.broadcast_to(\n                tf.expand_dims(sent_lengths, -1), [batch_size, max_sent_length]\n            )\n            broad_casted_range = tf.transpose(\n                tf.broadcast_to(tf.expand_dims(tf.range(max_sent_length), -1), [max_sent_length, batch_size])\n            )\n\n            decoded = tf.where(broad_casted_range < broad_casted_sent_lengths, input_ids, padding)\n        else:\n            decoded = input_ids\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        
bos_token_id,\n        pad_token_id,\n        decoder_start_token_id,\n        eos_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores_begin = tf.zeros((batch_size, 1), dtype=tf.float32)\n            beam_scores_end = tf.ones((batch_size, num_beams - 1), dtype=tf.float32) * (-1e9)\n            beam_scores = tf.concat([beam_scores_begin, beam_scores_end], -1)\n        else:\n            beam_scores = tf.zeros((batch_size, num_beams), dtype=tf.float32)\n\n        beam_scores = tf.reshape(beam_scores, (batch_size * num_beams,))\n\n        # cache compute states\n        past = encoder_outputs\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            # Temperature (higher temperature => more likely to sample low probability tokens)\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            # calculate log softmax score\n            scores = tf.nn.log_softmax(next_token_logits, axis=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask (compare by value, not identity)\n                num_batch_hypotheses = batch_size * num_beams\n\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token == eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [num_batch_hypotheses, vocab_size])\n\n                scores = set_tensor_by_indices_to_value(scores, eos_token_indices_mask, -float(\"inf\"))\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: 
https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                num_batch_hypotheses = batch_size * num_beams\n                banned_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            assert shape_list(scores) == [batch_size * num_beams, vocab_size]\n\n            if do_sample:\n                _scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # Top-p/top-k filtering\n                _scores = tf_top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                _scores = tf.reshape(_scores, (batch_size, num_beams * vocab_size))\n\n                next_tokens = tf.random.categorical(\n                    _scores, dtype=tf.int32, num_samples=2 * num_beams\n                )  # (batch_size, 2 * num_beams)\n                # Compute next scores\n                next_scores = tf.gather(_scores, next_tokens, batch_dims=1)  # (batch_size, 2 * num_beams)\n\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores_indices = tf.argsort(next_scores, direction=\"DESCENDING\", axis=1)\n                next_scores = tf.gather(next_scores, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n                next_tokens = tf.gather(next_tokens, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n            else:\n                # Add the log prob of the new beams to the log prob of the beginning of the sequence (sum of logs == log of the product)\n                next_scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                
next_scores = tf.reshape(\n                    next_scores, (batch_size, num_beams * vocab_size)\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = tf.math.top_k(next_scores, k=2 * num_beams, sorted=True)\n\n            assert shape_list(next_scores) == shape_list(next_tokens) == [batch_size, 2 * num_beams]\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.numpy() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            tf.identity(input_ids[effective_beam_id]), beam_token_score.numpy()\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n                    tf.reduce_max(next_scores[batch_idx]).numpy(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = tf.convert_to_tensor([x[0] for 
x in next_batch_beam], dtype=tf.float32)\n            beam_tokens = tf.convert_to_tensor([x[1] for x in next_batch_beam], dtype=tf.int32)\n            beam_idx = tf.convert_to_tensor([x[2] for x in next_batch_beam], dtype=tf.int32)\n\n            # re-order batch and update current length\n            input_ids = tf.stack([tf.identity(input_ids[x, :]) for x in beam_idx])\n            input_ids = tf.concat([input_ids, tf.expand_dims(beam_tokens, 1)], axis=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            # Add all open beam hypothesis to generated_hyps\n            if done[batch_idx]:\n                continue\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).numpy().item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert tf.reduce_all(\n                    next_scores[batch_idx, :num_beams] == tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].numpy().item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths_list = []\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                best_hyp = sorted_hyps.pop()[1]\n                sent_lengths_list.append(len(best_hyp))\n                best.append(best_hyp)\n        assert output_batch_size == len(best), \"Output batch size {} must match output beam hypotheses {}\".format(\n            output_batch_size, len(best)\n        )\n\n        sent_lengths = tf.convert_to_tensor(sent_lengths_list, dtype=tf.int32)\n\n        # shorter batches are filled with pad_token\n        if tf.reduce_min(sent_lengths).numpy() != tf.reduce_max(sent_lengths).numpy():\n            assert pad_token_id is not None, \"`Pad_token_id` 
has to be defined\"\n            sent_max_len = min(tf.reduce_max(sent_lengths).numpy() + 1, max_length)\n            decoded_list = []\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                assert sent_lengths[i] == shape_list(hypo)[0]\n                # if sent_length is max_len do not pad\n                if sent_lengths[i] == sent_max_len:\n                    decoded_slice = hypo\n                else:\n                    # else pad to sent_max_len\n                    num_pad_tokens = sent_max_len - sent_lengths[i]\n                    padding = pad_token_id * tf.ones((num_pad_tokens,), dtype=tf.int32)\n                    decoded_slice = tf.concat([hypo, padding], axis=-1)\n\n                    # finish sentence with EOS token\n                    if sent_lengths[i] < max_length:\n                        decoded_slice = tf.where(\n                            tf.range(sent_max_len, dtype=tf.int32) == sent_lengths[i],\n                            eos_token_id * tf.ones((sent_max_len,), dtype=tf.int32),\n                            decoded_slice,\n                        )\n                # add to list\n                decoded_list.append(decoded_slice)\n\n            decoded = tf.stack(decoded_list)\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = tf.stack(best)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        return tuple(tf.gather(layer_past, beam_idx, axis=1) for layer_past in past)\n\n\ndef _create_next_token_logits_penalties(input_ids, logits, repetition_penalty):\n    # create logit penalties for already seen input_ids\n    token_penalties = np.ones(shape_list(logits))\n    prev_input_ids = [np.unique(input_id) for input_id in input_ids.numpy()]\n    for i, prev_input_id in enumerate(prev_input_ids):\n        logit_penalized = logits[i].numpy()[prev_input_id]\n        logit_penalties = np.zeros(logit_penalized.shape)\n        # if previous logit score is < 0 then multiply repetition penalty else divide\n        logit_penalties[logit_penalized < 0] = repetition_penalty\n        logit_penalties[logit_penalized > 0] = 1 / repetition_penalty\n        np.put(token_penalties[i], prev_input_id, logit_penalties)\n    return tf.convert_to_tensor(token_penalties, dtype=tf.float32)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):\n    # Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].numpy().tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].numpy().tolist())\n        return 
generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids, bad_words_ids):\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.numpy().tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef tf_top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float(\"Inf\"), min_tokens_to_keep=1):\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    logits_shape = shape_list(logits)\n\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits_shape[-1])  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < tf.math.top_k(logits, k=top_k)[0][..., -1, None]\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n\n    if top_p < 1.0:\n        sorted_indices = tf.argsort(logits, direction=\"DESCENDING\")\n        sorted_logits = tf.gather(\n            logits, sorted_indices, axis=-1, batch_dims=1\n        )  # expects logits to be of dim (batch_size, vocab_size)\n\n        cumulative_probs = tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove = tf.concat(\n                [\n                    tf.zeros_like(sorted_indices_to_remove[:, :min_tokens_to_keep]),\n                    sorted_indices_to_remove[:, min_tokens_to_keep:],\n                ],\n                -1,\n            )\n\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove = tf.roll(sorted_indices_to_remove, 1, axis=-1)\n        sorted_indices_to_remove = tf.concat(\n            [tf.zeros_like(sorted_indices_to_remove[:, :1]), sorted_indices_to_remove[:, 1:]], -1,\n        )\n        # scatter sorted tensors to original indexing\n        indices_to_remove = scatter_values_on_batch_indices(sorted_indices_to_remove, sorted_indices)\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n    return logits\n\n\ndef scatter_values_on_batch_indices(values, batch_indices):\n    shape = shape_list(batch_indices)\n    # broadcast batch dim to shape\n    broad_casted_batch_dims = tf.reshape(tf.broadcast_to(tf.expand_dims(tf.range(shape[0]), axis=-1), shape), [1, -1])\n    # transform batch_indices to pair_indices\n    pair_indices = tf.transpose(tf.concat([broad_casted_batch_dims, tf.reshape(batch_indices, [1, -1])], 0))\n    # scatter values to pair indices\n    return tf.scatter_nd(pair_indices, tf.reshape(values, [-1]), shape)\n\n\ndef set_tensor_by_indices_to_value(tensor, indices, value):\n    # create value_tensor since tensor value assignment is not possible in TF\n    value_tensor = tf.zeros_like(tensor) + value\n    return tf.where(indices, value_tensor, tensor)\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, 
sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass TFConv1D(tf.keras.layers.Layer):\n    def __init__(self, nf, nx, initializer_range=0.02, **kwargs):\n        \"\"\" TFConv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__(**kwargs)\n        self.nf = nf\n        self.nx = nx\n        self.initializer_range = initializer_range\n\n    def build(self, input_shape):\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.nx, self.nf], initializer=get_initializer(self.initializer_range)\n        )\n        self.bias = self.add_weight(\"bias\", shape=[1, self.nf], initializer=tf.zeros_initializer())\n\n    def call(self, x):\n        bz, sl = shape_list(x)[:2]\n\n        x = tf.reshape(x, [-1, self.nx])\n        x = tf.matmul(x, self.weight) + self.bias\n\n        x = tf.reshape(x, [bz, sl, self.nf])\n\n        return x\n\n\nclass TFSharedEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct shared token embeddings.\n    \"\"\"\n\n    def __init__(self, vocab_size, hidden_size, initializer_range=None, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.initializer_range = hidden_size ** -0.5 if initializer_range is None else initializer_range\n\n    def build(self, input_shape):\n        \"\"\"Build shared token embedding layer\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.vocab_size, self.hidden_size], initializer=get_initializer(self.initializer_range)\n        )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\"):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                
shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, input_ids):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        return tf.gather(self.weight, input_ids)\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [..., hidden_size]\n            Returns:\n                float32 tensor with shape [..., vocab_size].\n        \"\"\"\n        first_dims = shape_list(inputs)[:-1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.weight, transpose_b=True)\n\n        return tf.reshape(logits, first_dims + [self.vocab_size])\n\n\nclass TFSequenceSummary(tf.keras.layers.Layer):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config, initializer_range=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.summary_type = config.summary_type if hasattr(config, \"summary_use_proj\") else \"last\"\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. 
https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.has_summary = hasattr(config, \"summary_use_proj\") and config.summary_use_proj\n        if self.has_summary:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = tf.keras.layers.Dense(\n                num_classes, kernel_initializer=get_initializer(initializer_range), name=\"summary\"\n            )\n\n        self.has_activation = hasattr(config, \"summary_activation\") and config.summary_activation == \"tanh\"\n        if self.has_activation:\n            self.activation = tf.keras.activations.tanh\n\n        self.has_first_dropout = hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0\n        if self.has_first_dropout:\n            self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)\n\n        self.has_last_dropout = hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0\n        if self.has_last_dropout:\n            self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)\n\n    def call(self, inputs, training=False):\n        \"\"\" hidden_states: float Tensor in shape [bsz, seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if not isinstance(inputs, (dict, tuple, list)):\n            hidden_states = inputs\n            cls_index = None\n        elif isinstance(inputs, (tuple, list)):\n            hidden_states = inputs[0]\n            cls_index = inputs[1] if len(inputs) > 1 else None\n            assert len(inputs) <= 2, \"Too many inputs.\"\n        else:\n            hidden_states = inputs.get(\"hidden_states\")\n            cls_index = inputs.get(\"cls_index\", None)\n\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = tf.reduce_mean(hidden_states, axis=1)\n        elif self.summary_type == \"cls_index\":\n            hidden_shape = shape_list(hidden_states)  # e.g. 
[batch, num choices, seq length, hidden dims]\n            if cls_index is None:\n                cls_index = tf.fill(\n                    hidden_shape[:-2], hidden_shape[-2] - 1\n                )  # A tensor full of shape [batch] or [batch, num choices] full of sequence length\n            cls_shape = shape_list(cls_index)\n            if len(cls_shape) <= len(hidden_shape) - 2:\n                cls_index = cls_index[..., tf.newaxis]\n            # else:\n            # cls_index = cls_index[..., tf.newaxis]\n            # cls_index = cls_index.expand((-1,) * (cls_index.dim()-1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = tf.gather(hidden_states, cls_index, batch_dims=len(hidden_shape) - 2)\n            output = tf.squeeze(\n                output, axis=len(hidden_shape) - 2\n            )  # shape of output: (batch, num choices, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        if self.has_first_dropout:\n            output = self.first_dropout(output, training=training)\n\n        if self.has_summary:\n            output = self.summary(output)\n\n        if self.has_activation:\n            output = self.activation(output)\n\n        if self.has_last_dropout:\n            output = self.last_dropout(output, training=training)\n\n        return output\n\n\ndef shape_list(x):\n    \"\"\"Deal with dynamic shape in tensorflow cleanly.\"\"\"\n    static = x.shape.as_list()\n    dynamic = tf.shape(x)\n    return [dynamic[i] if s is None else s for i, s in enumerate(static)]\n\n\ndef get_initializer(initializer_range=0.02):\n    \"\"\"Creates a `tf.initializers.truncated_normal` with the given range.\n    Args:\n        initializer_range: float, initializer range for stddev.\n    Returns:\n        TruncatedNormal initializer with stddev = `initializer_range`.\n    \"\"\"\n    return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSequenceSummary, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = tf.constant(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = tf.constant(np.cos(position_enc[:, 1::2]))\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None, dtype=tf.float32):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    bs = shape_list(lengths)[0]\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        # assert lengths.max().item() <= slen\n        alen = tf.range(slen)\n        mask = tf.math.less(alen, lengths[:, tf.newaxis])\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    if causal:\n        attn_mask = tf.less_equal(\n            tf.tile(alen[tf.newaxis, tf.newaxis, :], (bs, slen, 1)), alen[tf.newaxis, :, tf.newaxis]\n        )\n    else:\n        attn_mask = mask\n\n    # sanity check\n    # assert shape_list(mask) == [bs, slen]\n    tf.debugging.assert_equal(shape_list(mask), [bs, slen])\n    assert causal is False or shape_list(attn_mask) == [bs, slen, slen]\n\n    mask = tf.cast(mask, dtype=dtype)\n    attn_mask = tf.cast(attn_mask, dtype=dtype)\n\n    return mask, attn_mask\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config, **kwargs):\n        super().__init__(**kwargs)\n      
  self.layer_id = next(TFMultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"q_lin\")\n        self.k_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"k_lin\")\n        self.v_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"v_lin\")\n        self.out_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"out_lin\")\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        input, mask, kv, cache, head_mask = inputs\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = shape_list(input)\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = shape_list(kv)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if len(shape_list(mask)) == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, qlen, klen)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))                            # (bs, n_heads, qlen, klen)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n       
 if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TFTransformerFFN(tf.keras.layers.Layer):\n    def __init__(self, in_dim, dim_hidden, out_dim, config, **kwargs):\n        super().__init__(**kwargs)\n        self.lin1 = tf.keras.layers.Dense(dim_hidden, kernel_initializer=get_initializer(config.init_std), name=\"lin1\")\n        self.lin2 = tf.keras.layers.Dense(out_dim, kernel_initializer=get_initializer(config.init_std), name=\"lin2\")\n        self.act = tf.keras.layers.Activation(gelu) if config.gelu_activation else tf.keras.activations.relu\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFXLMMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.attention_dropout = tf.keras.layers.Dropout(config.attention_dropout)\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            self.dim,\n            embeddings_initializer=get_initializer(config.embed_init_std),\n            name=\"position_embeddings\",\n        )\n        if config.sinusoidal_embeddings:\n            raise NotImplementedError\n            # create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = tf.keras.layers.Embedding(\n                self.n_langs,\n                self.dim,\n                embeddings_initializer=get_initializer(config.embed_init_std),\n                name=\"lang_embeddings\",\n            )\n        self.embeddings = TFSharedEmbeddings(\n            
self.n_words, self.dim, initializer_range=config.embed_init_std, name=\"embeddings\"\n        )  # padding_idx=self.pad_index)\n        self.layer_norm_emb = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm_emb\")\n\n        # transformer layers\n        self.attentions = []\n        self.layer_norm1 = []\n        self.ffns = []\n        self.layer_norm2 = []\n        # if self.is_decoder:\n        #     self.layer_norm15 = []\n        #     self.encoder_attn = []\n\n        for i in range(self.n_layers):\n            self.attentions.append(\n                TFMultiHeadAttention(self.n_heads, self.dim, config=config, name=\"attentions_._{}\".format(i))\n            )\n            self.layer_norm1.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm1_._{}\".format(i))\n            )\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(\n                TFTransformerFFN(self.dim, self.hidden_dim, self.dim, config=config, name=\"ffns_._{}\".format(i))\n            )\n            self.layer_norm2.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm2_._{}\".format(i))\n            )\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):  # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = 
inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = 
self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = self.dropout(attn, training=training)\n            tensor = tensor + attn\n            tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass TFXLMPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        # Sometimes XLM has language embeddings so don't forget to build them as well if needed\n        inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFXLMPredLayer(tf.keras.layers.Layer):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n      
  super().__init__(**kwargs)\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        if config.asm is False:\n            self.input_embeddings = input_embeddings\n        else:\n            raise NotImplementedError\n            # self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n            #     in_features=dim,\n            #     n_classes=config.n_words,\n            #     cutoffs=config.asm_cutoffs,\n            #     div_value=config.asm_div_value,\n            #     head_bias=True,  # default is False\n            # )\n\n    def build(self, input_shape):\n        # The output weights are the same as the input embeddings, but there is an output-only bias for each token.\n        self.bias = self.add_weight(shape=(self.n_words,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMWithLMHeadModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.pred_layer = TFXLMPredLayer(config, self.transformer.embeddings, name=\"pred_layer_._proj\")\n\n    def get_output_embeddings(self):\n        return self.pred_layer.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = inputs.shape[0]\n        mask_token = tf.ones((effective_batch_size, 1), dtype=tf.int32) * mask_token_id\n        inputs = tf.concat([inputs, mask_token], axis=1)\n\n        if lang_id is not None:\n            langs = tf.ones_like(inputs) * lang_id\n        else:\n            langs = None\n        return {\"inputs\": inputs, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to 
compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMWithLMHeadModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output)\n        outputs = (outputs,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForSequenceClassification(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(config, initializer_range=config.init_std, name=\"sequence_summary\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForSequenceClassification\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = 
self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.init_std), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForQuestionAnsweringSimple\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0  XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_tf_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef gelu(x):\n    \"\"\" Implementation of the gelu activation function.\n        XLNet is using OpenAI GPT's gelu\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFXLNetRelativeAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n        self.initializer_range = config.initializer_range\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.q = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"q\"\n        )\n        self.k = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"k\"\n        )\n        self.v = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"v\"\n        )\n        self.o = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"o\"\n        )\n        self.r = 
self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"r\"\n        )\n        self.r_r_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n        )\n        self.r_s_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_s_bias\"\n        )\n        self.r_w_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n        )\n        self.seg_embed = self.add_weight(\n            shape=(2, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"seg_embed\"\n        )\n        super().build(input_shape)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def rel_shift(self, x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = shape_list(x)\n\n        x = tf.reshape(x, (x_size[1], x_size[0], x_size[2], x_size[3]))\n        x = x[1:, ...]\n        x = tf.reshape(x, (x_size[0], x_size[1] - 1, x_size[2], x_size[3]))\n        x = x[:, 0:klen, :, :]\n        # x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    def rel_attn_core(self, inputs, training=False):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        q_head, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask, head_mask = inputs\n\n        # content based attention score\n        ac = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift(bd, klen=shape_list(ac)[1])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = tf.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = tf.einsum(\"ijbs,ibns->ijbn\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == tf.float16:\n                attn_score = attn_score - 65500 * attn_mask\n            else:\n                attn_score = attn_score - 1e30 * attn_mask\n\n        # attention probability\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n\n        attn_prob = self.dropout(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # attention output\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, attn_prob\n\n        return attn_vec\n\n    def post_attention(self, inputs, residual=True, training=False):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to `d_model`)\n        h, attn_vec = inputs\n\n        attn_out = tf.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out, training=training)\n\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def call(self, inputs, training=False):\n        (h, g, 
attn_mask_h, attn_mask_g, r, seg_mat, mems, target_mapping, head_mask) = inputs\n\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec_h], training=training)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = tf.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = tf.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = tf.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention([g, attn_vec_g], training=training)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec], training=training)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + 
(attn_prob,)\n        return outputs\n\n\nclass TFXLNetFeedForward(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.layer_1 = tf.keras.layers.Dense(\n            config.d_inner, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_1\"\n        )\n        self.layer_2 = tf.keras.layers.Dense(\n            config.d_model, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_2\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def call(self, inp, training=False):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_2(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass TFXLNetLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.rel_attn = TFXLNetRelativeAttention(config, name=\"rel_attn\")\n        self.ff = TFXLNetFeedForward(config, name=\"ff\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, inputs, training=False):\n        outputs = self.rel_attn(inputs, training=training)\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g, training=training)\n        output_h = self.ff(output_h, training=training)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass TFXLNetLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFXLNetMainLayer(tf.keras.layers.Layer):\n    config_class = XLNetConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n        self.use_bfloat16 = config.use_bfloat16\n        self.initializer_range = config.initializer_range\n\n        self.word_embedding = 
TFSharedEmbeddings(\n            config.vocab_size, config.d_model, initializer_range=config.initializer_range, name=\"word_embedding\"\n        )\n        self.layer = [TFXLNetLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layer)]\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.mask_emb = self.add_weight(\n            shape=(1, 1, self.d_model), initializer=initializer, trainable=True, name=\"mask_emb\"\n        )\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen, dtype=tf.float32):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: length of the current input segment (number of query tokens)\n            mlen: length of the cached memory prepended to the segment\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = tf.ones([qlen, qlen], dtype=dtype)\n        # mask_u keeps the upper triangle (incl. diagonal), mask_dia only the diagonal,\n        # so mask_u - mask_dia marks strictly-future positions as masked\n        mask_u = tf.linalg.band_part(attn_mask, 0, -1)\n        mask_dia = tf.linalg.band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen], dtype=dtype)\n        ret = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.linalg.band_part(attn_mask, -1, 0)\n            ret = tf.concat([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], 1)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        \"\"\"cache hidden states into memory.\"\"\"\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :]\n\n        return tf.stop_gradient(new_mem)\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], axis=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = tf.tile(pos_emb, [1, bsz, 1])\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None, dtype=None):\n        \"\"\"create relative positional encoding.\"\"\"\n        freq_seq = tf.range(0, self.d_model, 2.0)\n        if dtype is not None and dtype != tf.float32:\n            freq_seq = tf.cast(freq_seq, dtype=dtype)\n        inv_freq = 1 / (10000 ** (freq_seq / self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` 
{}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            bwd_pos_seq = tf.range(-beg, -end, 1.0)\n\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n                bwd_pos_seq = tf.cast(bwd_pos_seq, dtype=dtype)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n                bwd_pos_seq = tf.clip_by_value(bwd_pos_seq, -self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                # With bi_data, the batch size should be divisible by 2.\n                assert bsz % 2 == 0\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = tf.concat([fwd_pos_emb, bwd_pos_emb], axis=1)\n        else:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        return pos_emb\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            mems = inputs[2] if len(inputs) > 2 else mems\n            perm_mask = inputs[3] if len(inputs) > 3 else perm_mask\n            target_mapping = inputs[4] if len(inputs) > 4 else target_mapping\n            token_type_ids = inputs[5] if len(inputs) > 5 else token_type_ids\n            input_mask = inputs[6] if len(inputs) > 6 else input_mask\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            use_cache = inputs[9] if len(inputs) > 9 else use_cache\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            mems = inputs.get(\"mems\", mems)\n            perm_mask = inputs.get(\"perm_mask\", perm_mask)\n            target_mapping = inputs.get(\"target_mapping\", target_mapping)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            input_mask = inputs.get(\"input_mask\", input_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for XLNet 
uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)[:2]\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = tf.transpose(token_type_ids, perm=(1, 0)) if token_type_ids is not None else None\n        input_mask = tf.transpose(input_mask, perm=(1, 0)) if input_mask is not None else None\n        attention_mask = tf.transpose(attention_mask, perm=(1, 0)) if attention_mask is not None else None\n        perm_mask = tf.transpose(perm_mask, perm=(1, 2, 0)) if perm_mask is not None else None\n        target_mapping = tf.transpose(target_mapping, perm=(1, 2, 0)) if target_mapping is not None else None\n\n        mlen = shape_list(mems[0])[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = tf.bfloat16 if self.use_bfloat16 else tf.float32\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        assert input_mask is None or attention_mask is None, (\n            \"You can only use one of input_mask (uses 1 for padding) \"\n            \"or attention_mask (uses 0 for padding, added for compatbility with BERT). 
Please choose one.\"\n        )\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - tf.cast(attention_mask, dtype=dtype_float)\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = tf.zeros([shape_list(data_mask)[0], mlen, bsz], dtype=dtype_float)\n                data_mask = tf.concat([mems_mask, data_mask], axis=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = tf.cast(attn_mask > 0, dtype=dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -tf.eye(qlen, dtype=dtype_float)\n            if mlen > 0:\n                non_tgt_mask = tf.concat([tf.zeros([qlen, mlen], dtype=dtype_float), non_tgt_mask], axis=-1)\n            non_tgt_mask = tf.cast((attn_mask + non_tgt_mask[:, :, None, None]) > 0, dtype=dtype_float)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k, training=training)\n        if target_mapping is not None:\n            word_emb_q = tf.tile(self.mask_emb, [shape_list(target_mapping)[0], bsz, 1])\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q, training=training)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = tf.zeros([mlen, bsz], dtype=tf.int32)\n                cat_ids = tf.concat([mem_pad, token_type_ids], 0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = tf.cast(tf.logical_not(tf.equal(token_type_ids[:, None], cat_ids[None, :])), tf.int32)\n            seg_mat = tf.one_hot(seg_mat, 2, dtype=dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz, dtype=dtype_float)\n        pos_emb = self.dropout(pos_emb, training=training)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n     
   if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            # cache new mems\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                [output_h, output_g, non_tgt_mask, attn_mask, pos_emb, seg_mat, mems[i], target_mapping, head_mask[i]],\n                training=training,\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h, training=training)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (tf.transpose(output, perm=(1, 0, 2)),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(tf.transpose(hs, perm=(1, 0, 2)) for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\nclass TFXLNetPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    base_model_prefix = \"transformer\"\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.XLNetTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetModel.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetLMHeadModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.lm_loss = TFXLNetLMHead(config, self.transformer.word_embedding, name=\"lm_loss\")\n\n    def get_output_embeddings(self):\n        return self.lm_loss.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = inputs.shape[0]\n        dummy_token = tf.zeros((effective_batch_size, 1), dtype=tf.int32)\n        inputs = tf.concat([inputs, dummy_token], axis=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = inputs.shape[1]\n        perm_mask = tf.zeros((effective_batch_size, sequence_length, sequence_length - 1), dtype=tf.float32)\n        perm_mask_seq_end = tf.ones((effective_batch_size, sequence_length, 1), dtype=tf.float32)\n        perm_mask = tf.concat([perm_mask, perm_mask_seq_end], axis=-1)\n\n        # We'll only predict the last token\n        target_mapping = tf.zeros((effective_batch_size, 1, sequence_length - 1), dtype=tf.float32)\n        target_mapping_seq_end = tf.ones((effective_batch_size, 1, 1), dtype=tf.float32)\n        target_mapping = tf.concat([target_mapping, target_mapping_seq_end], axis=-1)\n\n        inputs = {\n            \"inputs\": inputs,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        import numpy as np\n        from transformers1 import XLNetTokenizer, TFXLNetLMHeadModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=True))[None, :]  # We will predict the masked token\n        perm_mask = np.zeros((1, input_ids.shape[1], input_ids.shape[1]))\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = np.zeros((1, 1, input_ids.shape[1]))  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n        outputs = model(input_ids, perm_mask=tf.constant(perm_mask, dtype=tf.float32), target_mapping=tf.constant(target_mapping, dtype=tf.float32))\n\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_state = transformer_outputs[0]\n        logits = self.lm_loss(hidden_state)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"sequence_summary\"\n        )\n        self.logits_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"logits_proj\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForSequenceClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForTokenClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForTokenClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = self.classifier(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForQuestionAnsweringSimple\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = TFXLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n   
     return outputs  # start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n# @add_start_docstrings(\"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n#     the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n#     XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING)\n# class TFXLNetForQuestionAnswering(TFXLNetPreTrainedModel):\n#     r\"\"\"\n#     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n#         **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n#         **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Indices for the top config.start_n_top start token possibilities (beam-search).\n#         **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size,)``\n#             Log probabilities for the ``is_impossible`` label of the answers.\n#         **mems**:\n#             list of ``tf.Tensor`` (one for each layer):\n#             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n#             if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.\n#             See details in the docstring of the `mems` input above.\n#         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n#             list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)\n#             of shape ``(batch_size, sequence_length, hidden_size)``:\n#             Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n#         **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n#             list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n#             Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n#     Examples::\n\n#         # For example purposes. 
Not runnable.\n#         tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n#         model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')\n#         input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n#         start_positions = tf.constant([1])\n#         end_positions = tf.constant([3])\n#         outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n#         loss, start_scores, end_scores = outputs[:2]\n\n#     \"\"\"\n#     def __init__(self, config, *inputs, **kwargs):\n#         super().__init__(config, *inputs, **kwargs)\n#         self.start_n_top = config.start_n_top\n#         self.end_n_top = config.end_n_top\n\n#         self.transformer = TFXLNetMainLayer(config, name='transformer')\n#         self.start_logits = TFPoolerStartLogits(config, name='start_logits')\n#         self.end_logits = TFPoolerEndLogits(config, name='end_logits')\n#         self.answer_class = TFPoolerAnswerClass(config, name='answer_class')\n\n#     def call(self, inputs, training=False):\n#         transformer_outputs = self.transformer(inputs, training=training)\n#         hidden_states = transformer_outputs[0]\n#         start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n#         outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n#         if start_positions is not None and end_positions is not None:\n#             # If we are on multi-GPU, let's remove the dimension added by batch splitting\n#             for x in (start_positions, end_positions, cls_index, is_impossible):\n#                 if x is not None and x.dim() > 1:\n#                     x.squeeze_(-1)\n\n#             # during training, compute the end logits based on the ground truth of the start position\n#             end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n#             loss_fct = CrossEntropyLoss()\n#             start_loss = loss_fct(start_logits, start_positions)\n#             end_loss = loss_fct(end_logits, end_positions)\n#             total_loss = (start_loss + end_loss) / 2\n\n#             if cls_index is not None and is_impossible is not None:\n#                 # Predict answerability from the representation of CLS and START\n#                 cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n#                 loss_fct_cls = nn.BCEWithLogitsLoss()\n#                 cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n#                 # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n#                 total_loss += cls_loss * 0.5\n\n#             outputs = (total_loss,) + outputs\n\n#         else:\n#             # during inference, compute the end logits based on beam search\n#             bsz, slen, hsz = hidden_states.size()\n#             start_log_probs = F.softmax(start_logits, dim=-1) # shape (bsz, slen)\n\n#             start_top_log_probs, start_top_index = torch.topk(start_log_probs, self.start_n_top, dim=-1) # shape (bsz, start_n_top)\n#             start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz) # shape (bsz, start_n_top, hsz)\n#             start_states = torch.gather(hidden_states, -2, start_top_index_exp) # shape (bsz, start_n_top, hsz)\n#             start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1) # shape (bsz, 
slen, start_n_top, hsz)\n\n#             hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(start_states) # shape (bsz, slen, start_n_top, hsz)\n#             p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n#             end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n#             end_log_probs = F.softmax(end_logits, dim=1) # shape (bsz, slen, start_n_top)\n\n#             end_top_log_probs, end_top_index = torch.topk(end_log_probs, self.end_n_top, dim=1) # shape (bsz, end_n_top, start_n_top)\n#             end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n#             end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n#             start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)  # get the representation of START as weighted sum of hidden states\n#             cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)  # Shape (batch size,): one single `cls_logits` for each sample\n\n#             outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n#         # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n#         # or (if labels are provided) (total_loss,)\n#         return outputs\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n    In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py\n\"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_transfo_xl_utilities import ProjectedAdaptiveLogSoftmax\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\ndef build_tf_to_pytorch_map(model, config):\n    \"\"\" A map of modules from TF to PyTorch.\n        This time I use a map to keep the PyTorch model as identical to the original PyTorch model as possible.\n    \"\"\"\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        # We are loading in a TransfoXLLMHeadModel => we will load also the Adaptive Softmax\n        tf_to_pt_map.update(\n            {\n                \"transformer/adaptive_softmax/cutoff_0/cluster_W\": model.crit.cluster_weight,\n                \"transformer/adaptive_softmax/cutoff_0/cluster_b\": model.crit.cluster_bias,\n            }\n        )\n        for i, (out_l, proj_l, tie_proj) in enumerate(\n            zip(model.crit.out_layers, model.crit.out_projs, config.tie_projs)\n        ):\n            layer_str = \"transformer/adaptive_softmax/cutoff_%d/\" % i\n            if config.tie_weight:\n                tf_to_pt_map.update({layer_str + \"b\": out_l.bias})\n            else:\n                raise NotImplementedError\n                # I don't think this is implemented in the TF code\n                tf_to_pt_map.update({layer_str + \"lookup_table\": out_l.weight, layer_str + \"b\": out_l.bias})\n            if not tie_proj:\n                tf_to_pt_map.update({layer_str + \"proj\": proj_l})\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings\n    for i, (embed_l, proj_l) in enumerate(zip(model.word_emb.emb_layers, model.word_emb.emb_projs)):\n        layer_str = \"transformer/adaptive_embed/cutoff_%d/\" % i\n        tf_to_pt_map.update({layer_str + \"lookup_table\": embed_l.weight, layer_str + \"proj_W\": proj_l})\n\n    # Transformer blocks\n    for i, b in enumerate(model.layers):\n        layer_str = \"transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.dec_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": 
b.dec_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.dec_attn.o_net.weight,\n                layer_str + \"rel_attn/qkv/kernel\": b.dec_attn.qkv_net.weight,\n                layer_str + \"rel_attn/r/kernel\": b.dec_attn.r_net.weight,\n                layer_str + \"ff/LayerNorm/gamma\": b.pos_ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.pos_ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.pos_ff.CoreNet[0].weight,\n                layer_str + \"ff/layer_1/bias\": b.pos_ff.CoreNet[0].bias,\n                layer_str + \"ff/layer_2/kernel\": b.pos_ff.CoreNet[3].weight,\n                layer_str + \"ff/layer_2/bias\": b.pos_ff.CoreNet[3].bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        for b in model.layers:\n            r_r_list.append(b.dec_attn.r_r_bias)\n            r_w_list.append(b.dec_attn.r_w_bias)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n    tf_to_pt_map.update({\"transformer/r_r_bias\": r_r_list, \"transformer/r_w_bias\": r_w_list})\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_transfo_xl(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_to_pytorch_map(model, config)\n\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    for name, pointer in tf_to_pt_map.items():\n        assert name in tf_weights\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name or \"proj\" in name:\n            array = np.transpose(array)\n        if (\"r_r_bias\" in name or \"r_w_bias\" in name) and len(pointer) > 1:\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    
logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nclass PositionalEmbedding(nn.Module):\n    def __init__(self, demb):\n        super().__init__()\n\n        self.demb = demb\n\n        inv_freq = 1 / (10000 ** (torch.arange(0.0, demb, 2.0) / demb))\n        self.register_buffer(\"inv_freq\", inv_freq)\n\n    def forward(self, pos_seq, bsz=None):\n        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)\n        pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)\n\n        if bsz is not None:\n            return pos_emb[:, None, :].expand(-1, bsz, -1)\n        else:\n            return pos_emb[:, None, :]\n\n\nclass PositionwiseFF(nn.Module):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5):\n        super().__init__()\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.CoreNet = nn.Sequential(\n            nn.Linear(d_model, d_inner),\n            nn.ReLU(inplace=True),\n            nn.Dropout(dropout),\n            nn.Linear(d_inner, d_model),\n            nn.Dropout(dropout),\n        )\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.pre_lnorm = pre_lnorm\n\n    def forward(self, inp):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.CoreNet(self.layer_norm(inp))\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.CoreNet(inp)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass RelPartialLearnableMultiHeadAttn(nn.Module):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n    ):\n        super().__init__()\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False)\n\n        self.drop = nn.Dropout(dropout)\n        self.dropatt = nn.Dropout(dropatt)\n        self.o_net = nn.Linear(n_head * d_head, d_model, bias=False)\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is None or r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        else:\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n\n        self.r_net = nn.Linear(self.d_model, self.n_head * self.d_head, bias=False)\n\n    def _rel_shift(self, x):\n        zero_pad_shape = (x.size(0), 1) + x.size()[2:]\n        zero_pad = torch.zeros(zero_pad_shape, device=x.device, dtype=x.dtype)\n        x_padded = torch.cat([zero_pad, x], dim=1)\n\n        x_padded_shape = (x.size(1) + 1, x.size(0)) + x.size()[2:]\n        x_padded = 
x_padded.view(*x_padded_shape)\n\n        x = x_padded[1:].view_as(x)\n\n        return x\n\n    def forward(self, w, r, attn_mask=None, mems=None, head_mask=None):\n        qlen, rlen, bsz = w.size(0), r.size(0), w.size(1)\n\n        if mems is not None:\n            cat = torch.cat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n\n        klen = w_head_k.size(0)\n\n        w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n\n        r_head_k = r_head_k.view(rlen, self.n_head, self.d_head)  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = torch.einsum(\"ibnd,jbnd->ijbn\", (rw_head_q, w_head_k))  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = torch.einsum(\"ibnd,jnd->ijbn\", (rr_head_q, r_head_k))  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score.mul_(self.scale)\n\n        # compute attention probability\n        if attn_mask is not None and torch.sum(attn_mask).item():\n            attn_mask = attn_mask == 1  # Switch to bool\n            if attn_mask.dim() == 2:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = (\n                        attn_score.float().masked_fill(attn_mask[None, :, :, None], -65000).type_as(attn_score)\n                    )\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[None, :, :, None], -1e30).type_as(attn_score)\n            elif attn_mask.dim() == 3:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -65000).type_as(attn_score)\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -1e30).type_as(attn_score)\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = F.softmax(attn_score, dim=1)\n        attn_prob = self.dropatt(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = torch.einsum(\"ijbn,jbnd->ibnd\", (attn_prob, w_head_v))\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec = attn_vec.contiguous().view(attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head)\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        
else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass RelPartialLearnableDecoderLayer(nn.Module):\n    def __init__(self, n_head, d_model, d_head, d_inner, dropout, layer_norm_epsilon=1e-5, **kwargs):\n        super().__init__()\n\n        self.dec_attn = RelPartialLearnableMultiHeadAttn(\n            n_head, d_model, d_head, dropout, layer_norm_epsilon=layer_norm_epsilon, **kwargs\n        )\n        self.pos_ff = PositionwiseFF(\n            d_model, d_inner, dropout, pre_lnorm=kwargs.get(\"pre_lnorm\"), layer_norm_epsilon=layer_norm_epsilon\n        )\n\n    def forward(self, dec_inp, r, dec_attn_mask=None, mems=None, head_mask=None):\n\n        attn_outputs = self.dec_attn(dec_inp, r, attn_mask=dec_attn_mask, mems=mems, head_mask=head_mask)\n        ff_output = self.pos_ff(attn_outputs[0])\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass AdaptiveEmbedding(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, sample_softmax=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = nn.ModuleList()\n        self.emb_projs = nn.ParameterList()\n        if div_val == 1:\n            self.emb_layers.append(nn.Embedding(n_token, d_embed, sparse=sample_softmax > 0))\n            if d_proj != d_embed:\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(nn.Embedding(r_idx - l_idx, d_emb_i))\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n    def forward(self, inp):\n        if self.div_val == 1:\n            embed = self.emb_layers[0](inp)\n            if self.d_proj != self.d_embed:\n                embed = F.linear(embed, self.emb_projs[0])\n        else:\n            param = next(self.parameters())\n            inp_flat = inp.view(-1)\n            emb_flat = torch.zeros([inp_flat.size(0), self.d_proj], dtype=param.dtype, device=param.device)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n                indices_i = mask_i.nonzero().squeeze()\n\n                if indices_i.numel() == 0:\n                    continue\n\n                inp_i = inp_flat.index_select(0, indices_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = F.linear(emb_i, self.emb_projs[i])\n\n                emb_flat.index_copy_(0, indices_i, emb_i)\n\n            embed_shape = inp.size() + (self.d_proj,)\n            embed = emb_flat.view(embed_shape)\n\n        embed.mul_(self.emb_scale)\n\n        return embed\n\n\nclass TransfoXLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    
config_class = TransfoXLConfig\n    load_tf_weights = load_tf_weights_in_transfo_xl\n    base_model_prefix = \"transformer\"\n\n    def _init_weight(self, weight):\n        if self.config.init == \"uniform\":\n            nn.init.uniform_(weight, -self.config.init_range, self.config.init_range)\n        elif self.config.init == \"normal\":\n            nn.init.normal_(weight, 0.0, self.config.init_std)\n\n    def _init_bias(self, bias):\n        nn.init.constant_(bias, 0.0)\n\n    def _init_weights(self, m):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        classname = m.__class__.__name__\n        if classname.find(\"Linear\") != -1:\n            if hasattr(m, \"weight\") and m.weight is not None:\n                self._init_weight(m.weight)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        elif classname.find(\"AdaptiveEmbedding\") != -1:\n            if hasattr(m, \"emb_projs\"):\n                for i in range(len(m.emb_projs)):\n                    if m.emb_projs[i] is not None:\n                        nn.init.normal_(m.emb_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"Embedding\") != -1:\n            if hasattr(m, \"weight\"):\n                self._init_weight(m.weight)\n        elif classname.find(\"ProjectedAdaptiveLogSoftmax\") != -1:\n            if hasattr(m, \"cluster_weight\") and m.cluster_weight is not None:\n                self._init_weight(m.cluster_weight)\n            if hasattr(m, \"cluster_bias\") and m.cluster_bias is not None:\n                self._init_bias(m.cluster_bias)\n            if hasattr(m, \"out_projs\"):\n                for i in range(len(m.out_projs)):\n                    if m.out_projs[i] is not None:\n                        nn.init.normal_(m.out_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"LayerNorm\") != -1:\n            if hasattr(m, \"weight\"):\n                nn.init.normal_(m.weight, 1.0, self.config.init_std)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        else:\n            if hasattr(m, \"r_emb\"):\n                self._init_weight(m.r_emb)\n            if hasattr(m, \"r_w_bias\"):\n                self._init_weight(m.r_w_bias)\n            if hasattr(m, \"r_r_bias\"):\n                self._init_weight(m.r_r_bias)\n            if hasattr(m, \"r_bias\"):\n                self._init_bias(m.r_bias)\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            
:func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n\n        self.word_emb = AdaptiveEmbedding(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.drop = nn.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        if not config.untie_r:\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n\n        self.layers = nn.ModuleList()\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    RelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if config.untie_r else self.r_w_bias,\n                        r_r_bias=None if config.untie_r else self.r_r_bias,\n                     
   output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings are not used in our pretrained checkpoints\n            raise NotImplementedError  # Removed them to avoid maintaining dead code\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = PositionalEmbedding(self.d_model)\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_emb = new_embeddings\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        logger.info(\"Head pruning is not implemented for Transformer-XL model\")\n        pass\n\n    def init_mems(self, bsz):\n        if self.mem_len > 0:\n            mems = []\n            param = next(self.parameters())\n            for i in range(self.n_layer):\n                empty = torch.zeros(self.mem_len, bsz, self.config.d_model, dtype=param.dtype, device=param.device)\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        with torch.no_grad():\n            new_mems = []\n            end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n            beg_idx = max(0, end_idx - self.mem_len)\n            for i in range(len(hids)):\n\n                cat = torch.cat([mems[i], hids[i]], dim=0)\n                new_mems.append(cat[beg_idx:end_idx].detach())\n\n        return new_mems\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.size()\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = mems[0].size(0) if mems is not None else 0\n        klen = mlen + qlen\n        if self.same_length:\n            all_ones = 
word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n            mask_len = klen - self.mem_len\n            if mask_len > 0:\n                mask_shift_len = qlen - mask_len\n            else:\n                mask_shift_len = qlen\n            dec_attn_mask = (torch.triu(all_ones, 1 + mlen) + torch.tril(all_ones, -mask_shift_len))[:, :, None]  # -1\n        else:\n            dec_attn_mask = torch.triu(word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1 + mlen)[\n                :, :, None\n            ]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, dtype=word_emb.dtype)\n            if self.clamp_len > 0:\n                pos_seq.clamp_(max=self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb)\n            pos_emb = self.drop(pos_emb)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer(\n                    core_out, pos_emb, dec_attn_mask=dec_attn_mask, mems=mems_i, head_mask=head_mask[i]\n                )\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [core_out.transpose(0, 1).contiguous(), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(t.transpose(0, 1).contiguous() for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs.append(attentions)\n\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLLMHeadModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TransfoXLModel(config)\n        self.sample_softmax = config.sample_softmax\n\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = ProjectedAdaptiveLogSoftmax(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.init_weights()\n\n    def tie_weights(self):\n        \"\"\"\n        Run this to be sure output and input (adaptive) softmax weights are tied\n        \"\"\"\n\n        if self.config.tie_weight:\n            for i in range(len(self.crit.out_layers)):\n                self._tie_or_clone_weights(self.crit.out_layers[i], self.transformer.word_emb.emb_layers[i])\n        if self.config.tie_projs:\n            for i, tie_proj in enumerate(self.config.tie_projs):\n                if tie_proj and self.config.div_val == 1 and self.config.d_model != self.config.d_embed:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[0].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[0]\n                elif tie_proj and self.config.div_val != 1:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[i].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[i]\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(batch_size, sequence_length-1)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLLMHeadModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if input_ids is not None:\n            bsz, tgt_len = input_ids.size(0), input_ids.size(1)\n        elif inputs_embeds is not None:\n            bsz, tgt_len = inputs_embeds.size(0), inputs_embeds.size(1)\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        transformer_outputs = self.transformer(input_ids, mems=mems, head_mask=head_mask, inputs_embeds=inputs_embeds)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit(pred_hid, labels)\n        if labels is None:\n            softmax_output = softmax_output.view(bsz, tgt_len, -1)\n            outputs = [softmax_output] + outputs\n        else:\n            softmax_output = softmax_output.view(bsz, tgt_len - 1)\n            outputs = [softmax_output, None] + outputs\n\n        return outputs  # (loss), logits or None if labels is not None (speed up adaptive softmax), new_mems, (all hidden states), (all attentions)\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if self.sample_softmax > 0:\n            return self.out_layer\n        else:\n            return self.crit.out_layers[-1]\n\n    def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):\n        inputs = {\"input_ids\": input_ids}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Utilities for PyTorch Transformer XL model.\n    Directly adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\n# CUDA_MAJOR = int(torch.version.cuda.split('.')[0])\n# CUDA_MINOR = int(torch.version.cuda.split('.')[1])\n\n\nclass ProjectedAdaptiveLogSoftmax(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, keep_order=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [n_token]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n\n        if self.n_clusters > 0:\n            self.cluster_weight = nn.Parameter(torch.zeros(self.n_clusters, self.d_embed))\n            self.cluster_bias = nn.Parameter(torch.zeros(self.n_clusters))\n\n        self.out_layers = nn.ModuleList()\n        self.out_projs = nn.ParameterList()\n\n        if div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if d_proj != d_embed:\n                    self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n                else:\n                    self.out_projs.append(None)\n\n            self.out_layers.append(nn.Linear(d_embed, n_token))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n\n                self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n                self.out_layers.append(nn.Linear(d_emb_i, r_idx - l_idx))\n\n        self.keep_order = keep_order\n\n    def _compute_logit(self, hidden, weight, bias, proj):\n        if proj is None:\n            logit = F.linear(hidden, weight, bias=bias)\n        else:\n            # if CUDA_MAJOR <= 9 and CUDA_MINOR <= 1:\n            proj_hid = F.linear(hidden, proj.t().contiguous())\n            logit = F.linear(proj_hid, weight, bias=bias)\n            # else:\n            #     logit = torch.einsum('bd,de,ev->bv', (hidden, proj, weight.t()))\n            #     if bias is not None:\n            #         logit = logit + bias\n\n        return logit\n\n    def forward(self, hidden, labels=None, keep_order=False):\n        \"\"\"\n            Params:\n                hidden :: [len*bsz x d_proj]\n                labels :: [len*bsz]\n            Return:\n                if labels is None:\n                    out :: [len*bsz x n_tokens] log probabilities of tokens over 
the vocabulary\n                else:\n                    out :: [(len-1)*bsz] Negative log likelihood\n            We could replace this implementation by the native PyTorch one\n            if their's had an option to set bias on all clusters in the native one.\n            here: https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138\n        \"\"\"\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            hidden = hidden[..., :-1, :].contiguous()\n            labels = labels[..., 1:].contiguous()\n            hidden = hidden.view(-1, hidden.size(-1))\n            labels = labels.view(-1)\n            if hidden.size(0) != labels.size(0):\n                raise RuntimeError(\"Input and labels should have the same size \" \"in the batch dimension.\")\n        else:\n            hidden = hidden.view(-1, hidden.size(-1))\n\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            if labels is not None:\n                out = -F.log_softmax(logit, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)\n            else:\n                out = F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            if labels is None:\n                out = hidden.new_empty((head_logit.size(0), self.n_token))\n            else:\n                out = torch.zeros_like(labels, dtype=hidden.dtype, device=hidden.device)\n\n            offset = 0\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if labels is not None:\n                    mask_i = (labels >= l_idx) & (labels < r_idx)\n                    indices_i = mask_i.nonzero().squeeze()\n\n                    if indices_i.numel() == 0:\n                        continue\n\n                    target_i = labels.index_select(0, indices_i) - l_idx\n                    head_logprob_i = head_logprob.index_select(0, indices_i)\n                    hidden_i = hidden.index_select(0, indices_i)\n                else:\n                    hidden_i = hidden\n\n                if i == 0:\n                    if labels is not None:\n                        logprob_i = head_logprob_i.gather(1, target_i[:, None]).squeeze(1)\n                    else:\n                        out[:, : self.cutoffs[0]] = 
head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden_i, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    if labels is not None:\n                        logprob_i = head_logprob_i[:, cluster_prob_idx] + tail_logprob_i.gather(\n                            1, target_i[:, None]\n                        ).squeeze(1)\n                    else:\n                        logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                        out[:, l_idx:r_idx] = logprob_i\n\n                if labels is not None:\n                    if (hasattr(self, \"keep_order\") and self.keep_order) or keep_order:\n                        out.index_copy_(0, indices_i, -logprob_i)\n                    else:\n                        out[offset : offset + logprob_i.size(0)].copy_(-logprob_i)\n                    offset += logprob_i.size(0)\n\n        return out\n\n    def log_prob(self, hidden):\n        r\"\"\" Computes log probabilities for all :math:`n\\_classes`\n        From: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/adaptive.py\n        Args:\n            hidden (Tensor): a minibatch of examples\n        Returns:\n            log-probabilities of for each class :math:`c`\n            in range :math:`0 <= c <= n\\_classes`, where :math:`n\\_classes` is a\n            parameter passed to ``AdaptiveLogSoftmaxWithLoss`` constructor.\n        Shape:\n            - Input: :math:`(N, in\\_features)`\n            - Output: :math:`(N, n\\_classes)`\n        \"\"\"\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            return F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n\n            out = hidden.new_empty((head_logit.size(0), self.n_token))\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                start_idx, stop_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if i == 0:\n                    out[:, : self.cutoffs[0]] = head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], 
biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n\n                    # Same cluster index as in forward(); keep the extra dim so the cluster\n                    # log-prob broadcasts over the tail vocabulary, and write into the slice.\n                    cluster_prob_idx = self.cutoffs[0] + i - 1\n                    logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                    out[:, start_idx:stop_idx] = logprob_i\n\n            return out\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors, Facebook AI Research authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport inspect\nimport logging\nimport os\nfrom typing import Callable, Dict, Iterable, List, Optional, Tuple\n\nimport torch\nfrom torch import Tensor, device, dtype, nn\nfrom torch.nn import CrossEntropyLoss\nfrom torch.nn import functional as F\n\nfrom .activations import get_activation\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import (\n    DUMMY_INPUTS,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\ntry:\n    from torch.nn import Identity\nexcept ImportError:\n    # Older PyTorch compatibility\n    class Identity(nn.Module):\n        r\"\"\"A placeholder identity operator that is argument-insensitive.\n        \"\"\"\n\n        def __init__(self, *args, **kwargs):\n            super().__init__()\n\n        def forward(self, input):\n            return input\n\n\nclass ModuleUtilsMixin:\n    \"\"\"\n    A few utilities for torch.nn.Modules, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the module.\n        \"\"\"\n        params = filter(lambda x: x.requires_grad, self.parameters()) if only_trainable else self.parameters()\n        return sum(p.numel() for p in params)\n\n    @staticmethod\n    def _hook_rss_memory_pre_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_pre_forward = mem.rss\n        return None\n\n    @staticmethod\n    def _hook_rss_memory_post_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_post_forward = mem.rss\n        mem_rss_diff = module.mem_rss_post_forward - module.mem_rss_pre_forward\n        module.mem_rss_diff = mem_rss_diff + (module.mem_rss_diff if hasattr(module, \"mem_rss_diff\") else 0)\n        return None\n\n    def add_memory_hooks(self):\n        \"\"\" Add a memory hook before and after each sub-module forward pass to record increase in memory consumption.\n            Increase in memory consumption is stored in a `mem_rss_diff` attribute for each module and can be reset to zero with `model.reset_memory_hooks_state()`\n        \"\"\"\n        for module in 
self.modules():\n            module.register_forward_pre_hook(self._hook_rss_memory_pre_forward)\n            module.register_forward_hook(self._hook_rss_memory_post_forward)\n        self.reset_memory_hooks_state()\n\n    def reset_memory_hooks_state(self):\n        for module in self.modules():\n            module.mem_rss_diff = 0\n            module.mem_rss_post_forward = 0\n            module.mem_rss_pre_forward = 0\n\n    @property\n    def device(self) -> device:\n        \"\"\"\n        Get torch.device from module, assuming that the whole module has one device.\n        \"\"\"\n        try:\n            return next(self.parameters()).device\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].device\n\n    @property\n    def dtype(self) -> dtype:\n        \"\"\"\n        Get torch.dtype from module, assuming that the whole module has one dtype.\n        \"\"\"\n        try:\n            return next(self.parameters()).dtype\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].dtype\n\n    def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:\n        \"\"\"type: torch.Tensor -> torch.Tensor\"\"\"\n        if encoder_attention_mask.dim() == 3:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n        if encoder_attention_mask.dim() == 2:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow\n        # /transformer/transformer_layers.py#L270\n        # encoder_extended_attention_mask = (encoder_extended_attention_mask ==\n        # encoder_extended_attention_mask.transpose(-1, -2))\n        encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n\n        if self.dtype == torch.float16:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e4\n        elif self.dtype == torch.float32:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            raise ValueError(\n                \"{} not recognized. 
`dtype` should be set to either `torch.float32` or `torch.float16`\".format(\n                    self.dtype\n                )\n            )\n\n        return encoder_extended_attention_mask\n\n    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple, device: device) -> Tensor:\n        \"\"\"Makes broadcastable attention mask and causal mask so that future and maked tokens are ignored.\n\n        Arguments:\n            attention_mask: torch.Tensor with 1 indicating tokens to ATTEND to\n            input_shape: tuple, shape of input_ids\n            device: torch.Device, usually self.device\n\n        Returns:\n            torch.Tensor with dtype of attention_mask.dtype\n        \"\"\"\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        if attention_mask.dim() == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif attention_mask.dim() == 2:\n            # Provided a padding mask of dimensions [batch_size, seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]\n            if self.config.is_decoder:\n                batch_size, seq_length = input_shape\n                seq_ids = torch.arange(seq_length, device=device)\n                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]\n                # causal and attention masks must have same type with pytorch version < 1.3\n                causal_mask = causal_mask.to(attention_mask.dtype)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n        else:\n            raise ValueError(\n                \"Wrong shape for input_ids (shape {}) or attention_mask (shape {})\".format(\n                    input_shape, attention_mask.shape\n                )\n            )\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask: Tensor, num_hidden_layers: int, is_attention_chunked: bool = False) -> Tensor:\n        \"\"\"\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        attention_probs has shape bsz x n_heads x N x N\n        Arguments:\n            head_mask: torch.Tensor or None: has shape [num_heads] or [num_hidden_layers x num_heads]\n            num_hidden_layers: int\n        Returns:\n             Tensor of shape shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n             or list with [None] for each layer\n        \"\"\"\n        if head_mask is not None:\n            head_mask = 
self._convert_head_mask_to_5d(head_mask, num_hidden_layers)\n            if is_attention_chunked is True:\n                head_mask = head_mask.unsqueeze(-1)\n        else:\n            head_mask = [None] * num_hidden_layers\n\n        return head_mask\n\n    def _convert_head_mask_to_5d(self, head_mask, num_hidden_layers):\n        \"\"\"-> [num_hidden_layers x batch x num_heads x seq_length x seq_length]\"\"\"\n        if head_mask.dim() == 1:\n            head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)\n            head_mask = head_mask.expand(num_hidden_layers, -1, -1, -1, -1)\n        elif head_mask.dim() == 2:\n            head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer\n        assert head_mask.dim() == 5, f\"head_mask.dim != 5, instead {head_mask.dim()}\"\n        head_mask = head_mask.to(dtype=self.dtype)  # switch to fload if need + fp16 compatibility\n        return head_mask\n\n\nclass PreTrainedModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\" Base class for all models.\n\n        :class:`~transformers1.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to do a forward pass in the network.\n\n        Returns:\n            torch.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": torch.tensor(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__()\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. 
\"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    @property\n    def base_model(self):\n        return getattr(self, self.base_model_prefix, self)\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def set_input_embeddings(self, value: nn.Module):\n        \"\"\"\n        Set model's input embeddings\n\n        Args:\n            value (:obj:`nn.Module`):\n                A module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            base_model.set_input_embeddings(value)\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def tie_weights(self):\n        \"\"\"\n        Tie the weights between the input embeddings and the output embeddings.\n        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning\n        the weights instead.\n        \"\"\"\n        output_embeddings = self.get_output_embeddings()\n        if output_embeddings is not None:\n            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())\n\n    def _tie_or_clone_weights(self, output_embeddings, input_embeddings):\n        \"\"\" Tie or clone module weights depending of whether we are using TorchScript or not\n        \"\"\"\n        if self.config.torchscript:\n            output_embeddings.weight = nn.Parameter(input_embeddings.weight.clone())\n        else:\n            output_embeddings.weight = input_embeddings.weight\n\n        if getattr(output_embeddings, \"bias\", None) is not None:\n            output_embeddings.bias.data = torch.nn.functional.pad(\n                output_embeddings.bias.data,\n                (0, output_embeddings.weight.shape[0] - output_embeddings.bias.shape[0],),\n                \"constant\",\n                0,\n            )\n        if hasattr(output_embeddings, \"out_features\") and hasattr(input_embeddings, \"num_embeddings\"):\n            output_embeddings.out_features = input_embeddings.num_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. 
Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model.\n\n        Return: ``torch.nn.Embeddings``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed\n        model_embeds = base_model._resize_token_embeddings(new_num_tokens)\n        if new_num_tokens is None:\n            return model_embeds\n\n        # Update base model and current model config\n        self.config.vocab_size = new_num_tokens\n        base_model.vocab_size = new_num_tokens\n\n        # Tie weights again if needed\n        self.tie_weights()\n\n        return model_embeds\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.get_input_embeddings()\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.set_input_embeddings(new_embeddings)\n        return self.get_input_embeddings()\n\n    def _get_resized_embeddings(\n        self, old_embeddings: torch.nn.Embedding, new_num_tokens: Optional[int] = None\n    ) -> torch.nn.Embedding:\n        \"\"\" Build a resized Embedding Module from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            old_embeddings: ``torch.nn.Embedding``\n                Old embeddings to be resized.\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``torch.nn.Embedding``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        if new_num_tokens is None:\n            return old_embeddings\n\n        old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        if old_num_tokens == new_num_tokens:\n            return old_embeddings\n\n        # Build new embeddings\n        new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        new_embeddings.to(old_embeddings.weight.device)\n\n        # initialize all new embeddings (in particular added tokens)\n        self._init_weights(new_embeddings)\n\n        # Copy token embeddings from the previous weights\n        num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        return new_embeddings\n\n    def init_weights(self):\n        \"\"\" Initialize and prunes weights if needed. 
\"\"\"\n        # Initialize weights\n        self.apply(self._init_weights)\n\n        # Prune heads if needed\n        if self.config.pruned_heads:\n            self.prune_heads(self.config.pruned_heads)\n\n        # Tie weights if needed\n        self.tie_weights()\n\n    def prune_heads(self, heads_to_prune: Dict):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n                E.g. {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.\n        \"\"\"\n        # save new sets of pruned heads as union of previously stored pruned heads and newly pruned heads\n        for layer, heads in heads_to_prune.items():\n            union_heads = set(self.config.pruned_heads.get(layer, [])) | set(heads)\n            self.config.pruned_heads[layer] = list(union_heads)  # Unfortunately we have to store it as list for JSON\n\n        self.base_model._prune_heads(heads_to_prune)\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the `:func:`~transformers1.PreTrainedModel.from_pretrained`` class method.\n\n            Arguments:\n                save_directory: directory to which to save.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Only save the model itself if we are using distributed training\n        model_to_save = self.module if hasattr(self, \"module\") else self\n\n        # Attach architecture to the config\n        model_to_save.config.architectures = [model_to_save.__class__.__name__]\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, WEIGHTS_NAME)\n\n        if getattr(self.config, \"xla_device\", False):\n            import torch_xla.core.xla_model as xm\n\n            if xm.is_master_ordinal():\n                # Save configuration file\n                model_to_save.config.save_pretrained(save_directory)\n            # xm.save takes care of saving only from master\n            xm.save(model_to_save.state_dict(), output_model_file)\n        else:\n            model_to_save.config.save_pretrained(save_directory)\n            torch.save(model_to_save.state_dict(), output_model_file)\n\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained pytorch model from a pre-trained model configuration.\n\n        The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with ``model.train()``\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            
pretrained_model_name_or_path: either:\n              - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n              - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n              - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n              - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n              - None if you are both providing the configuration and state dictionary (resp. with keyword arguments ``config`` and ``state_dict``)\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n                    - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                    - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                    - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        state_dict = kwargs.pop(\"state_dict\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_tf = kwargs.pop(\"from_tf\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                proxies=proxies,\n                local_files_only=local_files_only,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")):\n                    # Load from a TF 1.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")\n                elif from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_tf` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + \".index\"],\n                            pretrained_model_name_or_path,\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                assert (\n                    from_tf\n                ), \"We found a TensorFlow checkpoint at {}, please set from_tf to True to load from 
this checkpoint\".format(\n                    pretrained_model_name_or_path + \".index\"\n                )\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(TF2_WEIGHTS_NAME if from_tf else WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n                    cache_dir=cache_dir,\n                    force_download=force_download,\n                    proxies=proxies,\n                    resume_download=resume_download,\n                    local_files_only=local_files_only,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {WEIGHTS_NAME}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if state_dict is None and not from_tf:\n            try:\n                state_dict = torch.load(resolved_archive_file, map_location=\"cpu\")\n            except Exception:\n                raise OSError(\n                    \"Unable to load weights from pytorch checkpoint file. \"\n                    \"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. \"\n                )\n\n        missing_keys = []\n        unexpected_keys = []\n        error_msgs = []\n\n        if from_tf:\n            if resolved_archive_file.endswith(\".index\"):\n                # Load from a TensorFlow 1.X checkpoint - provided by original authors\n                model = cls.load_tf_weights(model, config, resolved_archive_file[:-6])  # Remove the '.index'\n            else:\n                # Load from our TensorFlow 2.0 checkpoints\n                try:\n                    from transformers import load_tf2_checkpoint_in_pytorch_model\n\n                    model = load_tf2_checkpoint_in_pytorch_model(model, resolved_archive_file, allow_missing_keys=True)\n                except ImportError:\n                    logger.error(\n                        \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n                        \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n                    )\n                    raise\n        else:\n            # Convert old format to new format if needed from a PyTorch state_dict\n            old_keys = []\n            new_keys = []\n            for key in state_dict.keys():\n                new_key = None\n                if \"gamma\" in key:\n                    new_key = key.replace(\"gamma\", \"weight\")\n                if \"beta\" in key:\n                    new_key = key.replace(\"beta\", \"bias\")\n                if new_key:\n                    old_keys.append(key)\n                    new_keys.append(new_key)\n            for old_key, new_key in zip(old_keys, new_keys):\n                state_dict[new_key] = state_dict.pop(old_key)\n\n            # copy state_dict so _load_from_state_dict can modify it\n            metadata = getattr(state_dict, \"_metadata\", None)\n            state_dict = state_dict.copy()\n            if metadata is not None:\n                state_dict._metadata = metadata\n\n            # PyTorch's `_load_from_state_dict` does not copy parameters in a module's descendants\n            # so we need to apply the function recursively.\n            def load(module: nn.Module, prefix=\"\"):\n                local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})\n                module._load_from_state_dict(\n                    state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs,\n                )\n                for name, child in module._modules.items():\n                    if child is not None:\n                        load(child, prefix + name + \".\")\n\n            # Make sure we are able to load base models as well as derived models (with heads)\n            start_prefix = \"\"\n            model_to_load = model\n            has_prefix_module = any(s.startswith(cls.base_model_prefix) for s in state_dict.keys())\n            if not hasattr(model, cls.base_model_prefix) and has_prefix_module:\n                start_prefix = cls.base_model_prefix + \".\"\n            if hasattr(model, cls.base_model_prefix) and not has_prefix_module:\n                model_to_load = getattr(model, cls.base_model_prefix)\n\n            load(model_to_load, prefix=start_prefix)\n\n            if model.__class__.__name__ != model_to_load.__class__.__name__:\n                base_model_state_dict = model_to_load.state_dict().keys()\n                head_model_state_dict_without_base_prefix = [\n                    key.split(cls.base_model_prefix + \".\")[-1] for key in model.state_dict().keys()\n                ]\n\n                missing_keys.extend(head_model_state_dict_without_base_prefix - base_model_state_dict)\n\n            if len(missing_keys) > 0:\n                logger.info(\n                    \"Weights of {} not initialized from pretrained model: {}\".format(\n                        model.__class__.__name__, missing_keys\n                    )\n                )\n            if len(unexpected_keys) > 0:\n                logger.info(\n                    \"Weights from pretrained model not used in {}: {}\".format(\n                        model.__class__.__name__, unexpected_keys\n                    )\n                )\n            if len(error_msgs) > 0:\n                raise RuntimeError(\n                    \"Error(s) in loading state_dict for {}:\\n\\t{}\".format(\n                        
model.__class__.__name__, \"\\n\\t\".join(error_msgs)\n                    )\n                )\n        model.tie_weights()  # make sure token embedding weights are still tied if needed\n\n        # Set model in evaluation mode to deactivate DropOut modules by default\n        model.eval()\n\n        if output_loading_info:\n            loading_info = {\n                \"missing_keys\": missing_keys,\n                \"unexpected_keys\": unexpected_keys,\n                \"error_msgs\": error_msgs,\n            }\n            return model, loading_info\n\n        if hasattr(config, \"xla_device\") and config.xla_device:\n            import torch_xla.core.xla_model as xm\n\n            model = xm.send_cpu_data_to_device(model, xm.xla_device())\n            model.to(xm.xla_device())\n\n        return model\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        return {\"input_ids\": input_ids}\n\n    def prepare_logits_for_generation(self, logits, **kwargs):\n        return logits\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):\n        \"\"\"repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). \"\"\"\n        for i in range(batch_size * num_beams):\n            for previous_token in set(prev_output_tokens[i].tolist()):\n                # if score < 0 then repetition penalty has to multiplied to reduce the previous token probability\n                if lprobs[i, previous_token] < 0:\n                    lprobs[i, previous_token] *= repetition_penalty\n                else:\n                    lprobs[i, previous_token] /= repetition_penalty\n\n    @torch.no_grad()\n    def generate(\n        self,\n        input_ids: Optional[torch.LongTensor] = None,\n        max_length: Optional[int] = None,\n        min_length: Optional[int] = None,\n        do_sample: Optional[bool] = None,\n        early_stopping: Optional[bool] = None,\n        num_beams: Optional[int] = None,\n        temperature: Optional[float] = None,\n        top_k: Optional[int] = None,\n        top_p: Optional[float] = None,\n        repetition_penalty: Optional[float] = None,\n        bad_words_ids: Optional[Iterable[int]] = None,\n        bos_token_id: Optional[int] = None,\n        pad_token_id: Optional[int] = None,\n        eos_token_id: Optional[int] = None,\n        length_penalty: Optional[float] = None,\n        no_repeat_ngram_size: Optional[int] = None,\n        num_return_sequences: Optional[int] = None,\n        attention_mask: Optional[torch.LongTensor] = None,\n        decoder_start_token_id: Optional[int] = None,\n        use_cache: Optional[bool] = None,\n        **model_specific_kwargs\n    ) -> torch.LongTensor:\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. 
_`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `torch.LongTensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between `min_length` and infinity. Default to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.\n\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.\n\n            temperature: (`optional`) float\n                The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.\n\n            pad_token_id: (`optional`) int\n                Padding token. Default to specicic model pad_token_id or None if it does not exist.\n\n            bos_token_id: (`optional`) int\n                BOS token. Defaults to `bos_token_id` as defined in the models config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to `eos_token_id` as defined in the models config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Default to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. 
Default to 1.\n\n            attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.\n\n            model_specific_kwargs: (`optional`) dict\n                Additional model specific kwargs will be forwarded to the `forward` function of the model.\n\n        Return:\n\n            output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode 
input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated\n        \"\"\"\n\n        # We cannot generate if the model does not have a LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have a LM Head.\"\n                \"Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = input_ids.shape[0]  # overriden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, 
\"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), \"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictly positive.\"\n        assert (\n            isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0\n        ), \"`no_repeat_ngram_size` should be a positive integer.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictly positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = torch.full(\n                (batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,\n            )\n        else:\n            assert input_ids.dim() == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):\n            attention_mask = input_ids.ne(pad_token_id).long()\n        elif attention_mask is None:\n            attention_mask = input_ids.new_ones(input_ids.shape)\n\n        # set pad_token_id to eos_token_id if not set. Important that this is done after\n        # attention_mask is created\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        if hasattr(self.config, \"vocab_size\"):\n            vocab_size = self.config.vocab_size\n        elif (\n            self.config.is_encoder_decoder\n            and hasattr(self.config, \"decoder\")\n            and hasattr(self.config.decoder, \"vocab_size\")\n        ):\n            vocab_size = self.config.decoder.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = input_ids.shape[-1]\n            input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)\n            attention_mask = attention_mask.unsqueeze(1).expand(\n                batch_size, effective_batch_mult * num_beams, input_ids_len\n            )\n\n            input_ids = input_ids.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = attention_mask.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n            # create empty decoder_input_ids\n            input_ids = torch.full(\n                (effective_batch_size * num_beams, 1),\n                decoder_start_token_id,\n                dtype=torch.long,\n                device=next(self.parameters()).device,\n            )\n            cur_len = 
1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = (\n                torch.arange(batch_size)\n                .view(-1, 1)\n                .repeat(1, num_beams * effective_batch_mult)\n                .view(-1)\n                .to(input_ids.device)\n            )\n            # expand encoder_outputs\n            encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = input_ids.shape[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        
model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequence are generated independantly.\n        \"\"\"\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = input_ids.new(batch_size).fill_(1)\n        sent_lengths = input_ids.new(batch_size).fill_(max_length)\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(next_token_logits, batch_size, 1, input_ids, repetition_penalty)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                next_token_logits[:, eos_token_id] = -float(\"inf\")\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)\n                # Sample\n                probs = F.softmax(next_token_logits, dim=-1)\n                next_token = torch.multinomial(probs, num_samples=1).squeeze(1)\n            else:\n                # Greedy decoding\n                next_token = torch.argmax(next_token_logits, dim=-1)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)\n            cur_len = cur_len + 
1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()\n                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents.mul_((~eos_in_sents).long())\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if unfinished_sents.max() == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            decoded = input_ids.new(batch_size, sent_lengths.max().item()).fill_(pad_token_id)\n        else:\n            decoded = input_ids\n\n        for hypo_idx, hypo in enumerate(input_ids):\n            decoded[hypo_idx, : sent_lengths[hypo_idx]] = hypo[: sent_lengths[hypo_idx]]\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # scores for each sentence in the beam\n        beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores[:, 1:] = -1e9\n        beam_scores = beam_scores.view(-1)  # shape (batch_size * num_beams,)\n\n        # cache compute states\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n            outputs = 
self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(\n                    next_token_logits, batch_size, num_beams, input_ids, repetition_penalty,\n                )\n\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            if self.config.is_encoder_decoder and do_sample is False:\n                # TODO (PVP) still a bit hacky here - there might be a better solution\n                next_token_logits = self.prepare_logits_for_generation(\n                    next_token_logits, cur_len=cur_len, max_length=max_length\n                )\n\n            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                scores[:, eos_token_id] = -float(\"inf\")\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                num_batch_hypotheses = batch_size * num_beams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_batch_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                for i, banned_tokens in enumerate(banned_batch_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for i, banned_tokens in enumerate(banned_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            assert scores.shape == (batch_size * num_beams, vocab_size), \"Shapes of scores: {} != {}\".format(\n                scores.shape, (batch_size * num_beams, vocab_size)\n            )\n\n            if do_sample:\n                _scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n                # Top-p/top-k filtering\n                _scores = top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # re-organize to group the beam together to sample from all beam_idxs\n                _scores = _scores.contiguous().view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                probs = F.softmax(_scores, dim=-1)\n                next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)  # (batch_size, num_beams * 2)\n                # Compute next scores\n                
next_scores = torch.gather(_scores, -1, next_tokens)  # (batch_size, num_beams * 2)\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)\n                next_tokens = torch.gather(next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)\n\n            else:\n                next_scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                next_scores = next_scores.view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)\n\n            assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.item() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            input_ids[effective_beam_id].clone(), beam_token_score.item(),\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n 
                   next_scores[batch_idx].max().item(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = beam_scores.new([x[0] for x in next_batch_beam])\n            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])\n            beam_idx = input_ids.new([x[2] for x in next_batch_beam])\n\n            # re-order batch and update current length\n            input_ids = input_ids[beam_idx, :]\n            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            if done[batch_idx]:\n                continue\n\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert torch.all(\n                    next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths = input_ids.new(output_batch_size)\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                effective_batch_idx = output_num_return_sequences_per_batch * i + j\n                best_hyp = sorted_hyps.pop()[1]\n                
sent_lengths[effective_batch_idx] = len(best_hyp)\n                best.append(best_hyp)\n\n        # shorter batches are filled with pad_token\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined\"\n            sent_max_len = min(sent_lengths.max().item() + 1, max_length)\n            decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                decoded[i, : sent_lengths[i]] = hypo\n                if sent_lengths[i] < max_length:\n                    decoded[i, sent_lengths[i]] = eos_token_id\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:\n        return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:\n    \"\"\"Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())\n        return generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            
banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef top_k_top_p_filtering(\n    logits: Tensor,\n    top_k: int = 0,\n    top_p: float = 1.0,\n    filter_value: float = -float(\"Inf\"),\n    min_tokens_to_keep: int = 1,\n) -> Tensor:\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]\n        logits[indices_to_remove] = filter_value\n\n    if top_p < 1.0:\n        sorted_logits, sorted_indices = torch.sort(logits, descending=True)\n        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()\n        sorted_indices_to_remove[..., 0] = 0\n\n        # scatter sorted tensors to original indexing\n        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)\n        logits[indices_to_remove] = filter_value\n    return logits\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, 
best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass Conv1D(nn.Module):\n    def __init__(self, nf, nx):\n        \"\"\" Conv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__()\n        self.nf = nf\n        w = torch.empty(nx, nf)\n        nn.init.normal_(w, std=0.02)\n        self.weight = nn.Parameter(w)\n        self.bias = nn.Parameter(torch.zeros(nf))\n\n    def forward(self, x):\n        size_out = x.size()[:-1] + (self.nf,)\n        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)\n        x = x.view(*size_out)\n        return x\n\n\nclass PoolerStartLogits(nn.Module):\n    \"\"\" Compute SQuAD start_logits from sequence hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, p_mask=None):\n        \"\"\" Args:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape `(batch_size, seq_len)`\n                invalid position mask such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        x = self.dense(hidden_states).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerEndLogits(nn.Module):\n    \"\"\" Compute SQuAD end_logits from sequence hidden states and start token hidden state.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense_1 = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, p_mask=None):\n        \"\"\" Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to hidden_states\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n                Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of 
start_states, start_positions should be not None\"\n        if start_positions is not None:\n            slen, hsz = hidden_states.shape[-2:]\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions)  # shape (bsz, 1, hsz)\n            start_states = start_states.expand(-1, slen, -1)  # shape (bsz, slen, hsz)\n\n        x = self.dense_0(torch.cat([hidden_states, start_states], dim=-1))\n        x = self.activation(x)\n        x = self.LayerNorm(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerAnswerClass(nn.Module):\n    \"\"\" Compute SQuAD 2.0 answer class from classification and start tokens hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.dense_1 = nn.Linear(config.hidden_size, 1, bias=False)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, cls_index=None):\n        \"\"\"\n        Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to ``hidden_states``.\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span.\n            **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n                position of the CLS token. 
If None, take the last token.\n\n            note(Original repo):\n                no dependency on end_feature so that we can obtain one single `cls_logits`\n                for each sample\n        \"\"\"\n        hsz = hidden_states.shape[-1]\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of start_states, start_positions should be not None\"\n        if start_positions is not None:\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions).squeeze(-2)  # shape (bsz, hsz)\n\n        if cls_index is not None:\n            cls_index = cls_index[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            cls_token_state = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, hsz)\n        else:\n            cls_token_state = hidden_states[:, -1, :]  # shape (bsz, hsz)\n\n        x = self.dense_0(torch.cat([start_states, cls_token_state], dim=-1))\n        x = self.activation(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        return x\n\n\nclass SQuADHead(nn.Module):\n    r\"\"\" A SQuAD head inspired by XLNet.\n\n    Parameters:\n        config (:class:`~transformers.XLNetConfig`): Model configuration class with all the parameters of the model.\n\n    Inputs:\n        **hidden_states**: ``torch.FloatTensor`` of shape ``(batch_size, seq_len, hidden_size)``\n            hidden states of sequence tokens\n        **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the first token for the labeled span.\n        **end_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the last token for the labeled span.\n        **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n            position of the CLS token. 
If None, take the last token.\n        **is_impossible**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            Whether the question has a possible answer in the paragraph or not.\n        **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n            Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n            1.0 means token should be masked.\n\n    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n        **loss**: (`optional`, returned if both ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size,)``\n            Log probabilities for the ``is_impossible`` label of the answers.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n    def forward(\n        self, hidden_states, start_positions=None, end_positions=None, cls_index=None, is_impossible=None, p_mask=None,\n    ):\n        outputs = ()\n\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = 
loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n                total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)\n            cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits,) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n\n\nclass SequenceSummary(nn.Module):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to 
hidden_size). Default: False.\n            summary_activation: 'tanh' or another string => add an activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config: PretrainedConfig):\n        super().__init__()\n\n        self.summary_type = getattr(config, \"summary_type\", \"last\")\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.summary = Identity()\n        if hasattr(config, \"summary_use_proj\") and config.summary_use_proj:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = nn.Linear(config.hidden_size, num_classes)\n\n        activation_string = getattr(config, \"summary_activation\", None)\n        self.activation: Callable = (get_activation(activation_string) if activation_string else Identity())\n\n        self.first_dropout = Identity()\n        if hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0:\n            self.first_dropout = nn.Dropout(config.summary_first_dropout)\n\n        self.last_dropout = Identity()\n        if hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0:\n            self.last_dropout = nn.Dropout(config.summary_last_dropout)\n\n    def forward(self, hidden_states, cls_index=None):\n        \"\"\" hidden_states: float Tensor in shape [bsz, ..., seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... 
are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = hidden_states.mean(dim=1)\n        elif self.summary_type == \"cls_index\":\n            if cls_index is None:\n                cls_index = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2] - 1, dtype=torch.long,)\n            else:\n                cls_index = cls_index.unsqueeze(-1).unsqueeze(-1)\n                cls_index = cls_index.expand((-1,) * (cls_index.dim() - 1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, XX, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        output = self.first_dropout(output)\n        output = self.summary(output)\n        output = self.activation(output)\n        output = self.last_dropout(output)\n\n        return output\n\n\ndef create_position_ids_from_input_ids(input_ids, padding_idx):\n    \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n    padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n    `utils.make_positions`.\n\n    :param torch.Tensor x:\n    :return torch.Tensor:\n    \"\"\"\n    # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.\n    mask = input_ids.ne(padding_idx).int()\n    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask\n    return incremental_indices.long() + padding_idx\n\n\ndef prune_linear_layer(layer, index, dim=0):\n    \"\"\" Prune a linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if layer.bias is not None:\n        if dim == 1:\n            b = layer.bias.clone().detach()\n        else:\n            b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    if layer.bias is not None:\n        new_layer.bias.requires_grad = False\n        new_layer.bias.copy_(b.contiguous())\n        new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_conv1d_layer(layer, index, dim=1):\n    \"\"\" Prune a Conv1D layer (a model parameters) to keep only entries in index.\n        A Conv1D work as a Linear layer (see e.g. 
BERT) but the weights are transposed.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if dim == 0:\n        b = layer.bias.clone().detach()\n    else:\n        b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = Conv1D(new_size[1], new_size[0]).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    new_layer.bias.requires_grad = False\n    new_layer.bias.copy_(b.contiguous())\n    new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_layer(layer, index, dim=None):\n    \"\"\" Prune a Conv1D or nn.Linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    if isinstance(layer, nn.Linear):\n        return prune_linear_layer(layer, index, dim=0 if dim is None else dim)\n    elif isinstance(layer, Conv1D):\n        return prune_conv1d_layer(layer, index, dim=1 if dim is None else dim)\n    else:\n        raise ValueError(\"Can't prune layer of class {}\".format(layer.__class__))\n\n\ndef apply_chunking_to_forward(\n    chunk_size: int, chunk_dim: int, forward_fn: Callable[..., torch.Tensor], *input_tensors\n) -> torch.Tensor:\n    \"\"\"\n    This function chunks the `input_tensors` into smaller input tensor parts of size `chunk_size` over the dimension `chunk_dim`.\n    It then applies a layer `forward_fn` to each chunk independently to save memory.\n    If the `forward_fn` is independent across the `chunk_dim` this function will yield the\n    same result as not applying it.\n\n    Args:\n        chunk_size: int - the chunk size of a chunked tensor. 
`num_chunks` = `len(input_tensors[0]) / chunk_size`\n        chunk_dim: int - the dimension over which the input_tensors should be chunked\n        forward_fn: fn - the forward fn of the model\n        input_tensors: tuple(torch.Tensor) - the input tensors of `forward_fn` which are chunked\n    Returns:\n        a Tensor with the same shape the foward_fn would have given if applied\n\n\n    Examples::\n\n        # rename the usual forward() fn to forward_chunk()\n        def forward_chunk(self, hidden_states):\n            hidden_states = self.decoder(hidden_states)\n            return hidden_states\n\n        # implement a chunked forward function\n        def forward(self, hidden_states):\n            return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n    \"\"\"\n\n    assert len(input_tensors) > 0, \"{} has to be a tuple/list of tensors\".format(input_tensors)\n    tensor_shape = input_tensors[0].shape\n    assert all(\n        input_tensor.shape == tensor_shape for input_tensor in input_tensors\n    ), \"All input tenors have to be of the same shape\"\n\n    # inspect.signature exist since python 3.5 and is a python method -> no problem with backward compability\n    num_args_in_forward_chunk_fn = len(inspect.signature(forward_fn).parameters)\n    assert num_args_in_forward_chunk_fn == len(\n        input_tensors\n    ), \"forward_chunk_fn expects {} arguments, but only {} input tensors are given\".format(\n        num_args_in_forward_chunk_fn, len(input_tensors)\n    )\n\n    if chunk_size > 0:\n        assert (\n            input_tensors[0].shape[chunk_dim] % chunk_size == 0\n        ), \"The dimension to be chunked {} has to be a multiple of the chunk size {}\".format(\n            input_tensors[0][chunk_dim], chunk_size\n        )\n\n        num_chunks = input_tensors[0].shape[chunk_dim] // chunk_size\n\n        # chunk input tensor into tuples\n        input_tensors_chunks = tuple(input_tensor.chunk(num_chunks, dim=chunk_dim) for input_tensor in input_tensors)\n        # apply forward fn to every tuple\n        output_chunks = tuple(forward_fn(*input_tensors_chunk) for input_tensors_chunk in zip(*input_tensors_chunks))\n        # concatenate output at same dimension\n        return torch.cat(output_chunks, dim=chunk_dim)\n\n    return forward_fn(*input_tensors)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, SequenceSummary, SQuADHead, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    alen = torch.arange(slen, dtype=torch.long, device=lengths.device)\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        assert lengths.max().item() <= slen\n        mask = alen < lengths[:, None]\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    bs = lengths.size(0)\n    if causal:\n        attn_mask = alen[None, None, :].repeat(bs, slen, 1) <= alen[None, :, None]\n    else:\n        attn_mask = mask\n\n    # sanity check\n    assert mask.size() == (bs, slen)\n    assert causal is False or attn_mask.size() == (bs, slen, slen)\n\n    return mask, attn_mask\n\n\nclass MultiHeadAttention(nn.Module):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config):\n        super().__init__()\n        self.layer_id = next(MultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        self.dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(dim, dim)\n        self.k_lin = nn.Linear(dim, dim)\n        self.v_lin = nn.Linear(dim, dim)\n        self.out_lin = nn.Linear(dim, dim)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, 
attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input, mask, kv=None, cache=None, head_mask=None):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = input.size()\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = kv.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if mask.dim() == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)\n        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, qlen, klen)\n\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # 
(bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TransformerFFN(nn.Module):\n    def __init__(self, in_dim, dim_hidden, out_dim, config):\n        super().__init__()\n        self.dropout = config.dropout\n        self.lin1 = nn.Linear(in_dim, dim_hidden)\n        self.lin2 = nn.Linear(dim_hidden, out_dim)\n        self.act = gelu if config.gelu_activation else F.relu\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        return x\n\n\nclass XLMPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    load_tf_weights = None\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    @property\n    def dummy_inputs(self):\n        inputs_list = torch.tensor([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights. \"\"\"\n        if isinstance(module, nn.Embedding):\n            if self.config is not None and self.config.embed_init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.embed_init_std)\n        if isinstance(module, nn.Linear):\n            if self.config is not None and self.config.init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.init_std)\n                if hasattr(module, \"bias\") and module.bias is not None:\n                    nn.init.constant_(module.bias, 0.0)\n        if isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        langs (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass XLMModel(XLMPreTrainedModel):\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        self.dropout = config.dropout\n        self.attention_dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, self.dim)\n        if config.sinusoidal_embeddings:\n            create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = nn.Embedding(self.n_langs, self.dim)\n        self.embeddings = nn.Embedding(self.n_words, self.dim, padding_idx=self.pad_index)\n        self.layer_norm_emb = nn.LayerNorm(self.dim, eps=config.layer_norm_eps)\n\n        # transformer layers\n        self.attentions = nn.ModuleList()\n        self.layer_norm1 = nn.ModuleList()\n        self.ffns = nn.ModuleList()\n        self.layer_norm2 = nn.ModuleList()\n        # if self.is_decoder:\n        #     self.layer_norm15 = nn.ModuleList()\n        #     
self.encoder_attn = nn.ModuleList()\n\n        for _ in range(self.n_layers):\n            self.attentions.append(MultiHeadAttention(self.n_heads, self.dim, config=config))\n            self.layer_norm1.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(TransformerFFN(self.dim, self.hidden_dim, self.dim, config=config))\n            self.layer_norm2.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.attentions[layer].prune_heads(heads)\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = F.dropout(attn, p=self.dropout, training=self.training)\n            tensor = tensor + attn\n            
tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass XLMPredLayer(nn.Module):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        dim = config.emb_dim\n\n        if config.asm is False:\n            self.proj = nn.Linear(dim, config.n_words, bias=True)\n        else:\n            self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n                in_features=dim,\n                n_classes=config.n_words,\n                cutoffs=config.asm_cutoffs,\n                div_value=config.asm_div_value,\n                head_bias=True,  # default is False\n            )\n\n    def forward(self, x, y=None):\n        \"\"\" Compute the loss, and optionally the scores.\n        \"\"\"\n        outputs = ()\n        if self.asm is False:\n            scores = self.proj(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                loss = F.cross_entropy(scores.view(-1, self.n_words), y.view(-1), reduction=\"elementwise_mean\")\n                outputs = (loss,) + outputs\n        else:\n            scores = self.proj.log_prob(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                _, loss = self.proj(x, y)\n                outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMWithLMHeadModel(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = XLMModel(config)\n        self.pred_layer = XLMPredLayer(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.pred_layer.proj\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = input_ids.shape[0]\n        mask_token = torch.full((effective_batch_size, 1), mask_token_id, dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, mask_token], dim=1)\n        if lang_id is not None:\n            langs = torch.full_like(input_ids, lang_id)\n        else:\n            langs = None\n        return {\"input_ids\": input_ids, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMWithLMHeadModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0) 
 # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output, labels)\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForSequenceClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.sequence_summary = SequenceSummary(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import XLMTokenizer, XLMForSequenceClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        logits = self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnsweringSimple(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = 
torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (\n            start_logits,\n            end_logits,\n        )\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnswering(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = SQuADHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnswering\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            
position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n\n        outputs = self.qa_outputs(\n            output,\n            start_positions=start_positions,\n            end_positions=end_positions,\n            cls_index=cls_index,\n            is_impossible=is_impossible,\n            p_mask=p_mask,\n        )\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForTokenClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForTokenClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-100-1280')\n        model = XLMForTokenClassification.from_pretrained('xlm-mlm-100-1280')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n 
       outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-roberta-base\",\n    \"xlm-roberta-large\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\",\n    \"xlm-roberta-large-finetuned-conll03-english\",\n    \"xlm-roberta-large-finetuned-conll03-german\",\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/modeling_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu_new, swish\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PoolerAnswerClass, PoolerEndLogits, PoolerStartLogits, PreTrainedModel, SequenceSummary\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef build_tf_xlnet_to_pytorch_map(model, config, tf_weights=None):\n    \"\"\" A map of modules from TF to PyTorch.\n        I use a map to keep the PyTorch model as\n        identical to the original PyTorch model as possible.\n    \"\"\"\n\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        if hasattr(model, \"lm_loss\"):\n            # We will load also the output bias\n            tf_to_pt_map[\"model/lm_loss/bias\"] = model.lm_loss.bias\n        if hasattr(model, \"sequence_summary\") and \"model/sequnece_summary/summary/kernel\" in tf_weights:\n            # We will load also the sequence summary\n            tf_to_pt_map[\"model/sequnece_summary/summary/kernel\"] = model.sequence_summary.summary.weight\n            tf_to_pt_map[\"model/sequnece_summary/summary/bias\"] = model.sequence_summary.summary.bias\n        if (\n            hasattr(model, \"logits_proj\")\n            and config.finetuning_task is not None\n            and \"model/regression_{}/logit/kernel\".format(config.finetuning_task) in tf_weights\n        ):\n            tf_to_pt_map[\"model/regression_{}/logit/kernel\".format(config.finetuning_task)] = model.logits_proj.weight\n            tf_to_pt_map[\"model/regression_{}/logit/bias\".format(config.finetuning_task)] = model.logits_proj.bias\n\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings and output\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/word_embedding/lookup_table\": model.word_embedding.weight,\n            \"model/transformer/mask_emb/mask_emb\": model.mask_emb,\n        }\n    )\n\n    # Transformer blocks\n    for i, b in enumerate(model.layer):\n        layer_str = \"model/transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.rel_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": b.rel_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.rel_attn.o,\n                layer_str + 
\"rel_attn/q/kernel\": b.rel_attn.q,\n                layer_str + \"rel_attn/k/kernel\": b.rel_attn.k,\n                layer_str + \"rel_attn/r/kernel\": b.rel_attn.r,\n                layer_str + \"rel_attn/v/kernel\": b.rel_attn.v,\n                layer_str + \"ff/LayerNorm/gamma\": b.ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.ff.layer_1.weight,\n                layer_str + \"ff/layer_1/bias\": b.ff.layer_1.bias,\n                layer_str + \"ff/layer_2/kernel\": b.ff.layer_2.weight,\n                layer_str + \"ff/layer_2/bias\": b.ff.layer_2.bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        r_s_list = []\n        seg_embed_list = []\n        for b in model.layer:\n            r_r_list.append(b.rel_attn.r_r_bias)\n            r_w_list.append(b.rel_attn.r_w_bias)\n            r_s_list.append(b.rel_attn.r_s_bias)\n            seg_embed_list.append(b.rel_attn.seg_embed)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n        r_s_list = [model.r_s_bias]\n        seg_embed_list = [model.seg_embed]\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/r_r_bias\": r_r_list,\n            \"model/transformer/r_w_bias\": r_w_list,\n            \"model/transformer/r_s_bias\": r_s_list,\n            \"model/transformer/seg_embed\": seg_embed_list,\n        }\n    )\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_xlnet(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_xlnet_to_pytorch_map(model, config, tf_weights)\n\n    for name, pointer in tf_to_pt_map.items():\n        logger.info(\"Importing {}\".format(name))\n        if name not in tf_weights:\n            logger.info(\"{} not in tf pre-trained weights, skipping\".format(name))\n            continue\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name and (\"ff\" in name or \"summary\" in name or \"logit\" in name):\n            logger.info(\"Transposing\")\n            array = np.transpose(array)\n        if isinstance(pointer, list):\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nACT2FN = {\"gelu\": gelu_new, \"relu\": torch.nn.functional.relu, \"swish\": swish}\n\n\nXLNetLayerNorm = nn.LayerNorm\n\n\nclass XLNetRelativeAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n\n        self.q = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.k = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.v = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.o = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.r = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n\n        self.r_r_bias = 
nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_s_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.seg_embed = nn.Parameter(torch.FloatTensor(2, self.n_head, self.d_head))\n\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def rel_shift(x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = x.shape\n\n        x = x.reshape(x_size[1], x_size[0], x_size[2], x_size[3])\n        x = x[1:, ...]\n        x = x.reshape(x_size[0], x_size[1] - 1, x_size[2], x_size[3])\n        # x = x[:, 0:klen, :, :]\n        x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    @staticmethod\n    def rel_shift_bnij(x, klen=-1):\n        x_size = x.shape\n\n        x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])\n        x = x[:, :, 1:, :]\n        x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3] - 1)\n        # Note: the tensor-slice form was faster in my testing than torch.index_select\n        #       However, tracing doesn't like the nature of the slice, and if klen changes\n        #       during the run then it'll fail, whereas index_select will be fine.\n        x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))\n        # x = x[:, :, :, :klen]\n\n        return x\n\n    def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        # content based attention score\n        ac = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift_bnij(bd, klen=ac.shape[3])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = torch.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = torch.einsum(\"ijbs,ibns->bnij\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == torch.float16:\n                attn_score = attn_score - 65500 * torch.einsum(\"ijbn->bnij\", attn_mask)\n            else:\n                attn_score = attn_score - 1e30 * torch.einsum(\"ijbn->bnij\", attn_mask)\n\n        # attention probability\n        attn_prob = F.softmax(attn_score, dim=3)\n        attn_prob = self.dropout(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * torch.einsum(\"ijbn->bnij\", head_mask)\n\n        # attention output\n        attn_vec = torch.einsum(\"bnij,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, torch.einsum(\"bnij->ijbn\", attn_prob)\n\n        return attn_vec\n\n    def post_attention(self, h, attn_vec, residual=True):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to 
`d_model`)\n        attn_out = torch.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out)\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def forward(self, h, g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None):\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec_h)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = torch.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = torch.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = torch.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention(g, attn_vec_g)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n              
  attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + (attn_prob,)\n        return outputs\n\n\nclass XLNetFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.layer_1 = nn.Linear(config.d_model, config.d_inner)\n        self.layer_2 = nn.Linear(config.d_inner, config.d_model)\n        self.dropout = nn.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def forward(self, inp):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output)\n        output = self.layer_2(output)\n        output = self.dropout(output)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass XLNetLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.rel_attn = XLNetRelativeAttention(config)\n        self.ff = XLNetFeedForward(config)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(\n        self, output_h, output_g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None\n    ):\n        outputs = self.rel_attn(\n            output_h,\n            output_g,\n            attn_mask_h,\n            attn_mask_g,\n            r,\n            seg_mat,\n            mems=mems,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n        )\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g)\n        output_h = self.ff(output_h)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass XLNetPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    load_tf_weights = load_tf_weights_in_xlnet\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, nn.Linear) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, XLNetLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        elif isinstance(module, XLNetRelativeAttention):\n            for param in [\n                module.q,\n                module.k,\n                module.v,\n                module.o,\n                module.r,\n                module.r_r_bias,\n                module.r_s_bias,\n                module.r_w_bias,\n                module.seg_embed,\n            ]:\n                param.data.normal_(mean=0.0, 
std=self.config.initializer_range)\n        elif isinstance(module, XLNetModel):\n            module.mask_emb.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n            `use_cache` has to be set to `True` to make use of `mems`.\n        perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token. The classifier token should be represented by a ``2``.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n\n        self.word_embedding = nn.Embedding(config.vocab_size, config.d_model)\n        self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model))\n        self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)])\n        self.dropout = nn.Dropout(config.dropout)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_embedding = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: Sequence length\n            mlen: Mask length\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = torch.ones([qlen, qlen])\n        mask_up = torch.triu(attn_mask, diagonal=1)\n        attn_mask_pad = torch.zeros([qlen, mlen])\n        ret = torch.cat([attn_mask_pad, mask_up], dim=1)\n        if self.same_length:\n            mask_lo = torch.tril(attn_mask, diagonal=-1)\n            ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)\n\n        ret = ret.to(self.device)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        # cache hidden states into memory.\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len :]\n\n        return new_mem.detach()\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = torch.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = torch.cat([torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)], dim=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = pos_emb.expand(-1, bsz, -1)\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None):\n        # create relative positional encoding.\n        freq_seq = torch.arange(0, self.d_model, 2.0, dtype=torch.float)\n        inv_freq = 1 / torch.pow(10000, (freq_seq / 
self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` {}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = torch.arange(beg, end, -1.0, dtype=torch.float)\n            bwd_pos_seq = torch.arange(-beg, -end, 1.0, dtype=torch.float)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n                bwd_pos_seq = bwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)\n        else:\n            fwd_pos_seq = torch.arange(beg, end, -1.0)\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        pos_emb = pos_emb.to(self.device)\n        return pos_emb\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetModel.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=False)).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.shape[0], input_ids.shape[1]\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = token_type_ids.transpose(0, 1).contiguous() if token_type_ids is not None else None\n        input_mask = input_mask.transpose(0, 1).contiguous() if input_mask is not None else None\n        attention_mask = attention_mask.transpose(0, 1).contiguous() if attention_mask is not None else None\n        perm_mask = perm_mask.permute(1, 2, 0).contiguous() if perm_mask is not None else None\n        target_mapping = target_mapping.permute(1, 2, 0).contiguous() if target_mapping is not None else None\n\n        mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = self.dtype\n        device = self.device\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        
assert input_mask is None or attention_mask is None, \"You can only use one of input_mask (uses 1 for padding) \"\n        \"or attention_mask (uses 0 for padding, added for compatbility with BERT). Please choose one.\"\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - attention_mask\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = torch.zeros([data_mask.shape[0], mlen, bsz]).to(data_mask)\n                data_mask = torch.cat([mems_mask, data_mask], dim=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = (attn_mask > 0).to(dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -torch.eye(qlen).to(attn_mask)\n            if mlen > 0:\n                non_tgt_mask = torch.cat([torch.zeros([qlen, mlen]).to(attn_mask), non_tgt_mask], dim=-1)\n            non_tgt_mask = ((attn_mask + non_tgt_mask[:, :, None, None]) > 0).to(attn_mask)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k)\n        if target_mapping is not None:\n            word_emb_q = self.mask_emb.expand(target_mapping.shape[0], bsz, -1)\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = torch.zeros([mlen, bsz], dtype=torch.long, device=device)\n                cat_ids = torch.cat([mem_pad, token_type_ids], dim=0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = (token_type_ids[:, None] != cat_ids[None, :]).long()\n            seg_mat = F.one_hot(seg_mat, num_classes=2).to(dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz)\n        pos_emb = self.dropout(pos_emb)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = 
head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n        if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                # cache new mems\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                output_h,\n                output_g,\n                attn_mask_h=non_tgt_mask,\n                attn_mask_g=attn_mask,\n                r=pos_emb,\n                seg_mat=seg_mat,\n                mems=mems[i],\n                target_mapping=target_mapping,\n                head_mask=head_mask[i],\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (output.permute(1, 0, 2).contiguous(),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(hs.permute(1, 0, 2).contiguous() for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            if target_mapping is not None:\n                # when target_mapping is provided, there are 2-tuple of attentions\n                attentions = tuple(\n                    tuple(att_stream.permute(2, 3, 0, 1).contiguous() for att_stream in t) for t in attentions\n                )\n            else:\n                attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetLMHeadModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.attn_type = config.attn_type\n        self.same_length = config.same_length\n\n        self.transformer = XLNetModel(config)\n        self.lm_loss = nn.Linear(config.d_model, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_loss\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = input_ids.shape[0]\n        dummy_token = torch.zeros((effective_batch_size, 1), dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = input_ids.shape[1]\n        perm_mask = torch.zeros(\n            (effective_batch_size, sequence_length, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        perm_mask[:, :, -1] = 1.0\n\n        # We'll only predict the last token\n        target_mapping = torch.zeros(\n            (effective_batch_size, 1, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        target_mapping[0, 0, -1] = 1.0\n\n        inputs = {\n            \"input_ids\": input_ids,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`, defaults to :obj:`None`):\n            Labels for masked language modeling.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n            The labels should correspond to the masked input words that should be predicted and depends on `target_mapping`. 
Note in order to perform standard auto-regressive language modeling a `<mask>` token has to be added to the `input_ids` (see `prepare_inputs_for_generation` fn and examples below)\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored, the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetLMHeadModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        # The same way can the XLNetLMHeadModel be used to be 
trained by standard auto-regressive language modeling.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        labels = torch.tensor(tokenizer.encode(\"cute\", add_special_tokens=False)).unsqueeze(0)\n        assert labels.shape[0] == 1, 'only one word will be predicted'\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token as is done in standard auto-regressive lm training\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)\n        loss, next_token_logits = outputs[:2]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        logits = self.lm_loss(transformer_outputs[0])\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForSequenceClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForSequenceClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForTokenClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForTokenClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RACE/SWAG tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForMultipleChoice(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        input_mask=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForMultipleChoice\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForMultipleChoice.from_pretrained('xlnet-base-cased')\n\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_input_mask = input_mask.view(-1, input_mask.size(-1)) if input_mask is not None else None\n\n        transformer_outputs = self.transformer(\n            flat_input_ids,\n            token_type_ids=flat_token_type_ids,\n            input_mask=flat_input_mask,\n            attention_mask=flat_attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n        reshaped_logits = logits.view(-1, num_choices)\n        outputs = (reshaped_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to 
compute `span start logits` and `span end logits`). \"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnswering(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.transformer = XLNetModel(config)\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnswering\n        import torch\n\n        tokenizer =  XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnswering.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n    
            total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\n                \"blh,bl->bh\", hidden_states, start_log_probs\n            )  # get the representation of START as weighted sum of hidden states\n            cls_logits = self.answer_class(\n                hidden_states, start_states=start_states, cls_index=cls_index\n            )  # Shape (batch size,): one single `cls_logits` for each sample\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/optimization.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch optimization for BERT model.\"\"\"\n\nimport logging\nimport math\n\nimport torch\nfrom torch.optim import Optimizer\nfrom torch.optim.lr_scheduler import LambdaLR\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_constant_schedule(optimizer, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate.\n    \"\"\"\n    return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch)\n\n\ndef get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate preceded by a warmup\n    period during which the learning rate increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1.0, num_warmup_steps))\n        return 1.0\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)\n\n\ndef get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases linearly after\n    linearly increasing during a warmup period.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        return max(\n            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))\n        )\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function between 0 and `pi * cycles` after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_with_hard_restarts_schedule_with_warmup(\n    optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1\n):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function with several hard restarts, after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = 
float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        if progress >= 1.0:\n            return 0.0\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((float(num_cycles) * progress) % 1.0))))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\nclass AdamW(Optimizer):\n    \"\"\" Implements Adam algorithm with weight decay fix.\n\n    Parameters:\n        lr (float): learning rate. Default 1e-3.\n        betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999)\n        eps (float): Adams epsilon. Default: 1e-6\n        weight_decay (float): Weight decay. Default: 0.0\n        correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.\n    \"\"\"\n\n    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True):\n        if lr < 0.0:\n            raise ValueError(\"Invalid learning rate: {} - should be >= 0.0\".format(lr))\n        if not 0.0 <= betas[0] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[0]))\n        if not 0.0 <= betas[1] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[1]))\n        if not 0.0 <= eps:\n            raise ValueError(\"Invalid epsilon value: {} - should be >= 0.0\".format(eps))\n        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)\n        super().__init__(params, defaults)\n\n    def step(self, closure=None):\n        \"\"\"Performs a single optimization step.\n\n        Arguments:\n            closure (callable, optional): A closure that reevaluates the model\n                and returns the loss.\n        \"\"\"\n        loss = None\n        if closure is not None:\n            loss = closure()\n\n        for group in self.param_groups:\n            for p in group[\"params\"]:\n                if p.grad is None:\n                    continue\n                grad = p.grad.data\n                if grad.is_sparse:\n                    raise RuntimeError(\"Adam does not support sparse gradients, please consider SparseAdam instead\")\n\n                state = self.state[p]\n\n                # State initialization\n                if len(state) == 0:\n                    state[\"step\"] = 0\n                    # Exponential moving average of gradient values\n                    state[\"exp_avg\"] = torch.zeros_like(p.data)\n                    # Exponential moving average of squared gradient values\n                    state[\"exp_avg_sq\"] = torch.zeros_like(p.data)\n\n                exp_avg, exp_avg_sq = state[\"exp_avg\"], state[\"exp_avg_sq\"]\n                beta1, beta2 = group[\"betas\"]\n\n                state[\"step\"] += 1\n\n                # Decay the first and second moment running average coefficient\n                # In-place operations to update the averages at the same time\n                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)\n                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)\n                denom = exp_avg_sq.sqrt().add_(group[\"eps\"])\n\n                step_size = group[\"lr\"]\n                if group[\"correct_bias\"]:  # No bias correction for Bert\n                    bias_correction1 = 1.0 - beta1 ** state[\"step\"]\n                    bias_correction2 = 1.0 - beta2 ** state[\"step\"]\n                    step_size = 
step_size * math.sqrt(bias_correction2) / bias_correction1\n\n                p.data.addcdiv_(exp_avg, denom, value=-step_size)\n\n                # Just adding the square of the weights to the loss function is *not*\n                # the correct way of using L2 regularization/weight decay with Adam,\n                # since that will interact with the m and v parameters in strange ways.\n                #\n                # Instead we want to decay the weights in a manner that doesn't interact\n                # with the m/v parameters. This is equivalent to adding the square\n                # of the weights to the loss with plain (non-momentum) SGD.\n                # Add weight decay at the end (fixed version)\n                if group[\"weight_decay\"] > 0.0:\n                    p.data.add_(p.data, alpha=-group[\"lr\"] * group[\"weight_decay\"])\n\n        return loss\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/optimization_tf.py",
    "content": "# Copyright 2019 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"Functions and classes related to optimization (weight updates).\"\"\"\n\n\nimport re\n\nimport tensorflow as tf\n\n\nclass WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):\n    \"\"\"Applies a warmup schedule on a given learning rate decay schedule.\"\"\"\n\n    def __init__(\n        self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None,\n    ):\n        super().__init__()\n        self.initial_learning_rate = initial_learning_rate\n        self.warmup_steps = warmup_steps\n        self.power = power\n        self.decay_schedule_fn = decay_schedule_fn\n        self.name = name\n\n    def __call__(self, step):\n        with tf.name_scope(self.name or \"WarmUp\") as name:\n            # Implements polynomial warmup. i.e., if global_step < warmup_steps, the\n            # learning rate will be `global_step/num_warmup_steps * init_lr`.\n            global_step_float = tf.cast(step, tf.float32)\n            warmup_steps_float = tf.cast(self.warmup_steps, tf.float32)\n            warmup_percent_done = global_step_float / warmup_steps_float\n            warmup_learning_rate = self.initial_learning_rate * tf.math.pow(warmup_percent_done, self.power)\n            return tf.cond(\n                global_step_float < warmup_steps_float,\n                lambda: warmup_learning_rate,\n                lambda: self.decay_schedule_fn(step),\n                name=name,\n            )\n\n    def get_config(self):\n        return {\n            \"initial_learning_rate\": self.initial_learning_rate,\n            \"decay_schedule_fn\": self.decay_schedule_fn,\n            \"warmup_steps\": self.warmup_steps,\n            \"power\": self.power,\n            \"name\": self.name,\n        }\n\n\ndef create_optimizer(init_lr, num_train_steps, num_warmup_steps, end_lr=0.0, optimizer_type=\"adamw\"):\n    \"\"\"Creates an optimizer with learning rate schedule.\"\"\"\n    # Implements linear decay of the learning rate.\n    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n        initial_learning_rate=init_lr, decay_steps=num_train_steps, end_learning_rate=end_lr,\n    )\n    if num_warmup_steps:\n        lr_schedule = WarmUp(\n            initial_learning_rate=init_lr, decay_schedule_fn=lr_schedule, warmup_steps=num_warmup_steps,\n        )\n\n    optimizer = AdamWeightDecay(\n        learning_rate=lr_schedule,\n        weight_decay_rate=0.01,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-6,\n        exclude_from_weight_decay=[\"LayerNorm\", \"layer_norm\", \"bias\"],\n    )\n\n    return optimizer\n\n\nclass AdamWeightDecay(tf.keras.optimizers.Adam):\n    \"\"\"Adam enables L2 weight decay and clip_by_global_norm on gradients.\n  Just adding the square of the weights to the loss function is *not* the\n  correct way of using L2 
regularization/weight decay with Adam, since that will\n  interact with the m and v parameters in strange ways.\n  Instead we want ot decay the weights in a manner that doesn't interact with\n  the m/v parameters. This is equivalent to adding the square of the weights to\n  the loss with plain (non-momentum) SGD.\n  \"\"\"\n\n    def __init__(\n        self,\n        learning_rate=0.001,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-7,\n        amsgrad=False,\n        weight_decay_rate=0.0,\n        include_in_weight_decay=None,\n        exclude_from_weight_decay=None,\n        name=\"AdamWeightDecay\",\n        **kwargs\n    ):\n        super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)\n        self.weight_decay_rate = weight_decay_rate\n        self._include_in_weight_decay = include_in_weight_decay\n        self._exclude_from_weight_decay = exclude_from_weight_decay\n\n    @classmethod\n    def from_config(cls, config):\n        \"\"\"Creates an optimizer from its config with WarmUp custom object.\"\"\"\n        custom_objects = {\"WarmUp\": WarmUp}\n        return super(AdamWeightDecay, cls).from_config(config, custom_objects=custom_objects)\n\n    def _prepare_local(self, var_device, var_dtype, apply_state):\n        super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, apply_state)\n        apply_state[(var_device, var_dtype)][\"weight_decay_rate\"] = tf.constant(\n            self.weight_decay_rate, name=\"adam_weight_decay_rate\"\n        )\n\n    def _decay_weights_op(self, var, learning_rate, apply_state):\n        do_decay = self._do_use_weight_decay(var.name)\n        if do_decay:\n            return var.assign_sub(\n                learning_rate * var * apply_state[(var.device, var.dtype.base_dtype)][\"weight_decay_rate\"],\n                use_locking=self._use_locking,\n            )\n        return tf.no_op()\n\n    def apply_gradients(self, grads_and_vars, name=None):\n        grads, tvars = list(zip(*grads_and_vars))\n        return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars), name=name,)\n\n    def _get_lr(self, var_device, var_dtype, apply_state):\n        \"\"\"Retrieves the learning rate with the given state.\"\"\"\n        if apply_state is None:\n            return self._decayed_lr_t[var_dtype], {}\n\n        apply_state = apply_state or {}\n        coefficients = apply_state.get((var_device, var_dtype))\n        if coefficients is None:\n            coefficients = self._fallback_apply_state(var_device, var_dtype)\n            apply_state[(var_device, var_dtype)] = coefficients\n\n        return coefficients[\"lr_t\"], dict(apply_state=apply_state)\n\n    def _resource_apply_dense(self, grad, var, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_dense(grad, var, **kwargs)\n\n    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_sparse(grad, var, indices, **kwargs)\n\n    def get_config(self):\n        config = super().get_config()\n        config.update({\"weight_decay_rate\": 
self.weight_decay_rate})\n        return config\n\n    def _do_use_weight_decay(self, param_name):\n        \"\"\"Whether to use L2 weight decay for `param_name`.\"\"\"\n        if self.weight_decay_rate == 0:\n            return False\n\n        if self._include_in_weight_decay:\n            for r in self._include_in_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return True\n\n        if self._exclude_from_weight_decay:\n            for r in self._exclude_from_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return False\n        return True\n\n\n# Extracted from https://github.com/OpenNMT/OpenNMT-tf/blob/master/opennmt/optimizers/utils.py\nclass GradientAccumulator(object):\n    \"\"\"Gradient accumulation utility.\n  When used with a distribution strategy, the accumulator should be called in a\n  replica context. Gradients will be accumulated locally on each replica and\n  without synchronization. Users should then call ``.gradients``, scale the\n  gradients if required, and pass the result to ``apply_gradients``.\n  \"\"\"\n\n    # We use the ON_READ synchronization policy so that no synchronization is\n    # performed on assignment. To get the value, we call .value() which returns the\n    # value on the current replica without synchronization.\n\n    def __init__(self):\n        \"\"\"Initializes the accumulator.\"\"\"\n        self._gradients = []\n        self._accum_steps = None\n\n    @property\n    def step(self):\n        \"\"\"Number of accumulated steps.\"\"\"\n        if self._accum_steps is None:\n            self._accum_steps = tf.Variable(\n                tf.constant(0, dtype=tf.int64),\n                trainable=False,\n                synchronization=tf.VariableSynchronization.ON_READ,\n                aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n            )\n\n        return self._accum_steps.value()\n\n    @property\n    def gradients(self):\n        \"\"\"The accumulated gradients on the current replica.\"\"\"\n        if not self._gradients:\n            raise ValueError(\"The accumulator should be called first to initialize the gradients\")\n        return list(gradient.value() if gradient is not None else gradient for gradient in self._gradients)\n\n    def __call__(self, gradients):\n        \"\"\"Accumulates :obj:`gradients` on the current replica.\"\"\"\n        if not self._gradients:\n            _ = self.step  # Create the step variable.\n            self._gradients.extend(\n                [\n                    tf.Variable(\n                        tf.zeros_like(gradient),\n                        trainable=False,\n                        synchronization=tf.VariableSynchronization.ON_READ,\n                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n                    )\n                    if gradient is not None\n                    else gradient\n                    for gradient in gradients\n                ]\n            )\n        if len(gradients) != len(self._gradients):\n            raise ValueError(\"Expected %s gradients, but got %d\" % (len(self._gradients), len(gradients)))\n\n        for accum_gradient, gradient in zip(self._gradients, gradients):\n            if accum_gradient is not None and gradient is not None:\n                accum_gradient.assign_add(gradient)\n\n        self._accum_steps.assign_add(1)\n\n    def reset(self):\n        \"\"\"Resets the accumulated gradients on the current replica.\"\"\"\n        
if not self._gradients:\n            return\n        self._accum_steps.assign(0)\n        for gradient in self._gradients:\n            if gradient is not None:\n                gradient.assign(tf.zeros_like(gradient))\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/pipelines.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport csv\nimport json\nimport logging\nimport os\nimport pickle\nimport sys\nfrom abc import ABC, abstractmethod\nfrom contextlib import contextmanager\nfrom itertools import chain\nfrom os.path import abspath, exists\nfrom typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union\n\nimport numpy as np\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .data import SquadExample, squad_convert_examples_to_features\nfrom .file_utils import is_tf_available, is_torch_available\nfrom .modelcard import ModelCard\nfrom .tokenization_auto import AutoTokenizer\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nif is_tf_available():\n    import tensorflow as tf\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelForTokenClassification,\n        TFAutoModelWithLMHead,\n    )\n\nif is_torch_available():\n    import torch\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelForTokenClassification,\n        AutoModelWithLMHead,\n    )\n\nif TYPE_CHECKING:\n    from .modeling_utils import PreTrainedModel\n    from .modeling_tf_utils import TFPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_framework(model=None):\n    \"\"\" Select framework (TensorFlow/PyTorch) to use.\n        If both frameworks are installed and no specific model is provided, defaults to using PyTorch.\n    \"\"\"\n    if is_tf_available() and is_torch_available() and model is not None and not isinstance(model, str):\n        # Both framework are available but the user supplied a model class instance.\n        # Try to guess which framework to use from the model classname\n        framework = \"tf\" if model.__class__.__name__.startswith(\"TF\") else \"pt\"\n    elif not is_tf_available() and not is_torch_available():\n        raise RuntimeError(\n            \"At least one of TensorFlow 2.0 or PyTorch should be installed. 
\"\n            \"To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ \"\n            \"To install PyTorch, read the instructions at https://pytorch.org/.\"\n        )\n    else:\n        # framework = 'tf' if is_tf_available() else 'pt'\n        framework = \"pt\" if is_torch_available() else \"tf\"\n    return framework\n\n\nclass ArgumentHandler(ABC):\n    \"\"\"\n    Base interface for handling varargs for each Pipeline\n    \"\"\"\n\n    @abstractmethod\n    def __call__(self, *args, **kwargs):\n        raise NotImplementedError()\n\n\nclass DefaultArgumentHandler(ArgumentHandler):\n    \"\"\"\n    Default varargs argument parser handling parameters for each Pipeline\n    \"\"\"\n\n    @staticmethod\n    def handle_kwargs(kwargs: Dict) -> List:\n        if len(kwargs) == 1:\n            output = list(kwargs.values())\n        else:\n            output = list(chain(kwargs.values()))\n\n        return DefaultArgumentHandler.handle_args(output)\n\n    @staticmethod\n    def handle_args(args: Sequence[Any]) -> List[str]:\n\n        # Only one argument, let's do case by case\n        if len(args) == 1:\n            if isinstance(args[0], str):\n                return [args[0]]\n            elif not isinstance(args[0], list):\n                return list(args)\n            else:\n                return args[0]\n\n        # Multiple arguments (x1, x2, ...)\n        elif len(args) > 1:\n            if all([isinstance(arg, str) for arg in args]):\n                return list(args)\n\n            # If not instance of list, then it should instance of iterable\n            elif isinstance(args, Iterable):\n                return list(chain.from_iterable(chain(args)))\n            else:\n                raise ValueError(\n                    \"Invalid input type {}. 
Pipeline supports Union[str, Iterable[str]]\".format(type(args))\n                )\n        else:\n            return []\n\n    def __call__(self, *args, **kwargs):\n        if len(kwargs) > 0 and len(args) > 0:\n            raise ValueError(\"Pipeline cannot handle mixed args and kwargs\")\n\n        if len(kwargs) > 0:\n            return DefaultArgumentHandler.handle_kwargs(kwargs)\n        else:\n            return DefaultArgumentHandler.handle_args(args)\n\n\nclass PipelineDataFormat:\n    \"\"\"\n    Base class for all the pipeline supported data format both for reading and writing.\n    Supported data formats currently includes:\n     - JSON\n     - CSV\n     - stdin/stdout (pipe)\n\n    PipelineDataFormat also includes some utilities to work with multi-columns like mapping from datasets columns\n    to pipelines keyword arguments through the `dataset_kwarg_1=dataset_column_1` format.\n    \"\"\"\n\n    SUPPORTED_FORMATS = [\"json\", \"csv\", \"pipe\"]\n\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        self.output_path = output_path\n        self.input_path = input_path\n        self.column = column.split(\",\") if column is not None else [\"\"]\n        self.is_multi_columns = len(self.column) > 1\n\n        if self.is_multi_columns:\n            self.column = [tuple(c.split(\"=\")) if \"=\" in c else (c, c) for c in self.column]\n\n        if output_path is not None and not overwrite:\n            if exists(abspath(self.output_path)):\n                raise OSError(\"{} already exists on disk\".format(self.output_path))\n\n        if input_path is not None:\n            if not exists(abspath(self.input_path)):\n                raise OSError(\"{} doesnt exist on disk\".format(self.input_path))\n\n    @abstractmethod\n    def __iter__(self):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def save(self, data: dict):\n        \"\"\"\n        Save the provided data object with the representation for the current `DataFormat`.\n        :param data: data to store\n        :return:\n        \"\"\"\n        raise NotImplementedError()\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        \"\"\"\n        Save the provided data object as a pickle-formatted binary data on the disk.\n        :param data: data to store\n        :return: (str) Path where the data has been saved\n        \"\"\"\n        path, _ = os.path.splitext(self.output_path)\n        binary_path = os.path.extsep.join((path, \"pickle\"))\n\n        with open(binary_path, \"wb+\") as f_output:\n            pickle.dump(data, f_output)\n\n        return binary_path\n\n    @staticmethod\n    def from_str(\n        format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        if format == \"json\":\n            return JsonPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"csv\":\n            return CsvPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"pipe\":\n            return PipedPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        else:\n            raise KeyError(\"Unknown reader {} (Available reader are json/csv/pipe)\".format(format))\n\n\nclass CsvPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], 
overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n    def __iter__(self):\n        with open(self.input_path, \"r\") as f:\n            reader = csv.DictReader(f)\n            for row in reader:\n                if self.is_multi_columns:\n                    yield {k: row[c] for k, c in self.column}\n                else:\n                    yield row[self.column[0]]\n\n    def save(self, data: List[dict]):\n        with open(self.output_path, \"w\") as f:\n            if len(data) > 0:\n                writer = csv.DictWriter(f, list(data[0].keys()))\n                writer.writeheader()\n                writer.writerows(data)\n\n\nclass JsonPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n        with open(input_path, \"r\") as f:\n            self._entries = json.load(f)\n\n    def __iter__(self):\n        for entry in self._entries:\n            if self.is_multi_columns:\n                yield {k: entry[c] for k, c in self.column}\n            else:\n                yield entry[self.column[0]]\n\n    def save(self, data: dict):\n        with open(self.output_path, \"w\") as f:\n            json.dump(data, f)\n\n\nclass PipedPipelineDataFormat(PipelineDataFormat):\n    \"\"\"\n    Read data from piped input to the python process.\n    For multi columns data, columns should separated by \\t\n\n    If columns are provided, then the output will be a dictionary with {column_x: value_x}\n    \"\"\"\n\n    def __iter__(self):\n        for line in sys.stdin:\n            # Split for multi-columns\n            if \"\\t\" in line:\n\n                line = line.split(\"\\t\")\n                if self.column:\n                    # Dictionary to map arguments\n                    yield {kwargs: l for (kwargs, _), l in zip(self.column, line)}\n                else:\n                    yield tuple(line)\n\n            # No dictionary to map arguments\n            else:\n                yield line\n\n    def save(self, data: dict):\n        print(data)\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        if self.output_path is None:\n            raise KeyError(\n                \"When using piped input on pipeline outputting large object requires an output file path. \"\n                \"Please provide such output path through --output argument.\"\n            )\n\n        return super().save_binary(data)\n\n\nclass _ScikitCompat(ABC):\n    \"\"\"\n    Interface layer for the Scikit and Keras compatibility.\n    \"\"\"\n\n    @abstractmethod\n    def transform(self, X):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def predict(self, X):\n        raise NotImplementedError()\n\n\nclass Pipeline(_ScikitCompat):\n    \"\"\"\n    The Pipeline class is the class from which all pipelines inherit. Refer to this class for methods shared across\n    different pipelines.\n\n    Base class implementing pipelined operations.\n    Pipeline workflow is defined as a sequence of the following operations:\n        Input -> Tokenization -> Model Inference -> Post-Processing (Task dependent) -> Output\n\n    Pipeline supports running on CPU or GPU through the device argument. 
Users can specify\n    device argument as an integer, -1 meaning \"CPU\", >= 0 referring the CUDA device ordinal.\n\n    Some pipeline, like for instance FeatureExtractionPipeline ('feature-extraction') outputs large\n    tensor object as nested-lists. In order to avoid dumping such large structure as textual data we\n    provide the binary_output constructor argument. If set to True, the output will be stored in the\n    pickle format.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n        binary_output (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Flag indicating if the output the pipeline should happen in a binary format (i.e. 
pickle) or as raw text.\n\n    Return:\n        :obj:`List` or :obj:`Dict`:\n        Pipeline returns list or dictionary depending on:\n\n         - Whether the user supplied multiple samples\n         - Whether the pipeline exposes multiple fields in the output object\n    \"\"\"\n\n    default_input_names = None\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        task: str = \"\",\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n    ):\n\n        if framework is None:\n            framework = get_framework()\n\n        self.model = model\n        self.tokenizer = tokenizer\n        self.modelcard = modelcard\n        self.framework = framework\n        self.device = device if framework == \"tf\" else torch.device(\"cpu\" if device < 0 else \"cuda:{}\".format(device))\n        self.binary_output = binary_output\n        self._args_parser = args_parser or DefaultArgumentHandler()\n\n        # Special handling\n        if self.framework == \"pt\" and self.device.type == \"cuda\":\n            self.model = self.model.to(self.device)\n\n        # Update config with task specific parameters\n        task_specific_params = self.model.config.task_specific_params\n        if task_specific_params is not None and task in task_specific_params:\n            self.model.config.update(task_specific_params.get(task))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save the pipeline's model and tokenizer to the specified save_directory\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Provided path ({}) should be a directory\".format(save_directory))\n            return\n\n        self.model.save_pretrained(save_directory)\n        self.tokenizer.save_pretrained(save_directory)\n        if self.modelcard is not None:\n            self.modelcard.save_pretrained(save_directory)\n\n    def transform(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    def predict(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. 
This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    @contextmanager\n    def device_placement(self):\n        \"\"\"\n        Context Manager allowing tensor allocation on the user-specified device in framework agnostic way.\n        example:\n            # Explicitly ask for tensor allocation on CUDA device :0\n            nlp = pipeline(..., device=0)\n            with nlp.device_placement():\n                # Every framework specific tensor allocation will be done on the request device\n                output = nlp(...)\n        Returns:\n            Context manager\n        \"\"\"\n        if self.framework == \"tf\":\n            with tf.device(\"/CPU:0\" if self.device == -1 else \"/device:GPU:{}\".format(self.device)):\n                yield\n        else:\n            if self.device.type == \"cuda\":\n                torch.cuda.set_device(self.device)\n\n            yield\n\n    def ensure_tensor_on_device(self, **inputs):\n        \"\"\"\n        Ensure PyTorch tensors are on the specified device.\n        :param inputs:\n        :return:\n        \"\"\"\n        return {name: tensor.to(self.device) for name, tensor in inputs.items()}\n\n    def _parse_and_tokenize(self, *args, pad_to_max_length=True, add_special_tokens=True, **kwargs):\n        \"\"\"\n        Parse arguments and tokenize\n        \"\"\"\n        # Parse arguments\n        inputs = self._args_parser(*args, **kwargs)\n        inputs = self.tokenizer.batch_encode_plus(\n            inputs,\n            add_special_tokens=add_special_tokens,\n            return_tensors=self.framework,\n            pad_to_max_length=pad_to_max_length,\n        )\n\n        return inputs\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        return self._forward(inputs)\n\n    def _forward(self, inputs, return_tensors=False):\n        \"\"\"\n        Internal framework specific forward dispatching.\n        Args:\n            inputs: dict holding all the keyworded arguments for required by the model forward method.\n            return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array.\n        Returns:\n            Numpy array\n        \"\"\"\n        # Encode for forward\n        with self.device_placement():\n            if self.framework == \"tf\":\n                # TODO trace model\n                predictions = self.model(inputs.data, training=False)[0]\n            else:\n                with torch.no_grad():\n                    inputs = self.ensure_tensor_on_device(**inputs)\n                    predictions = self.model(**inputs)[0].cpu()\n\n        if return_tensors:\n            return predictions\n        else:\n            return predictions.numpy()\n\n\nclass FeatureExtractionPipeline(Pipeline):\n    \"\"\"\n    Feature extraction pipeline using Model head. This pipeline extracts the hidden states from the base transformer,\n    which can be used as features in downstream tasks.\n\n    This feature extraction pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"feature-extraction\", for extracting features of a sequence.\n\n    All models may be used for this pipeline. 
See a list of all models, including community-contributed models on\n    `huggingface.co/models <https://huggingface.co/models>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n    def __call__(self, *args, **kwargs):\n        return super().__call__(*args, **kwargs).tolist()\n\n\nclass TextGenerationPipeline(Pipeline):\n    \"\"\"\n    Language generation pipeline using any ModelWithLMHead head. This pipeline predicts the words that will follow a specified text prompt.\n\n    This language generation pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"text-generation\", for generating text from a specified prompt.\n\n    The models that this pipeline can use are models that have been trained with an autoregressive language modeling objective,\n    which includes the uni-directional models in the library (e.g. 
gpt2).\n    See the list of available community models on\n    `huggingface.co/models <https://huggingface.co/models?search=&filter=lm-head>`__.\n    \"\"\"\n\n    # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia\n    # in https://github.com/rusiaaman/XLNet-gen#methodology\n    # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e\n\n    PADDING_TEXT = \"\"\"In 1991, the remains of Russian Tsar Nicholas II and his family\n    (except for Alexei and Maria) are discovered.\n    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the\n    remainder of the story. 1883 Western Siberia,\n    a young Grigori Rasputin is asked by his father and a group of men to perform magic.\n    Rasputin has a vision and denounces one of the men as a horse thief. Although his\n    father initially slaps him for making such an accusation, Rasputin watches as the\n    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of\n    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,\n    with people, even a bishop, begging for his blessing. <eod> </s> <eos>\"\"\"\n\n    ALLOWED_MODELS = [\n        \"XLNetLMHeadModel\",\n        \"TransfoXLLMHeadModel\",\n        \"ReformerModelWithLMHead\",\n        \"GPT2LMHeadModel\",\n        \"OpenAIGPTLMHeadModel\",\n        \"CTRLLMHeadModel\",\n        \"TFXLNetLMHeadModel\",\n        \"TFTransfoXLLMHeadModel\",\n        \"TFGPT2LMHeadModel\",\n        \"TFOpenAIGPTLMHeadModel\",\n        \"TFCTRLLMHeadModel\",\n    ]\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        if self.model.__class__.__name__ not in self.ALLOWED_MODELS:\n            raise NotImplementedError(\n                \"Generation is currently not supported for {}. Please select a model from {} for generation.\".format(\n                    self.model.__class__.__name__, self.ALLOWED_MODELS\n                )\n            )\n\n        text_inputs = self._args_parser(*args)\n\n        results = []\n        for prompt_text in text_inputs:\n            # Manage correct placement of the tensors\n            with self.device_placement():\n                if self.model.__class__.__name__ in [\"XLNetLMHeadModel\", \"TransfoXLLMHeadModel\"]:\n                    inputs = self._parse_and_tokenize(\n                        self.PADDING_TEXT + prompt_text, pad_to_max_length=False, add_special_tokens=False\n                    )\n                else:\n                    inputs = self._parse_and_tokenize(prompt_text, pad_to_max_length=False, add_special_tokens=False)\n\n                # set input_ids to None to allow empty prompt\n                if inputs[\"input_ids\"].shape[-1] == 0:\n                    inputs[\"input_ids\"] = None\n                    inputs[\"attention_mask\"] = None\n\n                if self.framework == \"pt\" and inputs[\"input_ids\"] is not None:\n                    inputs = self.ensure_tensor_on_device(**inputs)\n\n                input_ids = inputs[\"input_ids\"]\n\n                # Ensure that batch size = 1 (batch generation not allowed for now)\n                assert (\n                    input_ids is None or input_ids.shape[0] == 1\n                ), \"Batch generation is currently not supported. 
See https://github.com/huggingface/transformers/issues/3021 for more information.\"\n\n                output_sequences = self.model.generate(input_ids=input_ids, **generate_kwargs)  # BS x SL\n\n            result = []\n            for generated_sequence in output_sequences:\n                generated_sequence = generated_sequence.numpy().tolist()\n                record = {}\n                if return_tensors:\n                    record[\"generated_token_ids\"] = generated_sequence\n                if return_text:\n                    # Decode text\n                    text = self.tokenizer.decode(\n                        generated_sequence,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n\n                    # Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used\n                    if input_ids is None:\n                        prompt_length = 0\n                    else:\n                        prompt_length = len(\n                            self.tokenizer.decode(\n                                input_ids[0],\n                                skip_special_tokens=True,\n                                clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                            )\n                        )\n\n                    record[\"generated_text\"] = prompt_text + text[prompt_length:]\n\n                result.append(record)\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n\n        return results\n\n\nclass TextClassificationPipeline(Pipeline):\n    \"\"\"\n    Text classification pipeline using ModelForSequenceClassification head. See the\n    `sequence classification usage <../usage.html#sequence-classification>`__ examples for more information.\n\n    This text classification pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"sentiment-analysis\", for classifying sequences according to positive or negative sentiments.\n\n    The models that this pipeline can use are models that have been fine-tuned on a sequence classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=text-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. 
If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        outputs = super().__call__(*args, **kwargs)\n        scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)\n        return [{\"label\": self.model.config.id2label[item.argmax()], \"score\": item.max().item()} for item in scores]\n\n\nclass FillMaskPipeline(Pipeline):\n    \"\"\"\n    Masked language modeling prediction pipeline using ModelWithLMHead head. See the\n    `masked language modeling usage <../usage.html#masked-language-modeling>`__ examples for more information.\n\n    This mask filling pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"fill-mask\", for predicting masked tokens in a sequence.\n\n    The models that this pipeline can use are models that have been trained with a masked language modeling objective,\n    which includes the bi-directional models in the library.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=lm-head>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        topk=5,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n        self.topk = topk\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        outputs = self._forward(inputs, return_tensors=True)\n\n        results = []\n        batch_size = outputs.shape[0] if self.framework == \"tf\" else outputs.size(0)\n\n        for i in range(batch_size):\n            input_ids = inputs[\"input_ids\"][i]\n            result = []\n\n            if self.framework == \"tf\":\n                masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy().item()\n                logits = outputs[i, masked_index, :]\n                probs = tf.nn.softmax(logits)\n                topk = tf.math.top_k(probs, k=self.topk)\n                values, predictions = topk.values.numpy(), topk.indices.numpy()\n            else:\n                masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()\n                logits = outputs[i, masked_index, :]\n                probs = logits.softmax(dim=0)\n                values, predictions = probs.topk(self.topk)\n\n            for v, p in zip(values.tolist(), predictions.tolist()):\n                tokens = input_ids.numpy()\n                tokens[masked_index] = p\n                # Filter padding out:\n                tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]\n                result.append({\"sequence\": self.tokenizer.decode(tokens), \"score\": v, \"token\": p})\n\n            # Append\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n        return results\n\n\nclass NerPipeline(Pipeline):\n    \"\"\"\n    Named Entity Recognition pipeline using ModelForTokenClassification head. See the\n    `named entity recognition usage <../usage.html#named-entity-recognition>`__ examples for more information.\n\n    This token recognition pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"ner\", for predicting the classes of tokens in a sequence: person, organisation, location or miscellaneous.\n\n    The models that this pipeline can use are models that have been fine-tuned on a token classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=token-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"sequences\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n        ignore_labels=[\"O\"],\n        task: str = \"\",\n        grouped_entities: bool = False,\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=binary_output,\n            task=task,\n        )\n\n        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)\n        self.ignore_labels = ignore_labels\n        self.grouped_entities = grouped_entities\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._args_parser(*args, **kwargs)\n        answers = []\n        for sentence in inputs:\n\n            # Manage correct placement of the tensors\n            with self.device_placement():\n\n                tokens = self.tokenizer.encode_plus(\n                    sentence,\n                    return_attention_mask=False,\n                    return_tensors=self.framework,\n                    max_length=self.tokenizer.max_len,\n                )\n\n                # Forward\n                if self.framework == \"tf\":\n                    entities = self.model(tokens.data)[0][0].numpy()\n                    input_ids = tokens[\"input_ids\"].numpy()[0]\n                else:\n                    with torch.no_grad():\n                        tokens = self.ensure_tensor_on_device(**tokens)\n                        entities = self.model(**tokens)[0][0].cpu().numpy()\n                        input_ids = tokens[\"input_ids\"].cpu().numpy()[0]\n\n            score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)\n            labels_idx = score.argmax(axis=-1)\n\n           
 entities = []\n            entity_groups = []\n            entity_group_disagg = []\n            # Filter to labels not in `self.ignore_labels`\n            filtered_labels_idx = [\n                (idx, label_idx)\n                for idx, label_idx in enumerate(labels_idx)\n                if self.model.config.id2label[label_idx] not in self.ignore_labels\n            ]\n\n            for idx, label_idx in filtered_labels_idx:\n\n                entity = {\n                    \"word\": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),\n                    \"score\": score[idx][label_idx].item(),\n                    \"entity\": self.model.config.id2label[label_idx],\n                    \"index\": idx,\n                }\n                last_idx, _ = filtered_labels_idx[-1]\n                if self.grouped_entities:\n                    if not entity_group_disagg:\n                        entity_group_disagg += [entity]\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                        continue\n\n                    # If the current entity is similar and adjacent to the previous entity, append it to the disaggregated entity group\n                    if (\n                        entity[\"entity\"] == entity_group_disagg[-1][\"entity\"]\n                        and entity[\"index\"] == entity_group_disagg[-1][\"index\"] + 1\n                    ):\n                        entity_group_disagg += [entity]\n                        # Group the entities at the last entity\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                    # If the current entity is different from the previous entity, aggregate the disaggregated entity group\n                    else:\n                        entity_groups += [self.group_entities(entity_group_disagg)]\n                        entity_group_disagg = [entity]\n\n                entities += [entity]\n\n            # Append\n            if self.grouped_entities:\n                answers += [entity_groups]\n            else:\n                answers += [entities]\n\n        if len(answers) == 1:\n            return answers[0]\n        return answers\n\n    def group_entities(self, entities):\n        \"\"\"\n        Returns grouped entities\n        \"\"\"\n        # Get the last entity in the entity group\n        entity = entities[-1][\"entity\"]\n        scores = np.mean([entity[\"score\"] for entity in entities])\n        tokens = [entity[\"word\"] for entity in entities]\n\n        entity_group = {\n            \"entity_group\": entity,\n            \"score\": np.mean(scores),\n            \"word\": self.tokenizer.convert_tokens_to_string(tokens),\n        }\n        return entity_group\n\n\nTokenClassificationPipeline = NerPipeline\n\n\nclass QuestionAnsweringArgumentHandler(ArgumentHandler):\n    \"\"\"\n    QuestionAnsweringPipeline requires the user to provide multiple arguments (i.e. 
question & context) to be mapped\n    to internal SquadExample / SquadFeature structures.\n\n    QuestionAnsweringArgumentHandler manages all the possible to create SquadExample from the command-line supplied\n    arguments.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        # Position args, handling is sensibly the same as X and data, so forwarding to avoid duplicating\n        if args is not None and len(args) > 0:\n            if len(args) == 1:\n                kwargs[\"X\"] = args[0]\n            else:\n                kwargs[\"X\"] = list(args)\n\n        # Generic compatibility with sklearn and Keras\n        # Batched data\n        if \"X\" in kwargs or \"data\" in kwargs:\n            inputs = kwargs[\"X\"] if \"X\" in kwargs else kwargs[\"data\"]\n\n            if isinstance(inputs, dict):\n                inputs = [inputs]\n            else:\n                # Copy to avoid overriding arguments\n                inputs = [i for i in inputs]\n\n            for i, item in enumerate(inputs):\n                if isinstance(item, dict):\n                    if any(k not in item for k in [\"question\", \"context\"]):\n                        raise KeyError(\"You need to provide a dictionary with keys {question:..., context:...}\")\n\n                    inputs[i] = QuestionAnsweringPipeline.create_sample(**item)\n\n                elif not isinstance(item, SquadExample):\n                    raise ValueError(\n                        \"{} argument needs to be of type (list[SquadExample | dict], SquadExample, dict)\".format(\n                            \"X\" if \"X\" in kwargs else \"data\"\n                        )\n                    )\n\n            # Tabular input\n        elif \"question\" in kwargs and \"context\" in kwargs:\n            if isinstance(kwargs[\"question\"], str):\n                kwargs[\"question\"] = [kwargs[\"question\"]]\n\n            if isinstance(kwargs[\"context\"], str):\n                kwargs[\"context\"] = [kwargs[\"context\"]]\n\n            inputs = [\n                QuestionAnsweringPipeline.create_sample(q, c) for q, c in zip(kwargs[\"question\"], kwargs[\"context\"])\n            ]\n        else:\n            raise ValueError(\"Unknown arguments {}\".format(kwargs))\n\n        if not isinstance(inputs, list):\n            inputs = [inputs]\n\n        return inputs\n\n\nclass QuestionAnsweringPipeline(Pipeline):\n    \"\"\"\n    Question Answering pipeline using ModelForQuestionAnswering head. See the\n    `question answering usage <../usage.html#question-answering>`__ examples for more information.\n\n    This question answering can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"question-answering\", for answering questions given a context.\n\n    The models that this pipeline can use are models that have been fine-tuned on a question answering task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=question-answering>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"question,context\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        device: int = -1,\n        task: str = \"\",\n        **kwargs\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=QuestionAnsweringArgumentHandler(),\n            device=device,\n            task=task,\n            **kwargs,\n        )\n\n    @staticmethod\n    def create_sample(\n        question: Union[str, List[str]], context: Union[str, List[str]]\n    ) -> Union[SquadExample, List[SquadExample]]:\n        \"\"\"\n        QuestionAnsweringPipeline leverages the SquadExample/SquadFeatures internally.\n        This helper method encapsulate all the logic for converting question(s) and context(s) to SquadExample(s).\n        We currently support extractive question answering.\n        Arguments:\n             question: (str, List[str]) The question to be ask for the associated context\n             context: (str, List[str]) The context in which we will look for the answer.\n\n        Returns:\n            SquadExample initialized with the corresponding question and context.\n        \"\"\"\n        if isinstance(question, list):\n            return [SquadExample(None, q, c, None, None, None) for q, c in zip(question, context)]\n        else:\n            return SquadExample(None, question, context, None, None, None)\n\n    def __call__(self, *args, **kwargs):\n        \"\"\"\n        Args:\n            We support multiple use-cases, the following are exclusive:\n            X: sequence of SquadExample\n            data: sequence of SquadExample\n            question: (str, List[str]), batch of question(s) to map along with context\n            context: (str, List[str]), batch of context(s) associated with the provided question keyword argument\n        
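Example, as a minimal sketch of the question/context keyword usage described above (the strings are purely illustrative)::\n\n            nlp = pipeline(\"question-answering\")\n            # Returns a dict such as {'answer': ..., 'score': ..., 'start': ..., 'end': ...}\n            nlp(question=\"Where do I live?\", context=\"My name is Wolfgang and I live in Berlin\")\n\n        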
Returns:\n            dict: {'answer': str, 'score': float, 'start': int, 'end': int}\n            answer: the textual answer in the initial context\n            score: the score the current answer scored for the model\n            start: the character index in the original string corresponding to the beginning of the answer's span\n            end: the character index in the original string corresponding to the ending of the answer's span\n        \"\"\"\n        # Set default values\n        kwargs.setdefault(\"topk\", 1)\n        kwargs.setdefault(\"doc_stride\", 128)\n        kwargs.setdefault(\"max_answer_len\", 15)\n        kwargs.setdefault(\"max_seq_len\", 384)\n        kwargs.setdefault(\"max_question_len\", 64)\n        kwargs.setdefault(\"handle_impossible_answer\", False)\n\n        if kwargs[\"topk\"] < 1:\n            raise ValueError(\"topk parameter should be >= 1 (got {})\".format(kwargs[\"topk\"]))\n\n        if kwargs[\"max_answer_len\"] < 1:\n            raise ValueError(\"max_answer_len parameter should be >= 1 (got {})\".format(kwargs[\"max_answer_len\"]))\n\n        # Convert inputs to features\n        examples = self._args_parser(*args, **kwargs)\n        features_list = [\n            squad_convert_examples_to_features(\n                [example],\n                self.tokenizer,\n                kwargs[\"max_seq_len\"],\n                kwargs[\"doc_stride\"],\n                kwargs[\"max_question_len\"],\n                False,\n                tqdm_enabled=False,\n            )\n            for example in examples\n        ]\n        all_answers = []\n        for features, example in zip(features_list, examples):\n            model_input_names = self.tokenizer.model_input_names + [\"input_ids\"]\n            fw_args = {k: [feature.__dict__[k] for feature in features] for k in model_input_names}\n\n            # Manage tensor allocation on correct device\n            with self.device_placement():\n                if self.framework == \"tf\":\n                    fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}\n                    start, end = self.model(fw_args)\n                    start, end = start.numpy(), end.numpy()\n                else:\n                    with torch.no_grad():\n                        # Retrieve the score for the context tokens only (removing question tokens)\n                        fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}\n                        start, end = self.model(**fw_args)\n                        start, end = start.cpu().numpy(), end.cpu().numpy()\n\n            min_null_score = 1000000  # large and positive\n            answers = []\n            for (feature, start_, end_) in zip(features, start, end):\n                # Normalize logits and spans to retrieve the answer\n                start_ = np.exp(start_) / np.sum(np.exp(start_))\n                end_ = np.exp(end_) / np.sum(np.exp(end_))\n\n                # Mask padding and question\n                start_, end_ = (\n                    start_ * np.abs(np.array(feature.p_mask) - 1),\n                    end_ * np.abs(np.array(feature.p_mask) - 1),\n                )\n\n                if kwargs[\"handle_impossible_answer\"]:\n                    min_null_score = min(min_null_score, (start_[0] * end_[0]).item())\n\n                start_[0] = end_[0] = 0\n\n                starts, ends, scores = self.decode(start_, end_, kwargs[\"topk\"], kwargs[\"max_answer_len\"])\n                char_to_word = 
np.array(example.char_to_word_offset)\n\n                # Convert the answer (tokens) back to the original text\n                answers += [\n                    {\n                        \"score\": score.item(),\n                        \"start\": np.where(char_to_word == feature.token_to_orig_map[s])[0][0].item(),\n                        \"end\": np.where(char_to_word == feature.token_to_orig_map[e])[0][-1].item(),\n                        \"answer\": \" \".join(\n                            example.doc_tokens[feature.token_to_orig_map[s] : feature.token_to_orig_map[e] + 1]\n                        ),\n                    }\n                    for s, e, score in zip(starts, ends, scores)\n                ]\n\n            if kwargs[\"handle_impossible_answer\"]:\n                answers.append({\"score\": min_null_score, \"start\": 0, \"end\": 0, \"answer\": \"\"})\n\n            answers = sorted(answers, key=lambda x: x[\"score\"], reverse=True)[: kwargs[\"topk\"]]\n            all_answers += answers\n\n        if len(all_answers) == 1:\n            return all_answers[0]\n        return all_answers\n\n    def decode(self, start: np.ndarray, end: np.ndarray, topk: int, max_answer_len: int) -> Tuple:\n        \"\"\"\n        Take the output of any QuestionAnswering head and generate probabilities for each span to be\n        the actual answer.\n        In addition, it filters out some unwanted/impossible cases like answer len being greater than\n        max_answer_len or answer end position being before the starting position.\n        The method supports outputting the k-best answers through the topk argument.\n\n        Args:\n            start: numpy array, holding individual start probabilities for each token\n            end: numpy array, holding individual end probabilities for each token\n            topk: int, indicates how many possible answer span(s) to extract from the model's output\n            max_answer_len: int, maximum size of the answer to extract from the model's output\n        \"\"\"\n        # Ensure we have batch axis\n        if start.ndim == 1:\n            start = start[None]\n\n        if end.ndim == 1:\n            end = end[None]\n\n        # Compute the score of each tuple(start, end) to be the real answer\n        outer = np.matmul(np.expand_dims(start, -1), np.expand_dims(end, 1))\n\n        # Remove candidates with end < start or end - start > max_answer_len\n        candidates = np.tril(np.triu(outer), max_answer_len - 1)\n\n        #  Inspired by Chen & al. 
(https://github.com/facebookresearch/DrQA)\n        scores_flat = candidates.flatten()\n        if topk == 1:\n            idx_sort = [np.argmax(scores_flat)]\n        elif len(scores_flat) < topk:\n            idx_sort = np.argsort(-scores_flat)\n        else:\n            idx = np.argpartition(-scores_flat, topk)[0:topk]\n            idx_sort = idx[np.argsort(-scores_flat[idx])]\n\n        start, end = np.unravel_index(idx_sort, candidates.shape)[1:]\n        return start, end, candidates[0, start, end]\n\n    def span_to_answer(self, text: str, start: int, end: int):\n        \"\"\"\n        When decoding from token probalities, this method maps token indexes to actual word in\n        the initial context.\n\n        Args:\n            text: str, the actual context to extract the answer from\n            start: int, starting answer token index\n            end: int, ending answer token index\n\n        Returns:\n            dict: {'answer': str, 'start': int, 'end': int}\n        \"\"\"\n        words = []\n        token_idx = char_start_idx = char_end_idx = chars_idx = 0\n\n        for i, word in enumerate(text.split(\" \")):\n            token = self.tokenizer.tokenize(word)\n\n            # Append words if they are in the span\n            if start <= token_idx <= end:\n                if token_idx == start:\n                    char_start_idx = chars_idx\n\n                if token_idx == end:\n                    char_end_idx = chars_idx + len(word)\n\n                words += [word]\n\n            # Stop if we went over the end of the answer\n            if token_idx > end:\n                break\n\n            # Append the subtokenization length to the running index\n            token_idx += len(token)\n            chars_idx += len(word) + 1\n\n        # Join text with spaces\n        return {\n            \"answer\": \" \".join(words),\n            \"start\": max(0, char_start_idx),\n            \"end\": min(len(text), char_end_idx),\n        }\n\n\nclass SummarizationPipeline(Pipeline):\n    \"\"\"\n    Summarize news articles and other documents\n\n    Usage::\n\n        # use bart in pytorch\n        summarizer = pipeline(\"summarization\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n        # use t5 in tf\n        summarizer = pipeline(\"summarization\", model=\"t5-base\", tokenizer=\"t5-base\", framework=\"tf\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n    The models that this pipeline can use are models that have been fine-tuned on a summarization task,\n    which is currently, '`bart-large-cnn`', '`t5-small`', '`t5-base`', '`t5-large`', '`t5-3b`', '`t5-11b`'.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=summarization>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *documents, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *documents: (list of strings) articles to be summarized\n            return_text: (bool, default=True) whether to add a decoded \"summary_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"summary_token_ids\" to each result\n\n            clean_up_tokenization_spaces: (`optional`) bool whether to include extra spaces in the output\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'summary_text' and/or 'summary_token_ids' for each document_to_summarize\n\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n        assert len(documents) > 0, \"Please provide a document to summarize\"\n\n        if self.framework == \"tf\" and \"BartForConditionalGeneration\" in self.model.__class__.__name__:\n            raise NotImplementedError(\n                \"Tensorflow is not yet supported for Bart. Please consider using T5, e.g. 
`t5-base`\"\n            )\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(documents[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n\n            documents = ([prefix + document for document in documents[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(documents[0], str):\n            documents = (prefix + documents[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `documents[0]`: {} has the wrong format. It should be either of type `str` or type `list`\".format(\n                    documents[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*documents, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            min_length = generate_kwargs.get(\"min_length\", self.model.config.min_length)\n            if input_length < min_length // 2:\n                logger.warning(\n                    \"Your min_length is set to {}, but your input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)\".format(\n                        min_length, input_length\n                    )\n                )\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length < max_length:\n                logger.warning(\n                    \"Your max_length is set to {}, but your input_length is only {}. You might consider decreasing max_length manually, e.g. 
summarizer('...', max_length=50)\".format(\n                        max_length, input_length\n                    )\n                )\n\n            summaries = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n\n            results = []\n            for summary in summaries:\n                record = {}\n                if return_tensors:\n                    record[\"summary_token_ids\"] = summary\n                if return_text:\n                    record[\"summary_text\"] = self.tokenizer.decode(\n                        summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\nclass TranslationPipeline(Pipeline):\n    \"\"\"\n    Translates from one language to another.\n\n    Usage::\n        en_fr_translator = pipeline(\"translation_en_to_fr\")\n        en_fr_translator(\"How old are you?\")\n\n    The models that this pipeline can use are models that have been fine-tuned on a translation task,\n    currently: \"t5-small\", \"t5-base\", \"t5-large\", \"t5-3b\", \"t5-11b\"\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=translation>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *args: (list of strings) texts to be translated\n            return_text: (bool, default=True) whether to add a decoded \"translation_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"translation_token_ids\" to each result\n\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'translation_text' and/or 'translation_token_ids' for each text_to_translate\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(args[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n            args = ([prefix + text for text in args[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(args[0], str):\n            args = (prefix + args[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `args[0]`: {} has the wrong format. It should be either of type `str` or type `list`\".format(\n                    args[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*args, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length > 0.9 * max_length:\n                logger.warning(\n                    \"Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. 
translator('...', max_length=400)\".format(\n                        input_length, max_length\n                    )\n                )\n\n            translations = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n            results = []\n            for translation in translations:\n                record = {}\n                if return_tensors:\n                    record[\"translation_token_ids\"] = translation\n                if return_text:\n                    record[\"translation_text\"] = self.tokenizer.decode(\n                        translation,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\n# Register all the supported tasks here\nSUPPORTED_TASKS = {\n    \"feature-extraction\": {\n        \"impl\": FeatureExtractionPipeline,\n        \"tf\": TFAutoModel if is_tf_available() else None,\n        \"pt\": AutoModel if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased\", \"tf\": \"distilbert-base-cased\"},\n            \"config\": None,\n            \"tokenizer\": \"distilbert-base-cased\",\n        },\n    },\n    \"sentiment-analysis\": {\n        \"impl\": TextClassificationPipeline,\n        \"tf\": TFAutoModelForSequenceClassification if is_tf_available() else None,\n        \"pt\": AutoModelForSequenceClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n                \"tf\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            },\n            \"config\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            \"tokenizer\": \"distilbert-base-uncased\",\n        },\n    },\n    \"ner\": {\n        \"impl\": NerPipeline,\n        \"tf\": TFAutoModelForTokenClassification if is_tf_available() else None,\n        \"pt\": AutoModelForTokenClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n                \"tf\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            },\n            \"config\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            \"tokenizer\": \"bert-large-cased\",\n        },\n    },\n    \"question-answering\": {\n        \"impl\": QuestionAnsweringPipeline,\n        \"tf\": TFAutoModelForQuestionAnswering if is_tf_available() else None,\n        \"pt\": AutoModelForQuestionAnswering if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased-distilled-squad\", \"tf\": \"distilbert-base-cased-distilled-squad\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilbert-base-cased\", {\"use_fast\": False}),\n        },\n    },\n    \"fill-mask\": {\n        \"impl\": FillMaskPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilroberta-base\", \"tf\": \"distilroberta-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilroberta-base\", {\"use_fast\": False}),\n        },\n    
},\n    \"summarization\": {\n        \"impl\": SummarizationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"facebook/bart-large-cnn\", \"tf\": \"t5-small\"}, \"config\": None, \"tokenizer\": None},\n    },\n    \"translation_en_to_fr\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_de\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_ro\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"text-generation\": {\n        \"impl\": TextGenerationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"gpt2\", \"tf\": \"gpt2\"}, \"config\": None, \"tokenizer\": \"gpt2\"},\n    },\n}\n\n\ndef pipeline(\n    task: str,\n    model: Optional = None,\n    config: Optional[Union[str, PretrainedConfig]] = None,\n    tokenizer: Optional[Union[str, PreTrainedTokenizer]] = None,\n    framework: Optional[str] = None,\n    **kwargs\n) -> Pipeline:\n    \"\"\"\n    Utility factory method to build a pipeline.\n\n    Pipeline are made of:\n\n        - A Tokenizer instance in charge of mapping raw textual input to token\n        - A Model instance\n        - Some (optional) post processing for enhancing model's output\n\n\n    Args:\n        task (:obj:`str`):\n            The task defining which pipeline will be returned. Currently accepted tasks are:\n\n            - \"feature-extraction\": will return a :class:`~transformers1.FeatureExtractionPipeline`\n            - \"sentiment-analysis\": will return a :class:`~transformers1.TextClassificationPipeline`\n            - \"ner\": will return a :class:`~transformers1.NerPipeline`\n            - \"question-answering\": will return a :class:`~transformers1.QuestionAnsweringPipeline`\n            - \"fill-mask\": will return a :class:`~transformers1.FillMaskPipeline`\n            - \"summarization\": will return a :class:`~transformers1.SummarizationPipeline`\n            - \"translation_xx_to_yy\": will return a :class:`~transformers1.TranslationPipeline`\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`,\n            a model identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        config (:obj:`str` or :obj:`~transformers1.PretrainedConfig`, `optional`, defaults to :obj:`None`):\n            The configuration that will be used by the pipeline to instantiate the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained model configuration inheriting from\n            :class:`~transformers1.PretrainedConfig`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n\n    Returns:\n        :class:`~transformers.Pipeline`: Class inheriting from :class:`~transformers1.Pipeline`, according to\n        the task.\n\n    Examples::\n\n        from transformers1 import pipeline, AutoModelForTokenClassification, AutoTokenizer\n\n        # Sentiment analysis pipeline\n        pipeline('sentiment-analysis')\n\n        # Question answering pipeline, specifying the checkpoint identifier\n        pipeline('question-answering', model='distilbert-base-cased-distilled-squad', tokenizer='bert-base-cased')\n\n        # Named entity recognition pipeline, passing in a specific model and tokenizer\n        model = AutoModelForTokenClassification.from_pretrained(\"dbmdz/bert-large-cased-finetuned-conll03-english\")\n        tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n        pipeline('ner', model=model, tokenizer=tokenizer)\n    \"\"\"\n    # Retrieve the task\n    if task not in SUPPORTED_TASKS:\n        raise KeyError(\"Unknown task {}, available tasks are {}\".format(task, list(SUPPORTED_TASKS.keys())))\n\n    framework = framework or get_framework(model)\n\n    targeted_task = SUPPORTED_TASKS[task]\n    task_class, model_class = targeted_task[\"impl\"], targeted_task[framework]\n\n    # Use default model/config/tokenizer for the task if no model is provided\n    if model is None:\n        models, config, tokenizer = [targeted_task[\"default\"][k] for k in [\"model\", \"config\", \"tokenizer\"]]\n        model = models[framework]\n\n    # Try to infer tokenizer from model or config name (if provided as str)\n    if tokenizer is None:\n        if isinstance(model, str) and model in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = model\n        elif isinstance(config, str) and config in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = config\n        else:\n            # Impossible to guest what is the right tokenizer here\n            
raise Exception(\n                \"Impossible to guess which tokenizer to use. \"\n                \"Please provided a PretrainedTokenizer class or a path/identifier to a pretrained tokenizer.\"\n            )\n\n    modelcard = None\n    # Try to infer modelcard from model or config name (if provided as str)\n    if isinstance(model, str):\n        modelcard = model\n    elif isinstance(config, str):\n        modelcard = config\n\n    # Instantiate tokenizer if needed\n    if isinstance(tokenizer, (str, tuple)):\n        if isinstance(tokenizer, tuple):\n            # For tuple we have (tokenizer name, {kwargs})\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], **tokenizer[1])\n        else:\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer)\n\n    # Instantiate config if needed\n    if isinstance(config, str):\n        config = AutoConfig.from_pretrained(config)\n\n    # Instantiate modelcard if needed\n    if isinstance(modelcard, str):\n        modelcard = ModelCard.from_pretrained(modelcard)\n\n    # Instantiate model if needed\n    if isinstance(model, str):\n        # Handle transparent TF/PT model conversion\n        model_kwargs = {}\n        if framework == \"pt\" and model.endswith(\".h5\"):\n            model_kwargs[\"from_tf\"] = True\n            logger.warning(\n                \"Model might be a TensorFlow model (ending with `.h5`) but TensorFlow is not available. \"\n                \"Trying to load the model with PyTorch.\"\n            )\n        elif framework == \"tf\" and model.endswith(\".bin\"):\n            model_kwargs[\"from_pt\"] = True\n            logger.warning(\n                \"Model might be a PyTorch model (ending with `.bin`) but PyTorch is not available. \"\n                \"Trying to load the model with Tensorflow.\"\n            )\n        model = model_class.from_pretrained(model, config=config, **model_kwargs)\n\n    return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for ALBERT model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-spiece.model\",\n        \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-spiece.model\",\n        \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-spiece.model\",\n        \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-spiece.model\",\n        \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model\",\n        \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model\",\n        \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model\",\n        \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"albert-base-v1\": 512,\n    \"albert-large-v1\": 512,\n    \"albert-xlarge-v1\": 512,\n    \"albert-xxlarge-v1\": 512,\n    \"albert-base-v2\": 512,\n    \"albert-large-v2\": 512,\n    \"albert-xlarge-v2\": 512,\n    \"albert-xxlarge-v2\": 512,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n\nclass AlbertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an ALBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The beginning of sequence token that was used during pre-training. 
Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"[CLS]\",\n        eos_token=\"[SEP]\",\n        unk_token=\"<unk>\",\n        sep_token=\"[SEP]\",\n        pad_token=\"<pad>\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. 
\"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An ALBERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return cls + token_ids_0 + sep\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An ALBERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Tokenizer class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_bart import BartTokenizer\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer\nfrom .tokenization_marian import MarianTokenizer\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import XLNetTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nTOKENIZER_MAPPING = OrderedDict(\n    [\n        (T5Config, (T5Tokenizer, None)),\n        (DistilBertConfig, (DistilBertTokenizer, DistilBertTokenizerFast)),\n        (AlbertConfig, (AlbertTokenizer, None)),\n        (CamembertConfig, (CamembertTokenizer, None)),\n        (XLMRobertaConfig, (XLMRobertaTokenizer, None)),\n        (MarianConfig, (MarianTokenizer, None)),\n        (BartConfig, (BartTokenizer, None)),\n        (LongformerConfig, (LongformerTokenizer, None)),\n        (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),\n        (ReformerConfig, (ReformerTokenizer, None)),\n        (ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),\n        (BertConfig, (BertTokenizer, BertTokenizerFast)),\n        (OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),\n        (GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),\n        (TransfoXLConfig, (TransfoXLTokenizer, TransfoXLTokenizerFast)),\n        (XLNetConfig, (XLNetTokenizer, None)),\n        (FlaubertConfig, (FlaubertTokenizer, None)),\n        (XLMConfig, (XLMTokenizer, 
None)),\n        (CTRLConfig, (CTRLTokenizer, None)),\n    ]\n)\n\n\nclass AutoTokenizer:\n    r\"\"\":class:`~transformers1.AutoTokenizer` is a generic tokenizer class\n        that will be instantiated as one of the tokenizer classes of the library\n        when created with the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct tokenizer class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        This class cannot be instantiated using `__init__()` (throw an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoTokenizer is designed to be instantiated \"\n            \"using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):\n        r\"\"\" Instantiate one of the tokenizer classes of the library\n        from a pre-trained model vocabulary.\n\n        The tokenizer class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert-base-japanese`: BertJapaneseTokenizer (Bert model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n      
          - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            use_fast: (`optional`) boolean, default False:\n                Indicate if transformers1 should try to load the fast version of the tokenizer (True) or use the Python one (False).\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. 
tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        if \"bert-base-japanese\" in pretrained_model_name_or_path:\n            return BertJapaneseTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        use_fast = kwargs.pop(\"use_fast\", False)\n        for config_class, (tokenizer_class_py, tokenizer_class_fast) in TOKENIZER_MAPPING.items():\n            if isinstance(config, config_class):\n                if tokenizer_class_fast and use_fast:\n                    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n                else:\n                    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} to build an AutoTokenizer.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, \", \".join(c.__name__ for c in TOKENIZER_MAPPING.keys())\n            )\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_bart_models = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n]\n\n\nclass BartTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = {m: 1024 for m in _all_bart_models}\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_bart_models},\n        \"merges_file\": {m: merges_url for m in _all_bart_models},\n    }\n\n\n_all_mbart_models = [\"facebook/mbart-large-en-ro\"]\nSPM_URL = \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model\"\n\n\nclass MBartTokenizer(XLMRobertaTokenizer):\n    vocab_files_names = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n    max_model_input_sizes = {m: 1024 for m in _all_mbart_models}\n    pretrained_vocab_files_map = {\"vocab_file\": {m: SPM_URL for m in _all_mbart_models}}\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import List, Optional\n\nfrom tokenizers import BertWordPieceTokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt\",\n        \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n        \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt\",\n        \"bert-base-german-cased\": \"https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt\",\n        \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt\",\n        \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt\",\n        \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/vocab.txt\",\n        
\"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"bert-base-uncased\": 512,\n    \"bert-large-uncased\": 512,\n    \"bert-base-cased\": 512,\n    \"bert-large-cased\": 512,\n    \"bert-base-multilingual-uncased\": 512,\n    \"bert-base-multilingual-cased\": 512,\n    \"bert-base-chinese\": 512,\n    \"bert-base-german-cased\": 512,\n    \"bert-large-uncased-whole-word-masking\": 512,\n    \"bert-large-cased-whole-word-masking\": 512,\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-base-cased-finetuned-mrpc\": 512,\n    \"bert-base-german-dbmdz-cased\": 512,\n    \"bert-base-german-dbmdz-uncased\": 512,\n    \"TurkuNLP/bert-base-finnish-cased-v1\": 512,\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": 512,\n    \"wietsedv/bert-base-dutch-cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"bert-base-uncased\": {\"do_lower_case\": True},\n    \"bert-large-uncased\": {\"do_lower_case\": True},\n    \"bert-base-cased\": {\"do_lower_case\": False},\n    \"bert-large-cased\": {\"do_lower_case\": False},\n    \"bert-base-multilingual-uncased\": {\"do_lower_case\": True},\n    \"bert-base-multilingual-cased\": {\"do_lower_case\": False},\n    \"bert-base-chinese\": {\"do_lower_case\": False},\n    \"bert-base-german-cased\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": False},\n    \"bert-base-cased-finetuned-mrpc\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-cased\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-uncased\": {\"do_lower_case\": True},\n    \"TurkuNLP/bert-base-finnish-cased-v1\": {\"do_lower_case\": False},\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": {\"do_lower_case\": True},\n    \"wietsedv/bert-base-dutch-cased\": {\"do_lower_case\": False},\n}\n\n\ndef load_vocab(vocab_file):\n    \"\"\"Loads a vocabulary file into a dictionary.\"\"\"\n    vocab = collections.OrderedDict()\n    with open(vocab_file, \"r\", encoding=\"utf-8\") as reader:\n        tokens = reader.readlines()\n    for index, token in enumerate(tokens):\n        token = token.rstrip(\"\\n\")\n        vocab[token] = index\n    return vocab\n\n\ndef whitespace_tokenize(text):\n    \"\"\"Runs basic whitespace cleaning and splitting on a piece of text.\"\"\"\n    text = text.strip()\n    if not text:\n        return []\n    tokens = text.split()\n    return tokens\n\n\nclass BertTokenizer(PreTrainedTokenizer):\n    r\"\"\"\n    Constructs a BERT tokenizer. Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to do basic tokenization before WordPiece.\n        never_split (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            List of tokens which will never be split during tokenization. Only has an effect when\n            :obj:`do_basic_tokenize=True`\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        do_basic_tokenize=True,\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        tokenize_chinese_chars=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n        self.do_basic_tokenize = do_basic_tokenize\n        if do_basic_tokenize:\n            self.basic_tokenizer = BasicTokenizer(\n                do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=tokenize_chinese_chars\n            )\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n\n    @property\n    def vocab_size(self):\n        return len(self.vocab)\n\n    def get_vocab(self):\n        return dict(self.vocab, **self.added_tokens_encoder)\n\n    def _tokenize(self, text):\n        split_tokens = []\n        if self.do_basic_tokenize:\n            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):\n                for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                    split_tokens.append(sub_token)\n        else:\n            split_tokens = self.wordpiece_tokenizer.tokenize(text)\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.vocab.get(token, self.vocab.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.ids_to_tokens.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\" ##\", \"\").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A BERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        index = 0\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as writer:\n            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: vocabulary indices are not consecutive.\"\n                        \" Please check that the vocabulary is not corrupted!\".format(vocab_file)\n                    )\n         
           index = token_index\n                writer.write(token + \"\\n\")\n                index += 1\n        return (vocab_file,)\n\n\nclass BasicTokenizer(object):\n    \"\"\"Runs basic tokenization (punctuation splitting, lower casing, etc.).\"\"\"\n\n    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True):\n        \"\"\" Constructs a BasicTokenizer.\n\n        Args:\n            **do_lower_case**: Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **tokenize_chinese_chars**: (`optional`) boolean (default True)\n                Whether to tokenize Chinese characters.\n                This should likely be deactivated for Japanese:\n                see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328\n        \"\"\"\n        if never_split is None:\n            never_split = []\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split\n        self.tokenize_chinese_chars = tokenize_chinese_chars\n\n    def tokenize(self, text, never_split=None):\n        \"\"\" Basic Tokenization of a piece of text.\n            Split on \"white spaces\" only, for sub-word tokenization, see WordPieceTokenizer.\n\n        Args:\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n        \"\"\"\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        text = self._clean_text(text)\n        # This was added on November 1st, 2018 for the multilingual and Chinese\n        # models. 
This is also applied to the English models now, but it doesn't\n        # matter since the English models were not trained on any Chinese data\n        # and generally don't have any Chinese data in them (there are Chinese\n        # characters in the vocabulary because Wikipedia does have some Chinese\n        # words in the English Wikipedia.).\n        if self.tokenize_chinese_chars:\n            text = self._tokenize_chinese_chars(text)\n        orig_tokens = whitespace_tokenize(text)\n        split_tokens = []\n        for token in orig_tokens:\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n                token = self._run_strip_accents(token)\n            split_tokens.extend(self._run_split_on_punc(token, never_split))\n\n        output_tokens = whitespace_tokenize(\" \".join(split_tokens))\n        return output_tokens\n\n    def _run_strip_accents(self, text):\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        text = unicodedata.normalize(\"NFD\", text)\n        output = []\n        for char in text:\n            cat = unicodedata.category(char)\n            if cat == \"Mn\":\n                continue\n            output.append(char)\n        return \"\".join(output)\n\n    def _run_split_on_punc(self, text, never_split=None):\n        \"\"\"Splits punctuation on a piece of text.\"\"\"\n        if never_split is not None and text in never_split:\n            return [text]\n        chars = list(text)\n        i = 0\n        start_new_word = True\n        output = []\n        while i < len(chars):\n            char = chars[i]\n            if _is_punctuation(char):\n                output.append([char])\n                start_new_word = True\n            else:\n                if start_new_word:\n                    output.append([])\n                start_new_word = False\n                output[-1].append(char)\n            i += 1\n\n        return [\"\".join(x) for x in output]\n\n    def _tokenize_chinese_chars(self, text):\n        \"\"\"Adds whitespace around any CJK character.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if self._is_chinese_char(cp):\n                output.append(\" \")\n                output.append(char)\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n    def _is_chinese_char(self, cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. 
Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if (\n            (cp >= 0x4E00 and cp <= 0x9FFF)\n            or (cp >= 0x3400 and cp <= 0x4DBF)  #\n            or (cp >= 0x20000 and cp <= 0x2A6DF)  #\n            or (cp >= 0x2A700 and cp <= 0x2B73F)  #\n            or (cp >= 0x2B740 and cp <= 0x2B81F)  #\n            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #\n            or (cp >= 0xF900 and cp <= 0xFAFF)\n            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #\n        ):  #\n            return True\n\n        return False\n\n    def _clean_text(self, text):\n        \"\"\"Performs invalid character removal and whitespace cleanup on text.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if cp == 0 or cp == 0xFFFD or _is_control(char):\n                continue\n            if _is_whitespace(char):\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n\nclass WordpieceTokenizer(object):\n    \"\"\"Runs WordPiece tokenization.\"\"\"\n\n    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.max_input_chars_per_word = max_input_chars_per_word\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into its word pieces.\n\n        This uses a greedy longest-match-first algorithm to perform tokenization\n        using the given vocabulary.\n\n        For example:\n          input = \"unaffable\"\n          output = [\"un\", \"##aff\", \"##able\"]\n\n        Args:\n          text: A single token or whitespace separated tokens. 
This should have\n            already been passed through `BasicTokenizer`.\n\n        Returns:\n          A list of wordpiece tokens.\n        \"\"\"\n\n        output_tokens = []\n        for token in whitespace_tokenize(text):\n            chars = list(token)\n            if len(chars) > self.max_input_chars_per_word:\n                output_tokens.append(self.unk_token)\n                continue\n\n            is_bad = False\n            start = 0\n            sub_tokens = []\n            while start < len(chars):\n                end = len(chars)\n                cur_substr = None\n                while start < end:\n                    substr = \"\".join(chars[start:end])\n                    if start > 0:\n                        substr = \"##\" + substr\n                    if substr in self.vocab:\n                        cur_substr = substr\n                        break\n                    end -= 1\n                if cur_substr is None:\n                    is_bad = True\n                    break\n                sub_tokens.append(cur_substr)\n                start = end\n\n            if is_bad:\n                output_tokens.append(self.unk_token)\n            else:\n                output_tokens.extend(sub_tokens)\n        return output_tokens\n\n\ndef _is_whitespace(char):\n    \"\"\"Checks whether `chars` is a whitespace character.\"\"\"\n    # \\t, \\n, and \\r are technically contorl characters but we treat them\n    # as whitespace since they are generally considered as such.\n    if char == \" \" or char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return True\n    cat = unicodedata.category(char)\n    if cat == \"Zs\":\n        return True\n    return False\n\n\ndef _is_control(char):\n    \"\"\"Checks whether `chars` is a control character.\"\"\"\n    # These are technically control characters but we count them as whitespace\n    # characters.\n    if char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return False\n    cat = unicodedata.category(char)\n    if cat.startswith(\"C\"):\n        return True\n    return False\n\n\ndef _is_punctuation(char):\n    \"\"\"Checks whether `chars` is a punctuation character.\"\"\"\n    cp = ord(char)\n    # We treat all non-letter/number ASCII as punctuation.\n    # Characters such as \"^\", \"$\", and \"`\" are not in the Unicode\n    # Punctuation class but we treat them as punctuation anyways, for\n    # consistency.\n    if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):\n        return True\n    cat = unicodedata.category(char)\n    if cat.startswith(\"P\"):\n        return True\n    return False\n\n\nclass BertTokenizerFast(PreTrainedTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" BERT tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Bert tokenization is Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n        clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to clean the text before tokenization by removing any control characters and\n            replacing all whitespaces by the classic one.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        clean_text=True,\n        tokenize_chinese_chars=True,\n        strip_accents=True,\n        wordpieces_prefix=\"##\",\n        **kwargs\n    ):\n        super().__init__(\n            BertWordPieceTokenizer(\n                vocab_file=vocab_file,\n                unk_token=unk_token,\n                sep_token=sep_token,\n                cls_token=cls_token,\n                clean_text=clean_text,\n                handle_chinese_chars=tokenize_chinese_chars,\n                strip_accents=strip_accents,\n                lowercase=do_lower_case,\n                wordpieces_prefix=wordpieces_prefix,\n            ),\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        self.do_lower_case = do_lower_case\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.cls_token_id] + token_ids_0 + 
[self.sep_token_id]\n\n        if token_ids_1:\n            output += token_ids_1 + [self.sep_token_id]\n\n        return output\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_bert_japanese.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import Optional\n\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer, load_vocab\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"cl-tohoku/bert-base-japanese\": 512,\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": 512,\n    \"cl-tohoku/bert-base-japanese-char\": 512,\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"cl-tohoku/bert-base-japanese\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-char\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n}\n\n\nclass BertJapaneseTokenizer(BertTokenizer):\n    \"\"\"BERT tokenizer for Japanese text\"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        do_word_tokenize=True,\n        do_subword_tokenize=True,\n        word_tokenizer_type=\"basic\",\n        subword_tokenizer_type=\"wordpiece\",\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        
mask_token=\"[MASK]\",\n        mecab_kwargs=None,\n        **kwargs\n    ):\n        \"\"\"Constructs a MecabBertTokenizer.\n\n        Args:\n            **vocab_file**: Path to a one-wordpiece-per-line vocabulary file.\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n                Only has an effect when do_basic_tokenize=True.\n            **do_word_tokenize**: (`optional`) boolean (default True)\n                Whether to do word tokenization.\n            **do_subword_tokenize**: (`optional`) boolean (default True)\n                Whether to do subword tokenization.\n            **word_tokenizer_type**: (`optional`) string (default \"basic\")\n                Type of word tokenizer.\n            **subword_tokenizer_type**: (`optional`) string (default \"wordpiece\")\n                Type of subword tokenizer.\n            **mecab_kwargs**: (`optional`) dict passed to `MecabTokenizer` constructor (default None)\n        \"\"\"\n        super(BertTokenizer, self).__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n        # ^^ We call the grandparent's init, not the parent's.\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n\n        self.do_word_tokenize = do_word_tokenize\n        if do_word_tokenize:\n            if word_tokenizer_type == \"basic\":\n                self.word_tokenizer = BasicTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=False\n                )\n            elif word_tokenizer_type == \"mecab\":\n                self.word_tokenizer = MecabTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})\n                )\n            else:\n                raise ValueError(\"Invalid word_tokenizer_type '{}' is specified.\".format(word_tokenizer_type))\n\n        self.do_subword_tokenize = do_subword_tokenize\n        if do_subword_tokenize:\n            if subword_tokenizer_type == \"wordpiece\":\n                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            elif subword_tokenizer_type == \"character\":\n                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            else:\n                raise ValueError(\"Invalid subword_tokenizer_type '{}' is specified.\".format(subword_tokenizer_type))\n\n    def _tokenize(self, text):\n        if self.do_word_tokenize:\n            tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)\n        else:\n            tokens = [text]\n\n        if self.do_subword_tokenize:\n            split_tokens = [sub_token for token in tokens for sub_token in self.subword_tokenizer.tokenize(token)]\n        else:\n            split_tokens = tokens\n\n        return split_tokens\n\n\nclass MecabTokenizer:\n    \"\"\"Runs basic tokenization with MeCab 
morphological parser.\"\"\"\n\n    def __init__(self, do_lower_case=False, never_split=None, normalize_text=True, mecab_option: Optional[str] = None):\n        \"\"\"Constructs a MecabTokenizer.\n\n        Args:\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n            **mecab_option**: (`optional`) string passed to `MeCab.Tagger` constructor (default \"\")\n        \"\"\"\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split if never_split is not None else []\n        self.normalize_text = normalize_text\n\n        import MeCab\n\n        self.mecab = MeCab.Tagger(mecab_option) if mecab_option is not None else MeCab.Tagger()\n\n    def tokenize(self, text, never_split=None, **kwargs):\n        \"\"\"Tokenizes a piece of text.\"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        tokens = []\n\n        mecab_output = self.mecab.parse(text)\n\n        cursor = 0\n        for line in mecab_output.split(\"\\n\"):\n            if line == \"EOS\":\n                break\n\n            token, _ = line.split(\"\\t\")\n            token_start = text.index(token, cursor)\n            token_end = token_start + len(token)\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n\n            tokens.append(token)\n            cursor = token_end\n\n        return tokens\n\n\nclass CharacterTokenizer(object):\n    \"\"\"Runs Character tokenziation.\"\"\"\n\n    def __init__(self, vocab, unk_token, normalize_text=True):\n        \"\"\"Constructs a CharacterTokenizer.\n\n        Args:\n            **vocab**:\n                Vocabulary object.\n            **unk_token**: str\n                A special symbol for out-of-vocabulary token.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n        \"\"\"\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.normalize_text = normalize_text\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into characters.\n\n        For example:\n            input = \"apple\"\n            output = [\"a\", \"p\", \"p\", \"l\", \"e\"]\n        Args:\n            text: A single token or whitespace separated tokens.\n                This should have already been passed through `BasicTokenizer`.\n        Returns:\n            A list of characters.\n        \"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        output_tokens = []\n        for i, char in enumerate(text):\n            if char not in self.vocab:\n                output_tokens.append(self.unk_token)\n                continue\n\n            output_tokens.append(char)\n\n        return output_tokens\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for Camembert model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nimport sentencepiece as spm\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"camembert-base\": None,\n}\n\nSHARED_MODEL_IDENTIFIERS = [\n    # Load with\n    # `tokenizer = AutoTokenizer.from_pretrained(\"username/pretrained_model\")`\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n]\n\n\nclass CamembertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). 
It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<s>NOTUSED\", \"</s>NOTUSED\"],\n        **kwargs\n    ):\n        super().__init__(\n            max_len=512,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n        # HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual\n        # sentencepiece vocabulary (this is the case for <s> and </s>\n        self.fairseq_tokens_to_ids = {\"<s>NOTUSED\": 0, \"<pad>\": 1, \"</s>NOTUSED\": 2, \"<unk>\": 3}\n        self.fairseq_offset = len(self.fairseq_tokens_to_ids)\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + len(self.fairseq_tokens_to_ids)\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A CamemBERT sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        CamemBERT, like RoBERTa, does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.fairseq_tokens_to_ids) + len(self.sp_model)\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        elif self.sp_model.PieceToId(token) == 0:\n            # Convert sentence piece unk token to fairseq unk token index\n            return self.unk_token_id\n        return self.fairseq_offset + self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Salesforce CTRL.\"\"\"\n\n\nimport json\nimport logging\nimport os\n\nimport regex as re\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json\"},\n    \"merges_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"ctrl\": 256,\n}\n\nCONTROL_CODES = {\n    \"Pregnancy\": 168629,\n    \"Christianity\": 7675,\n    \"Explain\": 106423,\n    \"Fitness\": 63440,\n    \"Saving\": 63163,\n    \"Ask\": 27171,\n    \"Ass\": 95985,\n    \"Joke\": 163509,\n    \"Questions\": 45622,\n    \"Thoughts\": 49605,\n    \"Retail\": 52342,\n    \"Feminism\": 164338,\n    \"Writing\": 11992,\n    \"Atheism\": 192263,\n    \"Netflix\": 48616,\n    \"Computing\": 39639,\n    \"Opinion\": 43213,\n    \"Alone\": 44967,\n    \"Funny\": 58917,\n    \"Gaming\": 40358,\n    \"Human\": 4088,\n    \"India\": 1331,\n    \"Joker\": 77138,\n    \"Diet\": 36206,\n    \"Legal\": 11859,\n    \"Norman\": 4939,\n    \"Tip\": 72689,\n    \"Weight\": 52343,\n    \"Movies\": 46273,\n    \"Running\": 23425,\n    \"Science\": 2090,\n    \"Horror\": 37793,\n    \"Confession\": 60572,\n    \"Finance\": 12250,\n    \"Politics\": 16360,\n    \"Scary\": 191985,\n    \"Support\": 12654,\n    \"Technologies\": 32516,\n    \"Teenage\": 66160,\n    \"Event\": 32769,\n    \"Learned\": 67460,\n    \"Notion\": 182770,\n    \"Wikipedia\": 37583,\n    \"Books\": 6665,\n    \"Extract\": 76050,\n    \"Confessions\": 102701,\n    \"Conspiracy\": 75932,\n    \"Links\": 63674,\n    \"Narcissus\": 150425,\n    \"Relationship\": 54766,\n    \"Relationships\": 134796,\n    \"Reviews\": 41671,\n    \"News\": 4256,\n    \"Translation\": 26820,\n    \"multilingual\": 128406,\n}\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n\n    pairs = set(pairs)\n    return pairs\n\n\nclass CTRLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs a CTRL tokenizer. Peculiarities:\n\n    - Byte-Pair-Encoding\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    control_codes = CONTROL_CODES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        word = tuple(list(word[:-1]) + [word[-1] + \"</w>\"])\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \"@@ \".join(word)\n        word = word[:-4]\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string.\n        \"\"\"\n        split_tokens = []\n\n        words = re.findall(r\"\\S+\\n?\", text)\n\n        for token in words:\n            split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\"@@ \", \"\").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    # def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):\n    #     filtered_tokens = ' '.join(self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens))\n    #     tokens_generated_so_far = re.sub('(@@ )', '', string=filtered_tokens)\n    #     tokens_generated_so_far = re.sub('(@@ ?$)', '', string=tokens_generated_so_far)\n    #     return ''.join(tokens_generated_so_far)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for DistilBERT.\"\"\"\n\n\nimport logging\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-vocab.txt\",\n        \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"distilbert-base-uncased\": 512,\n    \"distilbert-base-uncased-distilled-squad\": 512,\n    \"distilbert-base-cased\": 512,\n    \"distilbert-base-cased-distilled-squad\": 512,\n    \"distilbert-base-german-cased\": 512,\n    \"distilbert-base-multilingual-cased\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"distilbert-base-uncased\": {\"do_lower_case\": True},\n    \"distilbert-base-uncased-distilled-squad\": {\"do_lower_case\": True},\n    \"distilbert-base-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-cased-distilled-squad\": {\"do_lower_case\": False},\n    \"distilbert-base-german-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-multilingual-cased\": {\"do_lower_case\": False},\n}\n\n\nclass DistilBertTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs a  DistilBertTokenizer.\n\n    :class:`~transformers1.DistilBertTokenizer is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n\n\nclass DistilBertTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a  \"Fast\" DistilBertTokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.DistilBertTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: 
punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_electra.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Google AI Team, Stanford University and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/vocab.txt\",\n        \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/vocab.txt\",\n        \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/vocab.txt\",\n        \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/vocab.txt\",\n        \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/vocab.txt\",\n        \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/electra-small-generator\": 512,\n    \"google/electra-base-generator\": 512,\n    \"google/electra-large-generator\": 512,\n    \"google/electra-small-discriminator\": 512,\n    \"google/electra-base-discriminator\": 512,\n    \"google/electra-large-discriminator\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"google/electra-small-generator\": {\"do_lower_case\": True},\n    \"google/electra-base-generator\": {\"do_lower_case\": True},\n    \"google/electra-large-generator\": {\"do_lower_case\": True},\n    \"google/electra-small-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-base-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-large-discriminator\": {\"do_lower_case\": True},\n}\n\n\nclass ElectraTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs an Electra tokenizer.\n    :class:`~transformers1.ElectraTokenizer` is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n\n\nclass ElectraTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" Electra Fast tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.ElectraTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    
Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Flaubert, based on XLM.\"\"\"\n\n\nimport logging\nimport unicodedata\n\nimport six\n\nfrom .tokenization_xlm import XLMTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/vocab.json\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/vocab.json\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/vocab.json\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/vocab.json\",\n    },\n    \"merges_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/merges.txt\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/merges.txt\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/merges.txt\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"flaubert/flaubert_small_cased\": 512,\n    \"flaubert/flaubert_base_uncased\": 512,\n    \"flaubert/flaubert_base_cased\": 512,\n    \"flaubert/flaubert_large_cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"flaubert/flaubert_small_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_base_uncased\": {\"do_lowercase\": True},\n    \"flaubert/flaubert_base_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_large_cased\": {\"do_lowercase\": False},\n}\n\n\ndef convert_to_unicode(text):\n    \"\"\"\n    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.\n    \"\"\"\n    # six_ensure_text is copied from https://github.com/benjaminp/six\n    def six_ensure_text(s, encoding=\"utf-8\", errors=\"strict\"):\n        if isinstance(s, six.binary_type):\n            return s.decode(encoding, errors)\n        elif isinstance(s, six.text_type):\n            return s\n        else:\n            raise TypeError(\"not expecting type '%s'\" % type(s))\n\n    return six_ensure_text(text, encoding=\"utf-8\", errors=\"ignore\")\n\n\nclass FlaubertTokenizer(XLMTokenizer):\n    \"\"\"\n    BPE tokenizer for Flaubert\n\n    - Moses preprocessing & tokenization\n    - Normalize all inputs text\n    - argument ``special_tokens`` and function 
``set_special_tokens``, can be used to add additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `do_lowercase` controle lower casing (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.XLMTokenizer`. Please check the superclass for usage examples\n    and documentation regarding arguments.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, do_lowercase=False, **kwargs):\n        super().__init__(**kwargs)\n        self.do_lowercase = do_lowercase\n        self.do_lowercase_and_remove_accent = False\n\n    def preprocess_text(self, text):\n        text = text.replace(\"``\", '\"').replace(\"''\", '\"')\n        text = convert_to_unicode(text)\n        text = unicodedata.normalize(\"NFC\", text)\n\n        if self.do_lowercase:\n            text = text.lower()\n\n        return text\n\n    def _tokenize(self, text, bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code using Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n\n        Args:\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        lang = \"fr\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n\n        if bypass_tokenizer:\n            text = text.split()\n        else:\n            text = self.preprocess_text(text)\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.moses_tokenize(text, lang=lang)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nfrom functools import lru_cache\n\nimport regex as re\nfrom tokenizers import ByteLevelBPETokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json\",\n    },\n    \"merges_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"gpt2\": 1024,\n    \"gpt2-medium\": 1024,\n    \"gpt2-large\": 1024,\n    \"gpt2-xl\": 1024,\n    \"distilgpt2\": 1024,\n}\n\n\n@lru_cache()\ndef bytes_to_unicode():\n    \"\"\"\n    Returns list of utf-8 byte and a mapping to unicode strings.\n    We specifically avoids mapping to whitespace/control characters the bpe code barfs on.\n\n    The reversible bpe codes work on unicode strings.\n    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.\n    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.\n    This is a signficant percentage of your normal, say, 32K bpe vocab.\n    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.\n    \"\"\"\n    bs = (\n        list(range(ord(\"!\"), ord(\"~\") + 1)) + list(range(ord(\"¡\"), ord(\"¬\") + 1)) + list(range(ord(\"®\"), ord(\"ÿ\") + 1))\n    )\n    cs = bs[:]\n    n = 0\n    for b in range(2 ** 8):\n        if b not in bs:\n            bs.append(b)\n            cs.append(2 ** 8 + n)\n            n += 1\n    cs = [chr(n) for n in cs]\n    return dict(zip(bs, cs))\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    
\"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\nclass GPT2Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n    GPT-2 BPE tokenizer. Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        **kwargs\n    ):\n        super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        self.errors = errors  # how to handle errors in decoding\n        self.byte_encoder = bytes_to_unicode()\n        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            bpe_merges = merges_handle.read().split(\"\\n\")[1:-1]\n        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]\n        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))\n        self.cache = {}\n\n        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions\n        self.pat = re.compile(r\"\"\"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+\"\"\")\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        pairs = get_pairs(word)\n\n        if not 
pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. \"\"\"\n        bpe_tokens = []\n        for token in re.findall(self.pat, text):\n            token = \"\".join(\n                self.byte_encoder[b] for b in token.encode(\"utf-8\")\n            )  # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)\n            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(\" \"))\n        return bpe_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        text = \"\".join(tokens)\n        text = bytearray([self.byte_decoder[c] for c in text]).decode(\"utf-8\", errors=self.errors)\n        return text\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        if \"add_prefix_space\" in kwargs and kwargs[\"add_prefix_space\"]:\n            return \" \" + text\n        return text\n\n\nclass GPT2TokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        add_prefix_space=False,\n        trim_offsets=True,\n        **kwargs\n    ):\n        super().__init__(\n            ByteLevelBPETokenizer(\n                vocab_file=vocab_file,\n                merges_file=merges_file,\n                add_prefix_space=add_prefix_space,\n                trim_offsets=trim_offsets,\n            ),\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_longformer_models = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n]\n\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"allenai/longformer-base-4096\": 4096,\n    \"allenai/longformer-large-4096\": 4096,\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": 4096,\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": 4096,\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": 4096,\n}\n\n\nclass LongformerTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n\n\nclass LongformerTokenizerFast(RobertaTokenizerFast):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_marian.py",
    "content": "import json\nimport re\nimport warnings\nfrom pathlib import Path\nfrom shutil import copyfile\nfrom typing import Dict, List, Optional, Tuple, Union\n\nimport sentencepiece\n\nfrom .file_utils import S3_BUCKET_PREFIX\nfrom .tokenization_utils import BatchEncoding, PreTrainedTokenizer\n\n\nvocab_files_names = {\n    \"source_spm\": \"source.spm\",\n    \"target_spm\": \"target.spm\",\n    \"vocab\": \"vocab.json\",\n    \"tokenizer_config_file\": \"tokenizer_config.json\",\n}\nMODEL_NAMES = (\"opus-mt-en-de\",)  # TODO(SS): delete this, the only required constant is vocab_files_names\nPRETRAINED_VOCAB_FILES_MAP = {\n    k: {m: f\"{S3_BUCKET_PREFIX}/Helsinki-NLP/{m}/{fname}\" for m in MODEL_NAMES}\n    for k, fname in vocab_files_names.items()\n}\n# Example URL https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/vocab.json\n\n\nclass MarianTokenizer(PreTrainedTokenizer):\n    \"\"\"Sentencepiece tokenizer for marian. Source and target languages have different SPM models.\n    The logic is use the relevant source_spm or target_spm to encode txt as pieces, then look up each piece in a vocab dictionary.\n\n    Examples::\n\n        from transformers1 import MarianTokenizer\n        tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')\n        src_texts = [ \"I am a small frog.\", \"Tom asked his teacher for advice.\"]\n        tgt_texts = [\"Ich bin ein kleiner Frosch.\", \"Tom bat seinen Lehrer um Rat.\"]  # optional\n        batch_enc: BatchEncoding = tok.prepare_translation_batch(src_texts, tgt_texts=tgt_texts)\n        # keys  [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask].\n        # model(**batch) should work\n    \"\"\"\n\n    vocab_files_names = vocab_files_names\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = {m: 512 for m in MODEL_NAMES}\n    model_input_names = [\"attention_mask\"]  # actually attention_mask, decoder_attention_mask\n    language_code_re = re.compile(\">>.+<<\")  # type: re.Pattern\n\n    def __init__(\n        self,\n        vocab=None,\n        source_spm=None,\n        target_spm=None,\n        source_lang=None,\n        target_lang=None,\n        unk_token=\"<unk>\",\n        eos_token=\"</s>\",\n        pad_token=\"<pad>\",\n        max_len=512,\n        **kwargs,\n    ):\n\n        super().__init__(\n            # bos_token=bos_token,  unused. 
Start decoding with config.decoder_start_token_id\n            max_len=max_len,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            **kwargs,\n        )\n        self.encoder = load_json(vocab)\n        if self.unk_token not in self.encoder:\n            raise KeyError(\"<unk> token must be in vocab\")\n        assert self.pad_token in self.encoder\n        self.decoder = {v: k for k, v in self.encoder.items()}\n\n        self.source_lang = source_lang\n        self.target_lang = target_lang\n        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(\">>\") and k.endswith(\"<<\")]\n        self.spm_files = [source_spm, target_spm]\n\n        # load SentencePiece model for pre-processing\n        self.spm_source = load_spm(source_spm)\n        self.spm_target = load_spm(target_spm)\n        self.current_spm = self.spm_source\n\n        # Multilingual target side: default to using first supported language code.\n\n        self._setup_normalizer()\n\n    def _setup_normalizer(self):\n        try:\n            from mosestokenizer import MosesPunctuationNormalizer\n\n            self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)\n        except ImportError:\n            warnings.warn(\"Recommended: pip install mosestokenizer\")\n            self.punc_normalizer = lambda x: x\n\n    def normalize(self, x: str) -> str:\n        \"\"\"Cover moses empty string edge case. They return empty list for '' input!\"\"\"\n        return self.punc_normalizer(x) if x else \"\"\n\n    def _convert_token_to_id(self, token):\n        return self.encoder.get(token, self.encoder[self.unk_token])\n\n    def remove_language_code(self, text: str):\n        \"\"\"Remove language codes like <<fr>> before sentencepiece\"\"\"\n        match = self.language_code_re.match(text)\n        code: list = [match.group(0)] if match else []\n        return code, self.language_code_re.sub(\"\", text)\n\n    def _tokenize(self, text: str) -> List[str]:\n        code, text = self.remove_language_code(text)\n        pieces = self.current_spm.EncodeAsPieces(text)\n        return code + pieces\n\n    def _convert_id_to_token(self, index: int) -> str:\n        \"\"\"Converts an index (integer) in a token (str) using the encoder.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\"Uses target language sentencepiece model\"\"\"\n        return self.spm_target.DecodePieces(tokens)\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:\n        \"\"\"Build model inputs from a sequence by appending eos_token_id.\"\"\"\n        if token_ids_1 is None:\n            return token_ids_0 + [self.eos_token_id]\n        # We don't expect to process pairs, but leave the pair logic for API consistency\n        return token_ids_0 + token_ids_1 + [self.eos_token_id]\n\n    def prepare_translation_batch(\n        self,\n        src_texts: List[str],\n        tgt_texts: Optional[List[str]] = None,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = True,\n        return_tensors: str = \"pt\",\n    ) -> BatchEncoding:\n        \"\"\"Prepare model inputs for translation. 
For best performance, translate one sentence at a time.\n        Arguments:\n            src_texts: list of src language texts\n            tgt_texts: list of tgt language texts\n            max_length: (None) defer to config (1024 for mbart-large-en-ro)\n            pad_to_max_length: (bool)\n            return_tensors: (str) default \"pt\" returns pytorch tensors, pass None to return lists.\n\n        Returns:\n            BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask]\n            all shaped bs, seq_len. (BatchEncoding is a dict of string -> tensor or lists).\n            If no tgt_text is specified, the only keys will be input_ids and attention_mask.\n        \"\"\"\n        if \"\" in src_texts:\n            raise ValueError(f\"found empty string in src_texts: {src_texts}\")\n        self.current_spm = self.spm_source\n        src_texts = [self.normalize(t) for t in src_texts]  # this does not appear to do much\n        model_inputs: BatchEncoding = self.batch_encode_plus(\n            src_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        if tgt_texts is None:\n            return model_inputs\n\n        self.current_spm = self.spm_target\n        decoder_inputs: BatchEncoding = self.batch_encode_plus(\n            tgt_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        for k, v in decoder_inputs.items():\n            model_inputs[f\"decoder_{k}\"] = v\n        self.current_spm = self.spm_source\n        return model_inputs\n\n    @property\n    def vocab_size(self) -> int:\n        return len(self.encoder)\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        \"\"\"save vocab file to json and copy spm files from their original path.\"\"\"\n        save_dir = Path(save_directory)\n        assert save_dir.is_dir(), f\"{save_directory} should be a directory\"\n        save_json(self.encoder, save_dir / self.vocab_files_names[\"vocab\"])\n\n        for f in self.spm_files:\n            dest_path = save_dir / Path(f).name\n            if not dest_path.exists():\n                copyfile(f, save_dir / Path(f).name)\n        return tuple(save_dir / f for f in self.vocab_files_names)\n\n    def get_vocab(self) -> Dict:\n        vocab = self.encoder.copy()\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self) -> Dict:\n        state = self.__dict__.copy()\n        state.update({k: None for k in [\"spm_source\", \"spm_target\", \"current_spm\", \"punc_normalizer\"]})\n        return state\n\n    def __setstate__(self, d: Dict) -> None:\n        self.__dict__ = d\n        self.spm_source, self.spm_target = (load_spm(f) for f in self.spm_files)\n        self.current_spm = self.spm_source\n        self._setup_normalizer()\n\n    def num_special_tokens_to_add(self, **unused):\n        \"\"\"Just EOS\"\"\"\n        return 1\n\n    def _special_token_mask(self, seq):\n        all_special_ids = set(self.all_special_ids)  # call it once instead of inside list comp\n        all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special\n        return [1 if x in all_special_ids else 0 for x in seq]\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: 
Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"Get list where entries are [1] if a token is [eos] or [pad] else 0.\"\"\"\n        if already_has_special_tokens:\n            return self._special_token_mask(token_ids_0)\n        elif token_ids_1 is None:\n            return self._special_token_mask(token_ids_0) + [1]\n        else:\n            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]\n\n\ndef load_spm(path: str) -> sentencepiece.SentencePieceProcessor:\n    spm = sentencepiece.SentencePieceProcessor()\n    spm.Load(path)\n    return spm\n\n\ndef save_json(data, path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(data, f, indent=2)\n\n\ndef load_json(path: str) -> Union[Dict, List]:\n    with open(path, \"r\") as f:\n        return json.load(f)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\n\nfrom tokenizers import CharBPETokenizer\n\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json\"},\n    \"merges_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"openai-gpt\": 512,\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef text_standardize(text):\n    \"\"\"\n    fixes some issues the spacy tokenizer had on books corpus\n    also does some whitespace standardization\n    \"\"\"\n    text = text.replace(\"—\", \"-\")\n    text = text.replace(\"–\", \"-\")\n    text = text.replace(\"―\", \"-\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"´\", \"'\")\n    text = re.sub(r\"\"\"(-+|~+|!+|\"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)\"\"\", r\" \\1 \", text)\n    text = re.sub(r\"\\s*\\n\\s*\", \" \\n \", text)\n    text = re.sub(r\"[^\\S\\n]+\", \" \", text)\n    return text.strip()\n\n\nclass OpenAIGPTTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer. Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        try:\n            import ftfy\n            from spacy.lang.en import English\n\n            _nlp = English()\n            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)\n            self.fix_text = ftfy.fix_text\n        except ImportError:\n            logger.warning(\"ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.\")\n            self.nlp = BasicTokenizer(do_lower_case=True)\n            self.fix_text = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. 
\"\"\"\n        split_tokens = []\n        if self.fix_text is None:\n            # Using BERT's BasicTokenizer\n            text = self.nlp.tokenize(text)\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        else:\n            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)\n            text = self.nlp(text_standardize(self.fix_text(text)))\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token.text.lower()).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n\nclass OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" BPE tokenizer for OpenAI GPT (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        kwargs.setdefault(\"unk_token\", unk_token)\n        super().__init__(\n            CharBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token, lowercase=True),\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model Reformer.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/spiece.model\"\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/reformer-crime-and-punishment\": 524288,\n}\n\n\nclass ReformerTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an Reformer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        additional_special_tokens=[],\n        **kwargs\n    ):\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size()\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for RoBERTa.\"\"\"\n\n\nimport logging\nfrom typing import List, Optional\n\nfrom tokenizers import AddedToken\nfrom tokenizers.processors import RobertaProcessing\n\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-vocab.json\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n    },\n    \"merges_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-merges.txt\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"roberta-base\": 512,\n    \"roberta-large\": 512,\n    \"roberta-large-mnli\": 512,\n    \"distilroberta-base\": 512,\n    \"roberta-base-openai-detector\": 512,\n    \"roberta-large-openai-detector\": 512,\n}\n\n\nclass RobertaTokenizer(GPT2Tokenizer):\n    \"\"\"\n    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. 
Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            errors=errors,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A RoBERTa sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    def prepare_for_tokenization(self, text, add_special_tokens=False, **kwargs):\n        if \"add_prefix_space\" in kwargs:\n            add_prefix_space = kwargs[\"add_prefix_space\"]\n        else:\n            add_prefix_space = add_special_tokens\n        if add_prefix_space and not text[0].isspace():\n            text = \" \" + text\n        return text\n\n\nclass RobertaTokenizerFast(GPT2TokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        add_prefix_space=True,\n        trim_offsets=True,\n        **kwargs\n    ):\n        kwargs.setdefault(\"pad_token\", pad_token)\n        kwargs.setdefault(\"sep_token\", sep_token)\n        kwargs.setdefault(\"cls_token\", cls_token)\n        kwargs.setdefault(\"mask_token\", mask_token)\n\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            unk_token=unk_token,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n            **kwargs,\n        )\n\n        self.backend_tokenizer._tokenizer.post_processor = RobertaProcessing(\n            sep=(sep_token, self.sep_token_id),\n            cls=(cls_token, self.cls_token_id),\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n        )\n\n        self.backend_tokenizer.add_special_tokens([kwargs[\"mask_token\"]])\n\n    @PreTrainedTokenizer.mask_token.setter\n    def mask_token(self, value):\n        if not isinstance(value, AddedToken):\n            value = AddedToken(value, lstrip=True)\n\n        self._mask_token = str(value)\n        self._maybe_update_backend([value])\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]\n        if token_ids_1 is None:\n            return output\n\n        return output + [self.eos_token_id] + token_ids_1 + [self.eos_token_id]\n\n    def 
create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model T5.\"\"\"\n\n\nimport logging\nimport os\nimport re\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"t5-small\": 512,\n    \"t5-base\": 512,\n    \"t5-large\": 512,\n    \"t5-3b\": 512,\n    \"t5-11b\": 512,\n}\n\n\nclass T5Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            extra_ids (:obj:`List[str]`, `optional`, defaults to :obj:`100`):\n                Add a number of extra ids added to the end of the vocabulary for use as sentinels.\n                These tokens are accessible as \"<extra_id_{%d}>\" where \"{%d}\" is a number between 0 and extra_ids-1.\n                Extra tokens are indexed from the end of the vocabulary up to beginnning (\"<extra_id_0>\" is the last token in the vocabulary like in T5 preprocessing\n                see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        extra_ids=100,\n        additional_special_tokens=None,\n        **kwargs\n    ):\n        # Add extra_ids to the special token list\n        if extra_ids > 0:\n            if additional_special_tokens is None:\n                additional_special_tokens = []\n            additional_special_tokens.extend([\"<extra_id_{}>\".format(i) for i in range(extra_ids)])\n\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self._extra_ids = extra_ids\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size() + self._extra_ids\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) 
for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if token.startswith(\"<extra_id_\"):\n            match = re.match(r\"<extra_id_(\\d+)>\", token)\n            num = int(match.group(1))\n            return self.vocab_size - num - 1\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        else:\n            token = \"<extra_id_{}>\".format(self.vocab_size - 1 - index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport glob\nimport logging\nimport os\nimport pickle\nimport re\nfrom collections import Counter, OrderedDict\nfrom typing import Optional\n\nimport numpy as np\nfrom tokenizers import Tokenizer\nfrom tokenizers.implementations import BaseTokenizer\nfrom tokenizers.models import WordLevel\nfrom tokenizers.normalizers import Lowercase, Sequence, Strip, unicode_normalizer_from_str\nfrom tokenizers.pre_tokenizers import CharDelimiterSplit, WhitespaceSplit\nfrom tokenizers.processors import BertProcessing\n\nfrom .file_utils import cached_path, is_torch_available\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nif is_torch_available():\n    import torch\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"pretrained_vocab_file\": \"vocab.bin\", \"vocab_file\": \"vocab.txt\"}\nVOCAB_FILES_NAMES_FAST = {\"pretrained_vocab_file\": \"vocab.json\", \"vocab_file\": \"vocab.json\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin\",\n    }\n}\n\nPRETRAINED_VOCAB_FILES_MAP_FAST = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.json\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"transfo-xl-wt103\": None,\n}\n\nPRETRAINED_CORPUS_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin\",\n}\nCORPUS_NAME = \"corpus.bin\"\n\n\nclass TransfoXLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token, eos_token=eos_token, additional_special_tokens=additional_special_tokens, **kwargs\n        )\n\n        if never_split is None:\n            never_split = self.all_special_tokens\n        if special is None:\n            special = []\n        self.counter = Counter()\n        self.special = special\n        self.min_freq = min_freq\n        self.max_size = max_size\n        self.lower_case = lower_case\n        self.delimiter = delimiter\n        self.vocab_file = vocab_file\n        self.never_split = never_split\n        self.punctuation_symbols = '!\"#$%&()*+,-./\\:;<=>?@[\\\\]^_`{|}~'  # noqa: W605\n        self.punction_without_space_before_pattern = re.compile(r\"[^\\s][{}]\".format(self.punctuation_symbols))\n        self.punctuation_with_space_around_pattern = self._compile_space_around_punctuation_pattern()\n\n        try:\n            if pretrained_vocab_file is not None:\n                # Hack because, honestly this tokenizer was not made to be used\n                # in a library like ours, at all.\n                vocab_dict = torch.load(pretrained_vocab_file)\n                for key, value in vocab_dict.items():\n                    if key not in self.__dict__:\n                        self.__dict__[key] = value\n\n            if vocab_file is not None:\n                self.build_vocab()\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizerFast,\"\n                \"please note they are not compatible.\".format(pretrained_vocab_file)\n            )\n\n        if vocab_file is not None:\n            self.build_vocab()\n\n    def _compile_space_around_punctuation_pattern(self):\n        look_ahead_for_special_token = \"(?=[{}])\".format(self.punctuation_symbols)\n        look_ahead_to_match_all_except_space = \"(?=[^\\s])\"  # noqa: W605\n        return re.compile(r\"\" + look_ahead_for_special_token + look_ahead_to_match_all_except_space)\n\n    def count_file(self, path, verbose=False, add_eos=False):\n        if verbose:\n            logger.info(\"counting file {} ...\".format(path))\n        assert os.path.exists(path)\n\n        sents = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos)\n                self.counter.update(symbols)\n                sents.append(symbols)\n\n        return sents\n\n    def count_sents(self, sents, verbose=False):\n        \"\"\"\n            sents : a list of sentences, each a list of tokenized symbols\n        \"\"\"\n        if verbose:\n            logger.info(\"counting {} sents ...\".format(len(sents)))\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            self.counter.update(symbols)\n\n    def _build_from_file(self, vocab_file):\n        self.idx2sym = []\n        self.sym2idx = OrderedDict()\n\n        with open(vocab_file, \"r\", encoding=\"utf-8\") as f:\n            for line in f:\n                symb = line.strip().split()[0]\n                self.add_symbol(symb)\n        if \"<UNK>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<UNK>\"]\n        elif \"<unk>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<unk>\"]\n        else:\n            raise ValueError(\"No <unkown> token in vocabulary\")\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n\n        logger.warning(\n            \"Please note you will not be able to load the save vocabulary in\"\n            \" Rust-based TransfoXLTokenizerFast as they don't share the same structure.\"\n        )\n\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"pretrained_vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        torch.save(self.__dict__, vocab_file)\n        return (vocab_file,)\n\n    def build_vocab(self):\n        if self.vocab_file:\n            logger.info(\"building vocab from {}\".format(self.vocab_file))\n            self._build_from_file(self.vocab_file)\n            logger.info(\"final vocab size {}\".format(len(self)))\n        else:\n            logger.info(\"building vocab with min_freq={}, max_size={}\".format(self.min_freq, self.max_size))\n            self.idx2sym = []\n            self.sym2idx = OrderedDict()\n\n            for sym in self.special:\n                
self.add_special(sym)\n\n            for sym, cnt in self.counter.most_common(self.max_size):\n                if cnt < self.min_freq:\n                    break\n                self.add_symbol(sym)\n\n            logger.info(\"final vocab size {} from {} unique tokens\".format(len(self), len(self.counter)))\n\n    def encode_file(self, path, ordered=False, verbose=False, add_eos=True, add_double_eos=False):\n        if verbose:\n            logger.info(\"encoding file {} ...\".format(path))\n        assert os.path.exists(path)\n        encoded = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos, add_double_eos=add_double_eos)\n                encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def encode_sents(self, sents, ordered=False, verbose=False):\n        if verbose:\n            logger.info(\"encoding {} sents ...\".format(len(sents)))\n        encoded = []\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def add_special(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n            setattr(self, \"{}_idx\".format(sym.strip(\"<>\")), self.sym2idx[sym])\n\n    def add_symbol(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n\n    def _convert_id_to_token(self, idx):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        assert 0 <= idx < len(self), \"Index {} out of vocabulary range\".format(idx)\n        return self.idx2sym[idx]\n\n    def _convert_token_to_id(self, sym):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if sym in self.sym2idx:\n            return self.sym2idx[sym]\n        else:\n            # logger.info('encounter unk {}'.format(sym))\n            # assert '<eos>' not in sym\n            if hasattr(self, \"unk_idx\"):\n                return self.sym2idx.get(sym, self.unk_idx)\n            # Backward compatibility with pre-trained models\n            elif \"<unk>\" in self.sym2idx:\n                return self.sym2idx[\"<unk>\"]\n            elif \"<UNK>\" in self.sym2idx:\n                return self.sym2idx[\"<UNK>\"]\n            else:\n                raise ValueError(\"Token not in vocabulary and no <unk> token in vocabulary for replacement\")\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = \" \".join(tokens).strip()\n        return out_string\n\n    def convert_to_tensor(self, symbols):\n        return torch.LongTensor(self.convert_tokens_to_ids(symbols))\n\n    @property\n    def vocab_size(self):\n        return len(self.idx2sym)\n\n    def get_vocab(self):\n        return dict(self.sym2idx, **self.added_tokens_encoder)\n\n    def _tokenize(self, line, add_eos=False, add_double_eos=False):\n        line = line.strip()\n        # convert to lower case\n        if self.lower_case:\n            line = line.lower()\n\n        # empty delimiter '' will evaluate False\n        if self.delimiter == \"\":\n            symbols = line\n        else:\n            symbols = line.split(self.delimiter)\n\n        if add_double_eos:  # lm1b\n            return [\"<S>\"] + symbols + [\"<S>\"]\n        elif add_eos:\n            return symbols + [\"<eos>\"]\n        else:\n            return symbols\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        # add spaces before punctuation symbols as should be done in transfo-xl\n\n        if \"add_space_before_punct_symbol\" in kwargs and kwargs[\"add_space_before_punct_symbol\"]:\n            text = self.punctuation_with_space_around_pattern.sub(r\" \", text)\n        elif self.punction_without_space_before_pattern.search(text):\n            # searches until the first occurence of a punctuation symbol without surrounding spaces\n            logger.warning(\n                \"You might want to consider setting `add_space_before_punct_symbol=True` as an argument to the `tokenizer.encode()` to avoid tokenizing words with punctuation symbols to the `<unk>` token\"\n            )\n\n        return text\n\n\nclass _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):\n    def __init__(\n        self,\n        vocab_file,\n        delimiter,\n        lowercase,\n        unk_token,\n        eos_token,\n        add_eos=False,\n        add_double_eos=False,\n        normalization: Optional[str] = None,\n    ):\n\n        try:\n            tokenizer = WordLevel(vocab_file, unk_token=unk_token)\n            tokenizer = Tokenizer(tokenizer)\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizer,\"\n                \"please note they are not compatible.\".format(vocab_file)\n            )\n\n        # Create the correct normalization path\n        normalizer = []\n\n        # Include unicode normalization\n        if normalization:\n            normalizer += [unicode_normalizer_from_str(normalization)]\n\n        # Include case normalization\n        if lowercase:\n            normalizer += [Lowercase()]\n\n        # Strip normalizer at the end\n        normalizer += [Strip(left=True, right=True)]\n\n        if len(normalizer) > 0:\n            tokenizer.normalizer = Sequence(normalizer) if len(normalizer) > 1 else normalizer[0]\n\n        # Setup the splitter\n        tokenizer.pre_tokenizer = CharDelimiterSplit(delimiter) if delimiter else WhitespaceSplit()\n\n        if add_double_eos:\n            tokenizer.post_processor = BertProcessing(\n                (eos_token, tokenizer.token_to_id(eos_token)), (eos_token, tokenizer.token_to_id(eos_token))\n            )\n\n        parameters = {\n            \"model\": \"TransfoXLModel\",\n            \"add_eos\": add_eos,\n            \"add_double_eos\": add_double_eos,\n            \"unk_token\": unk_token,\n            \"eos_token\": eos_token,\n            \"delimiter\": delimiter,\n            \"lowercase\": lowercase,\n        }\n\n        super().__init__(tokenizer, parameters)\n\n\nclass TransfoXLTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" Transformer-XL tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    The Transformer-XL tokenizer is a word-level tokenizer (no sub-word tokenization).\n\n    Adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES_FAST\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP_FAST\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        add_eos=False,\n        add_double_eos=False,\n        normalization=None,\n        **kwargs\n    ):\n\n        super().__init__(\n            _TransfoXLDelimiterLookupTokenizer(\n                vocab_file=vocab_file or pretrained_vocab_file,\n                delimiter=delimiter,\n                lowercase=lower_case,\n                unk_token=unk_token,\n                eos_token=eos_token,\n                add_eos=add_eos,\n                add_double_eos=add_double_eos,\n                normalization=normalization,\n            ),\n            unk_token=unk_token,\n            eos_token=eos_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n    def save_pretrained(self, save_directory):\n        logger.warning(\n            \"Please note you will not be able to load the vocabulary in\"\n            \" Python-based TransfoXLTokenizer as they don't share the same structure.\"\n        )\n\n        return super().save_pretrained(save_directory)\n\n\nclass LMOrderedIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None):\n        \"\"\"\n            data -- LongTensor -- the LongTensor is strictly ordered\n        \"\"\"\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n\n        # Work out how cleanly we can divide the dataset into bsz parts.\n        self.n_step = data.size(0) // bsz\n\n        # Trim off any extra elements that wouldn't cleanly fit (remainders).\n        data = data.narrow(0, 0, self.n_step * bsz)\n\n        # Evenly divide the data across the bsz batches.\n        self.data = data.view(bsz, -1).t().contiguous().to(device)\n\n        # Number of mini-batches\n        self.n_batch = (self.n_step + self.bptt - 1) // self.bptt\n\n    def get_batch(self, i, bptt=None):\n        if bptt is None:\n            bptt = self.bptt\n        seq_len = min(bptt, self.data.size(0) - 1 - i)\n\n        end_idx = i + seq_len\n        beg_idx = max(0, i - self.ext_len)\n\n        data = self.data[beg_idx:end_idx]\n        target = self.data[i + 1 : i + 1 + seq_len]\n\n        data_out = data.transpose(0, 1).contiguous().to(self.device)\n        target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n        return data_out, target_out, seq_len\n\n    def get_fixlen_iter(self, start=0):\n        for i in range(start, self.data.size(0) - 1, self.bptt):\n            yield self.get_batch(i)\n\n    def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):\n        max_len = self.bptt + max_deviation * std\n        i = start\n        while True:\n            bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.0\n            bptt = min(max_len, max(min_len, int(np.random.normal(bptt, std))))\n            data, 
target, seq_len = self.get_batch(i, bptt)\n            i += seq_len\n            yield data, target, seq_len\n            if i >= self.data.size(0) - 2:\n                break\n\n    def __iter__(self):\n        return self.get_fixlen_iter()\n\n\nclass LMShuffledIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n        \"\"\"\n            data -- list[LongTensor] -- there is no order among the LongTensors\n        \"\"\"\n        self.data = data\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self):\n        # index iterator\n        epoch_indices = np.random.permutation(len(self.data)) if self.shuffle else np.array(range(len(self.data)))\n\n        # sentence iterator\n        for idx in epoch_indices:\n            yield self.data[idx]\n\n    def stream_iterator(self, sent_stream):\n        # streams for each data in the batch\n        streams = [None] * self.bsz\n\n        data = torch.LongTensor(self.bptt, self.bsz)\n        target = torch.LongTensor(self.bptt, self.bsz)\n\n        n_retain = 0\n\n        while True:\n            # data   : [n_retain+bptt x bsz]\n            # target : [bptt x bsz]\n            data[n_retain:].fill_(-1)\n            target.fill_(-1)\n\n            valid_batch = True\n\n            for i in range(self.bsz):\n                n_filled = 0\n                try:\n                    while n_filled < self.bptt:\n                        if streams[i] is None or len(streams[i]) <= 1:\n                            streams[i] = next(sent_stream)\n                        # number of new tokens to fill in\n                        n_new = min(len(streams[i]) - 1, self.bptt - n_filled)\n                        # first n_retain tokens are retained from last batch\n                        data[n_retain + n_filled : n_retain + n_filled + n_new, i] = streams[i][:n_new]\n                        target[n_filled : n_filled + n_new, i] = streams[i][1 : n_new + 1]\n                        streams[i] = streams[i][n_new:]\n                        n_filled += n_new\n                except StopIteration:\n                    valid_batch = False\n                    break\n\n            if not valid_batch:\n                return\n\n            data_out = data.transpose(0, 1).contiguous().to(self.device)\n            target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n            yield data_out, target_out, self.bptt\n\n            n_retain = min(data.size(0), self.ext_len)\n            if n_retain > 0:\n                data[:n_retain] = data[-n_retain:]\n            data.resize_(n_retain + self.bptt, data.size(1))\n\n    def __iter__(self):\n        # sent_stream is an iterator\n        sent_stream = self.get_sent_stream()\n\n        for batch in self.stream_iterator(sent_stream):\n            yield batch\n\n\nclass LMMultiFileIterator(LMShuffledIterator):\n    def __init__(self, paths, vocab, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n\n        self.paths = paths\n        self.vocab = vocab\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self, path):\n        sents = self.vocab.encode_file(path, add_double_eos=True)\n        if self.shuffle:\n            np.random.shuffle(sents)\n      
  sent_stream = iter(sents)\n\n        return sent_stream\n\n    def __iter__(self):\n        if self.shuffle:\n            np.random.shuffle(self.paths)\n\n        for path in self.paths:\n            # sent_stream is an iterator\n            sent_stream = self.get_sent_stream(path)\n            for batch in self.stream_iterator(sent_stream):\n                yield batch\n\n\nclass TransfoXLCorpus(object):\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):\n        \"\"\"\n        Instantiate a pre-processed corpus.\n        \"\"\"\n        vocab = TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n        if pretrained_model_name_or_path in PRETRAINED_CORPUS_ARCHIVE_MAP:\n            corpus_file = PRETRAINED_CORPUS_ARCHIVE_MAP[pretrained_model_name_or_path]\n        else:\n            corpus_file = os.path.join(pretrained_model_name_or_path, CORPUS_NAME)\n        # redirect to the cache, if necessary\n        try:\n            resolved_corpus_file = cached_path(corpus_file, cache_dir=cache_dir)\n        except EnvironmentError:\n            logger.error(\n                \"Corpus '{}' was not found in corpus list ({}). \"\n                \"We assumed '{}' was a path or url but couldn't find files {} \"\n                \"at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(PRETRAINED_CORPUS_ARCHIVE_MAP.keys()),\n                    pretrained_model_name_or_path,\n                    corpus_file,\n                )\n            )\n            return None\n        if resolved_corpus_file == corpus_file:\n            logger.info(\"loading corpus file {}\".format(corpus_file))\n        else:\n            logger.info(\"loading corpus file {} from cache at {}\".format(corpus_file, resolved_corpus_file))\n\n        # Instantiate tokenizer.\n        corpus = cls(*inputs, **kwargs)\n        corpus_dict = torch.load(resolved_corpus_file)\n        for key, value in corpus_dict.items():\n            corpus.__dict__[key] = value\n        corpus.vocab = vocab\n        if corpus.train is not None:\n            corpus.train = torch.tensor(corpus.train, dtype=torch.long)\n        if corpus.valid is not None:\n            corpus.valid = torch.tensor(corpus.valid, dtype=torch.long)\n        if corpus.test is not None:\n            corpus.test = torch.tensor(corpus.test, dtype=torch.long)\n        return corpus\n\n    def __init__(self, *args, **kwargs):\n        self.vocab = TransfoXLTokenizer(*args, **kwargs)\n        self.dataset = None\n        self.train = None\n        self.valid = None\n        self.test = None\n\n    def build_corpus(self, path, dataset):\n        self.dataset = dataset\n\n        if self.dataset in [\"ptb\", \"wt2\", \"enwik8\", \"text8\"]:\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n            self.vocab.count_file(os.path.join(path, \"valid.txt\"))\n            self.vocab.count_file(os.path.join(path, \"test.txt\"))\n        elif self.dataset == \"wt103\":\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n        elif self.dataset == \"lm1b\":\n            train_path_pattern = os.path.join(\n                path,\n                \"1-billion-word-language-modeling-benchmark-r13output\",\n                \"training-monolingual.tokenized.shuffled\",\n                \"news.en-*\",\n            )\n            train_paths = glob.glob(train_path_pattern)\n            # the 
vocab will load from file when build_vocab() is called\n\n        self.vocab.build_vocab()\n\n        if self.dataset in [\"ptb\", \"wt2\", \"wt103\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True)\n        elif self.dataset in [\"enwik8\", \"text8\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True, add_eos=False)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True, add_eos=False)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True, add_eos=False)\n        elif self.dataset == \"lm1b\":\n            self.train = train_paths\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=False, add_double_eos=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=False, add_double_eos=True)\n\n    def get_iterator(self, split, *args, **kwargs):\n        if split == \"train\":\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(self.train, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                kwargs[\"shuffle\"] = True\n                data_iter = LMMultiFileIterator(self.train, self.vocab, *args, **kwargs)\n        elif split in [\"valid\", \"test\"]:\n            data = self.valid if split == \"valid\" else self.test\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(data, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                data_iter = LMShuffledIterator(data, *args, **kwargs)\n\n        return data_iter\n\n\ndef get_lm_corpus(datadir, dataset):\n    fn = os.path.join(datadir, \"cache.pt\")\n    fn_pickle = os.path.join(datadir, \"cache.pkl\")\n    if os.path.exists(fn):\n        logger.info(\"Loading cached dataset...\")\n        corpus = torch.load(fn_pickle)\n    elif os.path.exists(fn):\n        logger.info(\"Loading cached dataset from pickle...\")\n        with open(fn, \"rb\") as fp:\n            corpus = pickle.load(fp)\n    else:\n        logger.info(\"Producing dataset {}...\".format(dataset))\n        kwargs = {}\n        if dataset in [\"wt103\", \"wt2\"]:\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = False\n        elif dataset == \"ptb\":\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = True\n        elif dataset == \"lm1b\":\n            kwargs[\"special\"] = []\n            kwargs[\"lower_case\"] = False\n            kwargs[\"vocab_file\"] = os.path.join(datadir, \"1b_word_vocab.txt\")\n        elif dataset in [\"enwik8\", \"text8\"]:\n            pass\n\n        corpus = TransfoXLCorpus(datadir, dataset, **kwargs)\n        torch.save(corpus, fn)\n\n    return corpus\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_utils.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for python and fast tokenizers. Fast tokenizers are provided by HuggingFace's tokenizers library.\"\"\"\n\nimport copy\nimport functools\nimport itertools\nimport json\nimport logging\nimport operator\nimport os\nimport re\nimport warnings\nfrom collections import UserDict, defaultdict\nfrom contextlib import contextmanager\nfrom typing import Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union\n\nfrom tokenizers import AddedToken as AddedTokenFast\nfrom tokenizers import Encoding as EncodingFast\nfrom tokenizers.decoders import Decoder as DecoderFast\nfrom tokenizers.implementations import BaseTokenizer as BaseTokenizerFast\n\nfrom .file_utils import cached_path, hf_bucket_url, is_remote_url, is_tf_available, is_torch_available, torch_required\n\n\nif is_tf_available():\n    import tensorflow as tf\nif is_torch_available():\n    import torch\n\nlogger = logging.getLogger(__name__)\n\nSPECIAL_TOKENS_MAP_FILE = \"special_tokens_map.json\"\nADDED_TOKENS_FILE = \"added_tokens.json\"\nTOKENIZER_CONFIG_FILE = \"tokenizer_config.json\"\n\nVERY_LARGE_INTEGER = int(1e30)  # This is used to set the max input length for a model with infinite size input\nLARGE_INTEGER = int(1e20)  # This is used when we need something big but slightly smaller than VERY_LARGE_INTEGER\n\n# Define type aliases and NamedTuples\nTextInput = str\nPreTokenizedInput = List[str]\nEncodedInput = List[int]\nTextInputPair = Tuple[str, str]\nPreTokenizedInputPair = Tuple[List[str], List[str]]\nEncodedInputPair = Tuple[List[int], List[int]]\n\n\nclass CharSpan(NamedTuple):\n    \"\"\" Character span in the original string\n\n        Args:\n            start: index of the first character in the original string\n            end: index of the character following the last character in the original string\n    \"\"\"\n\n    start: int\n    end: int\n\n\nclass TokenSpan(NamedTuple):\n    \"\"\" Token span in an encoded string (list of tokens)\n\n        Args:\n            start: index of the first token in the span\n            end: index of the token following the last token in the span\n    \"\"\"\n\n    start: int\n    end: int\n\n\ndef flatten(x: Sequence):\n    \"\"\"\n    Flatten the provided (potentially nested) sequence\n\n    Args:\n        x (Sequence): Potentially nested sequence to flatten\n\n    Returns:\n        list: Flattened sequence\n    \"\"\"\n\n    return functools.reduce(operator.iconcat, x, [])\n\n\n@contextmanager\ndef truncate_and_pad(\n    tokenizer: BaseTokenizerFast,\n    max_length: int,\n    stride: int,\n    strategy: str,\n    pad_to_max_length: bool,\n    padding_side: str,\n    pad_token_id: int,\n    pad_token_type_id: int,\n    pad_token: str,\n):\n    \"\"\" This contextmanager is in charge of defining the truncation and the padding strategies for fast tokenizers\n        (provided by HuggingFace tokenizers library) and restore the 
tokenizer settings afterwards.\n\n        This contextmanager assumes the provider tokenizer has no padding / truncation strategy\n        before the managed section. If your tokenizer set a padding / truncation strategy before,\n        then it will be reset to no padding/truncation when exiting the managed section.\n\n        Args:\n            tokenizer (BaseTokenizerFast): The tokenizer which will be used\n            max_length (int): The maximum size of the sequence\n            stride (int): The stride to use when handling overflow\n            strategy (str): Overflowing logic to use\n            pad_to_max_length (bool): Boolean indicating if the output needs to be padded up to max_length\n            padding_side (str): \"left\" or \"right\" indicating the direction the output sequence will be padded\n            pad_token_id (int): The integer representation of the padding token to use\n            pad_token_type_id (int): The integer representation of the padding token type to use\n            pad_token (str): The string representation of the padding token to use\n\n    \"\"\"\n\n    # Handle all the truncation and padding stuff\n    if max_length is not None:\n        tokenizer.enable_truncation(max_length, stride=stride, strategy=strategy)\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.enable_padding(\n            max_length=max_length,\n            direction=padding_side,\n            pad_id=pad_token_id,\n            pad_type_id=pad_token_type_id,\n            pad_token=pad_token,\n        )\n    elif pad_to_max_length:\n        logger.warning(\n            \"Disabled padding because no padding token set (pad_token: {}, pad_token_id: {}).\\n\"\n            \"To remove this error, you can add a new pad token and then resize model embedding:\\n\"\n            \"\\ttokenizer.pad_token = '<PAD>'\\n\\tmodel.resize_token_embeddings(len(tokenizer))\".format(\n                pad_token, pad_token_id\n            )\n        )\n\n    yield\n\n    # TODO(morgan, anthony): once we have a simple way to serialize tokenizers maybe store and restore the state afterward\n    # to avoid destructing the padding / truncation strategy as we do now.\n\n    if max_length is not None:\n        tokenizer.no_truncation()\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.no_padding()\n\n\nclass BatchEncoding(UserDict):\n    \"\"\" BatchEncoding hold the output of the encode and batch_encode methods (tokens, attention_masks, etc).\n        This class is derived from a python Dictionary and can be used as a dictionnary.\n        In addition, this class expose utility methods to map from word/char space to token space.\n\n        Args:\n            data (:obj:`dict`): Dictionary of lists/arrays returned by the encode/batch_encode methods ('input_ids', 'attention_mask'...)\n            encoding (:obj:`EncodingFast`, :obj:`list(EncodingFast)`, `optional`, defaults to :obj:`None`):\n                If the tokenizer is a fast tokenizer which outputs additional informations like mapping from word/char space to token space\n                the `EncodingFast` instance or list of instance (for batches) hold these informations.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        data: Optional[Dict[str, Any]] = None,\n        encoding: Optional[Union[EncodingFast, Sequence[EncodingFast]]] = None,\n    ):\n        super().__init__(data)\n\n        if isinstance(encoding, EncodingFast):\n            encoding = [encoding]\n\n        
self._encodings = encoding\n\n    def __getitem__(self, item: Union[int, str]) -> EncodingFast:\n        \"\"\" If the key is a string, get the value of the dict associated to `key` ('input_ids', 'attention_mask'...)\n            If the key is an integer, get the EncodingFast for batch item with index `key`\n        \"\"\"\n        if isinstance(item, str):\n            return self.data[item]\n        elif self._encodings is not None:\n            return self._encodings[item]\n        else:\n            raise KeyError(\n                \"Indexing with integers (to access backend Encoding for a given batch index) \"\n                \"is not available when using Python based tokenizers\"\n            )\n\n    def __getattr__(self, item: str):\n        return self.data[item]\n\n    def keys(self):\n        return self.data.keys()\n\n    def values(self):\n        return self.data.values()\n\n    def items(self):\n        return self.data.items()\n\n    # After this point:\n    # Extended properties and methods only available for fast (Rust-based) tokenizers\n    # provided by HuggingFace tokenizers library.\n\n    @property\n    def encodings(self) -> Optional[List[EncodingFast]]:\n        \"\"\"\n        Return the list all encoding from the tokenization process\n\n        Returns: List[EncodingFast] or None if input was tokenized through Python (i.e. not fast) tokenizer\n        \"\"\"\n        return self._encodings\n\n    def tokens(self, batch_index: int = 0) -> List[int]:\n        if not self._encodings:\n            raise ValueError(\"tokens() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].tokens\n\n    def words(self, batch_index: int = 0) -> List[Optional[int]]:\n        if not self._encodings:\n            raise ValueError(\"words() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].words\n\n    def token_to_word(self, batch_or_token_index: int, token_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the word corresponding (i.e. comprising) to an encoded token\n            in a sequence of the batch.\n\n            Can be called as:\n                - self.token_to_word(token_index) if batch size is 1\n                - self.token_to_word(batch_index, token_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token in the sequence.\n\n        Returns:\n            word_index (:obj:`int`):\n                index of the word in the input sequence.\n\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_word() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if token_index < 0:\n            token_index = self._seq_len + token_index\n        return self._encodings[batch_index].token_to_word(token_index)\n\n    def word_to_tokens(self, batch_or_word_index: int, word_index: Optional[int] = None) -> TokenSpan:\n        \"\"\" Get the encoded token span corresponding to a word in the sequence of the batch.\n\n            Token spans are returned as a TokenSpan NamedTuple with:\n                start: index of the first token\n                end: index of the token following the last token\n\n            Can be called as:\n                - self.word_to_tokens(word_index) if batch size is 1\n                - self.word_to_tokens(batch_index, word_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprises one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            token_span (:obj:`TokenSpan`):\n                Span of tokens in the encoded sequence.\n\n                TokenSpan are NamedTuple with:\n                    start: index of the first token\n                    end: index of the token following the last token\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_tokens() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if word_index < 0:\n            word_index = self._seq_len + word_index\n        return TokenSpan(*(self._encodings[batch_index].word_to_tokens(word_index)))\n\n    def token_to_chars(self, batch_or_token_index: int, token_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span corresponding to an encoded token in a sequence of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string associated to the token\n                end: index of the character following the last character in the original string associated to the token\n\n            Can be called as:\n                - self.token_to_chars(token_index) if batch size is 1\n                - self.token_to_chars(batch_index, token_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token or tokens in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan`):\n                Span of characters in the original string.\n\n                CharSpan are NamedTuple with:\n                    start: index of the first character in the original string\n                    end: index of the character following the last character in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_chars() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        return CharSpan(*(self._encodings[batch_index].token_to_chars(token_index)))\n\n    def char_to_token(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the token in the encoded output comprising a character\n            in the original string for a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_token(char_index) if batch size is 1\n                - self.char_to_token(batch_index, char_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n\n        Returns:\n            token_index (:obj:`int`):\n                Index of the token.\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_token() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_token(char_index)\n\n    def word_to_chars(self, batch_or_word_index: int, word_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span in the original string corresponding to given word in a sequence\n            of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string\n                end: index of the character following the last character in the original string\n\n            Can be called as:\n                - self.word_to_chars(word_index) if batch size is 1\n                - self.word_to_chars(batch_index, word_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan` or :obj:`List[CharSpan]`):\n                Span(s) of the associated character or characters in the string.\n                CharSpan are NamedTuple with:\n                    start: index of the first character associated to the token in the original string\n                    end: index of the character following the last character associated to the token in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_chars() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index)))\n\n    def char_to_word(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the word in the original string corresponding to a character in the original string of\n            a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_word(char_index) if batch size is 1\n                - self.char_to_word(batch_index, char_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the character in the orginal string.\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the character in the orginal string.\n\n\n        Returns:\n            token_index (:obj:`int` or :obj:`List[int]`):\n                Index or indices of the associated encoded token(s).\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_word() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_word(char_index)\n\n    @torch_required\n    def to(self, device: str):\n        \"\"\"Send all values to device by calling v.to(device)\"\"\"\n        self.data = {k: v.to(device) for k, v in self.data.items()}\n        return self\n\n\nclass SpecialTokensMixin:\n    \"\"\" SpecialTokensMixin is derived by ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` and\n        handles specific behaviors related to special tokens. 
In particular, this class hold the\n        attributes which can be used to directly access to these special tokens in a\n        model-independant manner and allow to set and update the special tokens.\n    \"\"\"\n\n    SPECIAL_TOKENS_ATTRIBUTES = [\n        \"bos_token\",\n        \"eos_token\",\n        \"unk_token\",\n        \"sep_token\",\n        \"pad_token\",\n        \"cls_token\",\n        \"mask_token\",\n        \"additional_special_tokens\",\n    ]\n\n    def __init__(self, **kwargs):\n        self._bos_token = None\n        self._eos_token = None\n        self._unk_token = None\n        self._sep_token = None\n        self._pad_token = None\n        self._cls_token = None\n        self._mask_token = None\n        self._pad_token_type_id = 0\n        self._additional_special_tokens = []\n\n        for key, value in kwargs.items():\n            if key in self.SPECIAL_TOKENS_ATTRIBUTES:\n                if key == \"additional_special_tokens\":\n                    assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                    setattr(self, key, value)\n                elif isinstance(value, AddedTokenFast):\n                    setattr(self, key, str(value))\n                elif isinstance(value, str):\n                    setattr(self, key, value)\n                else:\n                    raise TypeError(\n                        \"special token {} has to be either str or AddedTokenFast but got: {}\".format(key, type(value))\n                    )\n\n    @property\n    def bos_token(self):\n        \"\"\" Beginning of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._bos_token is None:\n            logger.error(\"Using bos_token, but it is not set yet.\")\n        return self._bos_token\n\n    @property\n    def eos_token(self):\n        \"\"\" End of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._eos_token is None:\n            logger.error(\"Using eos_token, but it is not set yet.\")\n        return self._eos_token\n\n    @property\n    def unk_token(self):\n        \"\"\" Unknown token (string). Log an error if used while not having been set. \"\"\"\n        if self._unk_token is None:\n            logger.error(\"Using unk_token, but it is not set yet.\")\n        return self._unk_token\n\n    @property\n    def sep_token(self):\n        \"\"\" Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        if self._sep_token is None:\n            logger.error(\"Using sep_token, but it is not set yet.\")\n        return self._sep_token\n\n    @property\n    def pad_token(self):\n        \"\"\" Padding token (string). Log an error if used while not having been set. \"\"\"\n        if self._pad_token is None:\n            logger.error(\"Using pad_token, but it is not set yet.\")\n        return self._pad_token\n\n    @property\n    def cls_token(self):\n        \"\"\" Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. \"\"\"\n        if self._cls_token is None:\n            logger.error(\"Using cls_token, but it is not set yet.\")\n        return self._cls_token\n\n    @property\n    def mask_token(self):\n        \"\"\" Mask token (string). E.g. when training a model with masked-language modeling. 
Log an error if used while not having been set. \"\"\"\n        if self._mask_token is None:\n            logger.error(\"Using mask_token, but it is not set yet.\")\n        return self._mask_token\n\n    @property\n    def additional_special_tokens(self):\n        \"\"\" All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. \"\"\"\n        if self._additional_special_tokens is None:\n            logger.error(\"Using additional_special_tokens, but it is not set yet.\")\n        return self._additional_special_tokens\n\n    def _maybe_update_backend(self, value):\n        \"\"\" To be overriden by derived class if a backend tokenizer has to be updated. \"\"\"\n        pass\n\n    @bos_token.setter\n    def bos_token(self, value):\n        self._bos_token = value\n        self._maybe_update_backend([value])\n\n    @eos_token.setter\n    def eos_token(self, value):\n        self._eos_token = value\n        self._maybe_update_backend([value])\n\n    @unk_token.setter\n    def unk_token(self, value):\n        self._unk_token = value\n        self._maybe_update_backend([value])\n\n    @sep_token.setter\n    def sep_token(self, value):\n        self._sep_token = value\n        self._maybe_update_backend([value])\n\n    @pad_token.setter\n    def pad_token(self, value):\n        self._pad_token = value\n        self._maybe_update_backend([value])\n\n    @cls_token.setter\n    def cls_token(self, value):\n        self._cls_token = value\n        self._maybe_update_backend([value])\n\n    @mask_token.setter\n    def mask_token(self, value):\n        self._mask_token = value\n        self._maybe_update_backend([value])\n\n    @additional_special_tokens.setter\n    def additional_special_tokens(self, value):\n        self._additional_special_tokens = value\n        self._maybe_update_backend(value)\n\n    @property\n    def bos_token_id(self):\n        \"\"\" Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.bos_token)\n\n    @property\n    def eos_token_id(self):\n        \"\"\" Id of the end of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.eos_token)\n\n    @property\n    def unk_token_id(self):\n        \"\"\" Id of the unknown token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.unk_token)\n\n    @property\n    def sep_token_id(self):\n        \"\"\" Id of the separation token in the vocabulary. E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.sep_token)\n\n    @property\n    def pad_token_id(self):\n        \"\"\" Id of the padding token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.pad_token)\n\n    @property\n    def pad_token_type_id(self):\n        \"\"\" Id of the padding token type in the vocabulary.\"\"\"\n        return self._pad_token_type_id\n\n    @property\n    def cls_token_id(self):\n        \"\"\" Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. 
\"\"\"\n        return self.convert_tokens_to_ids(self.cls_token)\n\n    @property\n    def mask_token_id(self):\n        \"\"\" Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.mask_token)\n\n    @property\n    def additional_special_tokens_ids(self):\n        \"\"\" Ids of all the additional special tokens in the vocabulary (list of integers). Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.additional_special_tokens)\n\n    @property\n    def special_tokens_map(self):\n        \"\"\" A dictionary mapping special token class attribute (cls_token, unk_token...) to their\n            values ('<unk>', '<cls>'...)\n        \"\"\"\n        set_attr = {}\n        for attr in self.SPECIAL_TOKENS_ATTRIBUTES:\n            attr_value = getattr(self, \"_\" + attr)\n            if attr_value:\n                set_attr[attr] = attr_value\n        return set_attr\n\n    @property\n    def all_special_tokens(self):\n        \"\"\" List all the special tokens ('<unk>', '<cls>'...) mapped to class attributes\n            (cls_token, unk_token...).\n        \"\"\"\n        all_toks = []\n        set_attr = self.special_tokens_map\n        for attr_value in set_attr.values():\n            all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])\n        all_toks = list(set(all_toks))\n        return all_toks\n\n    @property\n    def all_special_ids(self):\n        \"\"\" List the vocabulary indices of the special tokens ('<unk>', '<cls>'...) mapped to\n            class attributes (cls_token, unk_token...).\n        \"\"\"\n        all_toks = self.all_special_tokens\n        all_ids = self.convert_tokens_to_ids(all_toks)\n        return all_ids\n\n\nclass PreTrainedTokenizer(SpecialTokensMixin):\n    \"\"\" Base class for all tokenizers.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as 
associated values, a dictionary of specific arguments to pass to the\n            ``__init__`` method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, or if\n            no associated max_length can be found in ``max_model_input_sizes``, it will default to VERY_LARGE_INTEGER (`int(1e30)`).\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensures they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    vocab_files_names: Dict[str, str] = {}\n    pretrained_vocab_files_map: Dict[str, Dict[str, str]] = {}\n    pretrained_init_configuration: Dict[str, Dict[str, Any]] = {}\n    max_model_input_sizes: Dict[str, int] = {}\n    model_input_names: List[str] = [\"token_type_ids\", \"attention_mask\"]\n\n    padding_side: str = \"right\"\n\n    NO_PAD_TOKEN_FOR_BATCH_MSG = (\n        \"No padding token is set for this model, therefore no batch can be made with uneven \"\n        \"sequences. Set a padding token or adjust the lengths of the sequences building the \"\n        \"batch so that every sequence is of the same length.\"\n    )\n\n    UNEVEN_SEQUENCES_FOR_BATCH_MSG = (\n        \"The sequences building the batch are not of the same size, no tensor \"\n        \"can be built. 
Set `pad_to_max_length=True` to pad the smaller sequences\"\n        \"up to the larger sequence's length.\"\n    )\n\n    @property\n    def vocab_size(self) -> int:\n        \"\"\" Size of the base vocabulary (without the added tokens) \"\"\"\n        raise NotImplementedError\n\n    @property\n    def is_fast(self) -> bool:\n        return False\n\n    @property\n    def max_len(self) -> int:\n        \"\"\" Kept here for backward compatibility.\n            Now renamed to `model_max_length` to avoid ambiguity.\n        \"\"\"\n        return self.model_max_length\n\n    @property\n    def max_len_single_sentence(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=False)\n\n    @property\n    def max_len_sentences_pair(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=True)\n\n    @max_len_single_sentence.setter\n    def max_len_single_sentence(self, value) -> int:\n        \"\"\" For backward compatibility, allow to try to setup 'max_len_single_sentence' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=False):\n            logger.warning(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    @max_len_sentences_pair.setter\n    def max_len_sentences_pair(self, value) -> int:\n        \"\"\" For backward compatibility, allow to try to setup 'max_len_sentences_pair' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=True):\n            logger.warning(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    def get_vocab(self):\n        \"\"\" Returns the vocabulary as a dict of {token: index} pairs. `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab. \"\"\"\n        raise NotImplementedError()\n\n    def __init__(self, model_max_length=None, **kwargs):\n\n        super().__init__(**kwargs)\n\n        # For backward compatibility we fallback to set model_max_length from max_len if provided\n        if \"max_len\" in kwargs:\n            warnings.warn(\n                \"Parameter max_len is deprecated and will be removed in a future release. \"\n                \"Use model_max_length instead.\",\n                category=FutureWarning,\n            )\n\n            model_max_length = kwargs.pop(\"max_len\")\n        self.model_max_length = model_max_length if model_max_length is not None else VERY_LARGE_INTEGER\n\n        # Padding side is right by default and overridden in subclasses. 
If specified in the kwargs, it is changed.\n        self.padding_side = kwargs.pop(\"padding_side\", self.padding_side)\n        assert self.padding_side in [\n            \"right\",\n            \"left\",\n        ], f\"Padding side should be selected between 'right' and 'left', current value: {self.padding_side}\"\n        self.model_input_names = kwargs.pop(\"model_input_names\", self.model_input_names)\n\n        # Added tokens\n        self.added_tokens_encoder = {}\n        self.unique_added_tokens_encoder = set()\n        self.added_tokens_decoder = {}\n\n        # inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``)\n        self.init_inputs = ()\n        self.init_kwargs = {}\n\n    def __len__(self):\n        \"\"\" Size of the full vocabulary with the added tokens \"\"\"\n        return self.vocab_size + len(self.added_tokens_encoder)\n\n    @classmethod\n    def from_pretrained(cls, *inputs, **kwargs):\n        r\"\"\"\n        Instantiate a :class:`~transformers1.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. 
See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')\n\n            # If the tokenizer uses a single vocabulary file, you can point directly to this file\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')\n\n            # You can link tokens to special vocabulary when instantiating\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')\n            # You should be sure '<unk>' is in the vocabulary when doing that.\n            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)\n            assert tokenizer.unk_token == '<unk>'\n\n        \"\"\"\n        return cls._from_pretrained(*inputs, **kwargs)\n\n    @classmethod\n    def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        s3_models = list(cls.max_model_input_sizes.keys())\n        vocab_files = {}\n        init_configuration = {}\n        if pretrained_model_name_or_path in s3_models:\n            # Get the vocabulary from AWS S3 bucket\n            for file_id, map_list in cls.pretrained_vocab_files_map.items():\n                vocab_files[file_id] = map_list[pretrained_model_name_or_path]\n            if (\n                cls.pretrained_init_configuration\n                and pretrained_model_name_or_path in cls.pretrained_init_configuration\n            ):\n                init_configuration = cls.pretrained_init_configuration[pretrained_model_name_or_path].copy()\n        else:\n            # Get the vocabulary from local files\n            logger.info(\n                \"Model name '{}' not found in model shortcut name list ({}). 
\"\n                \"Assuming '{}' is a path, a model identifier, or url to a directory containing tokenizer files.\".format(\n                    pretrained_model_name_or_path, \", \".join(s3_models), pretrained_model_name_or_path\n                )\n            )\n\n            if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                if len(cls.vocab_files_names) > 1:\n                    raise ValueError(\n                        f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not supported.\"\n                        \"Use a model identifier or the path to a directory instead.\"\n                    )\n                logger.warning(\n                    f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is deprecated\"\n                )\n                file_id = list(cls.vocab_files_names.keys())[0]\n                vocab_files[file_id] = pretrained_model_name_or_path\n            else:\n                # At this point pretrained_model_name_or_path is either a directory or a model identifier name\n                additional_files_names = {\n                    \"added_tokens_file\": ADDED_TOKENS_FILE,\n                    \"special_tokens_map_file\": SPECIAL_TOKENS_MAP_FILE,\n                    \"tokenizer_config_file\": TOKENIZER_CONFIG_FILE,\n                }\n                # Look for the tokenizer main vocabulary files + the additional tokens files\n                for file_id, file_name in {**cls.vocab_files_names, **additional_files_names}.items():\n                    if os.path.isdir(pretrained_model_name_or_path):\n                        full_file_name = os.path.join(pretrained_model_name_or_path, file_name)\n                        if not os.path.exists(full_file_name):\n                            logger.info(\"Didn't find file {}. We won't load it.\".format(full_file_name))\n                            full_file_name = None\n                    else:\n                        full_file_name = hf_bucket_url(\n                            pretrained_model_name_or_path, filename=file_name, use_cdn=False\n                        )\n\n                    vocab_files[file_id] = full_file_name\n\n        # Get files from url, cache, or disk depending on the case\n        try:\n            resolved_vocab_files = {}\n            for file_id, file_path in vocab_files.items():\n                if file_path is None:\n                    resolved_vocab_files[file_id] = None\n                else:\n                    resolved_vocab_files[file_id] = cached_path(\n                        file_path,\n                        cache_dir=cache_dir,\n                        force_download=force_download,\n                        proxies=proxies,\n                        resume_download=resume_download,\n                        local_files_only=local_files_only,\n                    )\n        except EnvironmentError:\n            if pretrained_model_name_or_path in s3_models:\n                msg = \"Couldn't reach server at '{}' to download vocabulary files.\"\n            else:\n                msg = (\n                    \"Model name '{}' was not found in tokenizers model name list ({}). 
\"\n                    \"We assumed '{}' was a path or url to a directory containing vocabulary files \"\n                    \"named {}, but couldn't find such vocabulary files at this path or url.\".format(\n                        pretrained_model_name_or_path,\n                        \", \".join(s3_models),\n                        pretrained_model_name_or_path,\n                        list(cls.vocab_files_names.values()),\n                    )\n                )\n\n            raise EnvironmentError(msg)\n\n        if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):\n            raise EnvironmentError(\n                \"Model name '{}' was not found in tokenizers model name list ({}). \"\n                \"We assumed '{}' was a path, a model identifier, or url to a directory containing vocabulary files \"\n                \"named {} but couldn't find such vocabulary files at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(s3_models),\n                    pretrained_model_name_or_path,\n                    list(cls.vocab_files_names.values()),\n                )\n            )\n\n        for file_id, file_path in vocab_files.items():\n            if file_path == resolved_vocab_files[file_id]:\n                logger.info(\"loading file {}\".format(file_path))\n            else:\n                logger.info(\"loading file {} from cache at {}\".format(file_path, resolved_vocab_files[file_id]))\n\n        # Prepare tokenizer initialization kwargs\n        # Did we saved some inputs and kwargs to reload ?\n        tokenizer_config_file = resolved_vocab_files.pop(\"tokenizer_config_file\", None)\n        if tokenizer_config_file is not None:\n            with open(tokenizer_config_file, encoding=\"utf-8\") as tokenizer_config_handle:\n                init_kwargs = json.load(tokenizer_config_handle)\n            saved_init_inputs = init_kwargs.pop(\"init_inputs\", ())\n            if not init_inputs:\n                init_inputs = saved_init_inputs\n        else:\n            init_kwargs = init_configuration\n\n        # Update with newly provided kwargs\n        init_kwargs.update(kwargs)\n\n        # Set max length if needed\n        if pretrained_model_name_or_path in cls.max_model_input_sizes:\n            # if we're using a pretrained model, ensure the tokenizer\n            # wont index sequences longer than the number of positional embeddings\n            model_max_length = cls.max_model_input_sizes[pretrained_model_name_or_path]\n            if model_max_length is not None and isinstance(model_max_length, (int, float)):\n                init_kwargs[\"model_max_length\"] = min(init_kwargs.get(\"model_max_length\", int(1e30)), model_max_length)\n\n        # Merge resolved_vocab_files arguments in init_kwargs.\n        added_tokens_file = resolved_vocab_files.pop(\"added_tokens_file\", None)\n        special_tokens_map_file = resolved_vocab_files.pop(\"special_tokens_map_file\", None)\n        for args_name, file_path in resolved_vocab_files.items():\n            if args_name not in init_kwargs:\n                init_kwargs[args_name] = file_path\n        if special_tokens_map_file is not None:\n            with open(special_tokens_map_file, encoding=\"utf-8\") as special_tokens_map_handle:\n                special_tokens_map = json.load(special_tokens_map_handle)\n            for key, value in special_tokens_map.items():\n                if key not in init_kwargs:\n                    
init_kwargs[key] = value\n\n        # Instantiate tokenizer.\n        try:\n            tokenizer = cls(*init_inputs, **init_kwargs)\n        except OSError:\n            raise OSError(\n                \"Unable to load vocabulary from file. \"\n                \"Please check that the provided vocabulary is accessible and not corrupted.\"\n            )\n\n        # Save inputs and kwargs for saving and re-loading with ``save_pretrained``\n        tokenizer.init_inputs = init_inputs\n        tokenizer.init_kwargs = init_kwargs\n\n        # update unique_added_tokens_encoder with special tokens for correct tokenization\n        tokenizer.unique_added_tokens_encoder.update(set(tokenizer.all_special_tokens))\n\n        # Add supplementary tokens.\n        if added_tokens_file is not None:\n            with open(added_tokens_file, encoding=\"utf-8\") as added_tokens_handle:\n                added_tok_encoder = json.load(added_tokens_handle)\n            added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n            tokenizer.added_tokens_encoder.update(added_tok_encoder)\n            tokenizer.added_tokens_decoder.update(added_tok_decoder)\n            tokenizer.unique_added_tokens_encoder.update(set(tokenizer.added_tokens_encoder.keys()))\n\n        return tokenizer\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save the tokenizer vocabulary files together with:\n                - added tokens,\n                - special-tokens-to-class-attributes-mapping,\n                - tokenizer instantiation positional and keywords inputs (e.g. do_lower_case for Bert).\n\n            Warning: This won't save modifications you may have applied to the tokenizer after the instantiation\n            (e.g. modifying tokenizer.do_lower_case after creation).\n\n            This method make sure the full tokenizer can then be re-loaded using the\n            :func:`~transformers1.PreTrainedTokenizer.from_pretrained` class method.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Saving directory ({}) should be a directory\".format(save_directory))\n            return\n\n        special_tokens_map_file = os.path.join(save_directory, SPECIAL_TOKENS_MAP_FILE)\n        added_tokens_file = os.path.join(save_directory, ADDED_TOKENS_FILE)\n        tokenizer_config_file = os.path.join(save_directory, TOKENIZER_CONFIG_FILE)\n\n        tokenizer_config = copy.deepcopy(self.init_kwargs)\n        if len(self.init_inputs) > 0:\n            tokenizer_config[\"init_inputs\"] = copy.deepcopy(self.init_inputs)\n        for file_id in self.vocab_files_names.keys():\n            tokenizer_config.pop(file_id, None)\n\n        with open(tokenizer_config_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(tokenizer_config, ensure_ascii=False))\n\n        with open(special_tokens_map_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.special_tokens_map, ensure_ascii=False))\n\n        if len(self.added_tokens_encoder) > 0:\n            with open(added_tokens_file, \"w\", encoding=\"utf-8\") as f:\n                out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)\n                f.write(out_str)\n\n        vocab_files = self.save_vocabulary(save_directory)\n\n        return vocab_files + (special_tokens_map_file, added_tokens_file)\n\n    def save_vocabulary(self, save_directory) -> Tuple[str]:\n        \"\"\" Save the tokenizer vocabulary to a directory. 
This method does *NOT* save added tokens\n            and special token mappings.\n\n            Please use :func:`~transformers1.PreTrainedTokenizer.save_pretrained` `()` to save the full\n            Tokenizer state if you want to reload it using the :func:`~transformers1.PreTrainedTokenizer.from_pretrained`\n            class method.\n        \"\"\"\n        raise NotImplementedError\n\n    def add_tokens(self, new_tokens: Union[str, List[str]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string. Each string is a token to add. Tokens are only added if they are not\n            already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.\n        \"\"\"\n        if not new_tokens:\n            return 0\n\n        if not isinstance(new_tokens, list):\n            new_tokens = [new_tokens]\n\n        tokens_to_add = []\n        for token in new_tokens:\n            assert isinstance(token, str)\n            if self.init_kwargs.get(\"do_lower_case\", False) and token not in self.all_special_tokens:\n                token = token.lower()\n            if (\n                token != self.unk_token\n                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)\n                and token not in tokens_to_add\n            ):\n                tokens_to_add.append(token)\n                logger.info(\"Adding %s to the vocabulary\", token)\n\n        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))\n        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n        self.added_tokens_encoder.update(added_tok_encoder)\n        self.unique_added_tokens_encoder = set(self.added_tokens_encoder.keys()).union(set(self.all_special_tokens))\n        self.added_tokens_decoder.update(added_tok_decoder)\n\n        return len(tokens_to_add)\n\n    def num_special_tokens_to_add(self, pair=False):\n        \"\"\"\n        Returns the number of added tokens when encoding a sequence with special tokens.\n\n        Note:\n            This encodes inputs and checks the number of added tokens, and is therefore not efficient. 
Do not put this\n            inside your training loop.\n\n        Args:\n            pair: Returns the number of added tokens in the case of a sequence pair if set to True, returns the\n                number of added tokens in the case of a single sequence if set to False.\n\n        Returns:\n            Number of tokens added to sequences\n        \"\"\"\n        token_ids_0 = []\n        token_ids_1 = []\n        return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))\n\n    def add_special_tokens(self, special_tokens_dict):\n        \"\"\"\n        Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them\n        to class attributes. If special tokens are NOT in the vocabulary, they are added\n        to it (indexed starting from the last index of the current vocabulary).\n\n        Using `add_special_tokens` will ensure your special tokens can be used in several ways:\n\n        - special tokens are carefully handled by the tokenizer (they are never split)\n        - you can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This makes it easy to develop model-agnostic training and fine-tuning scripts.\n\n        When possible, special tokens are already registered for provided pretrained models (ex: BertTokenizer cls_token is already registered to be '[CLS]' and XLM's one is also registered to be '</s>')\n\n        Args:\n            special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes:\n                [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``,\n                ``additional_special_tokens``].\n\n                Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to add a new classification token to GPT-2\n            tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n            model = GPT2Model.from_pretrained('gpt2')\n\n            special_tokens_dict = {'cls_token': '<CLS>'}\n\n            num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n\n            assert tokenizer.cls_token == '<CLS>'\n        \"\"\"\n        if not special_tokens_dict:\n            return 0\n\n        added_tokens = 0\n        for key, value in special_tokens_dict.items():\n            assert key in self.SPECIAL_TOKENS_ATTRIBUTES\n            if key == \"additional_special_tokens\":\n                assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                added_tokens += self.add_tokens(value)\n            else:\n                assert isinstance(value, str)\n                added_tokens += self.add_tokens([value])\n            logger.info(\"Assigning %s to the %s key of the tokenizer\", value, key)\n            setattr(self, key, value)\n\n        return added_tokens\n\n    def tokenize(self, text: TextInput, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Take care of added tokens.\n\n            Args:\n                text (:obj:`string`): The sequence to be encoded.\n                **kwargs (:obj: `dict`): Arguments passed to the model-specific `prepare_for_tokenization` preprocessing method.\n        \"\"\"\n        all_special_tokens = self.all_special_tokens\n        text = self.prepare_for_tokenization(text, **kwargs)\n\n        # TODO: should this be in the base class?\n        def lowercase_text(t):\n            # convert non-special tokens to lowercase\n            escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]\n            pattern = r\"(\" + r\"|\".join(escaped_special_toks) + r\")|\" + r\"(.+?)\"\n            return re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), t)\n\n        if self.init_kwargs.get(\"do_lower_case\", False):\n            text = lowercase_text(text)\n\n        def split_on_token(tok, text):\n            result = []\n            split_text = text.split(tok)\n            for i, sub_text in enumerate(split_text):\n                sub_text = sub_text.rstrip()\n                if i == 0 and not sub_text:\n                    result += [tok]\n                elif i == len(split_text) - 1:\n                    if sub_text:\n                        result += [sub_text]\n                    else:\n                        pass\n                else:\n                    if sub_text:\n                        result += [sub_text]\n                    result += [tok]\n            return result\n\n        def split_on_tokens(tok_list, text):\n            if not text.strip():\n                return []\n            if not tok_list:\n                return self._tokenize(text)\n\n            tokenized_text = []\n            text_list = [text]\n            for tok in tok_list:\n                tokenized_text = []\n                for sub_text in text_list:\n                    if sub_text not in self.unique_added_tokens_encoder:\n                        tokenized_text += split_on_token(tok, sub_text)\n                    else:\n                        tokenized_text += [sub_text]\n                text_list = tokenized_text\n\n            return list(\n                itertools.chain.from_iterable(\n                    (\n                        self._tokenize(token) if token not in self.unique_added_tokens_encoder else [token]\n                        for token in tokenized_text\n                    )\n           
     )\n            )\n\n        added_tokens = self.unique_added_tokens_encoder\n        tokenized_text = split_on_tokens(added_tokens, text)\n        return tokenized_text\n\n    def _tokenize(self, text, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Do NOT take care of added tokens.\n        \"\"\"\n        raise NotImplementedError\n\n    def convert_tokens_to_ids(self, tokens):\n        \"\"\" Converts a token string (or a sequence of tokens) in a single integer id\n            (or a sequence of ids), using the vocabulary.\n        \"\"\"\n        if tokens is None:\n            return None\n\n        if isinstance(tokens, str):\n            return self._convert_token_to_id_with_added_voc(tokens)\n\n        ids = []\n        for token in tokens:\n            ids.append(self._convert_token_to_id_with_added_voc(token))\n        return ids\n\n    def _convert_token_to_id_with_added_voc(self, token):\n        if token is None:\n            return None\n\n        if token in self.added_tokens_encoder:\n            return self.added_tokens_encoder[token]\n        return self._convert_token_to_id(token)\n\n    def _convert_token_to_id(self, token):\n        raise NotImplementedError\n\n    def encode(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        **kwargs\n    ):\n        \"\"\"\n        Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.\n\n        Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary.\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. 
The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            **kwargs: passed to the `self.tokenize()` method\n        \"\"\"\n        encoded_inputs = self.encode_plus(\n            text,\n            text_pair=text_pair,\n            max_length=max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            return_tensors=return_tensors,\n            **kwargs,\n        )\n\n        return encoded_inputs[\"input_ids\"]\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]` (the later only for not-fast tokenizers)):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_mask (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. 
If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_mask (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError.\n                This one is only available on fast tokenizers inheriting from PreTrainedTokenizerFast.\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    attention_mask: list[int] if return_attention_mask is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True``\n                    and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. 
Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. \"\n                \"In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` \"\n                \"or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        first_ids = get_input_ids(text)\n        second_ids = get_input_ids(text_pair) if text_pair is not None else None\n\n        return self.prepare_for_model(\n            first_ids,\n            pair_ids=second_ids,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            return_tensors=return_tensors,\n            return_attention_mask=return_attention_mask,\n            return_token_type_ids=return_token_type_ids,\n            return_overflowing_tokens=return_overflowing_tokens,\n            return_special_tokens_mask=return_special_tokens_mask,\n        )\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput],\n            List[TextInputPair],\n            List[PreTokenizedInput],\n            List[PreTokenizedInputPair],\n            List[EncodedInput],\n            List[EncodedInputPair],\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_masks: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_masks: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            batch_text_or_text_pairs (:obj:`List[str]`,  :obj:`List[Tuple[str, str]]`,\n                                      :obj:`List[List[str]]`,  :obj:`List[Tuple[List[str], List[str]]]`,\n                                      and for not-fast tokenizers, also:\n                                      :obj:`List[List[int]]`,  :obj:`List[Tuple[List[int], List[int]]]`):\n                Batch of sequences or pair of sequences to be encoded.\n                This can be a list of 
string/string-sequences/int-sequences or a list of pair of\n                string/string-sequences/int-sequence (see details in encode_plus)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_masks (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? 
<../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_masks (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError. This one is only available on\n                Rust-based tokenizers inheriting from PreTrainedTokenizerFast.\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[List[int]],\n                    token_type_ids: list[List[int]] if return_token_type_ids is True (default)\n                    attention_mask: list[List[int]] if return_attention_mask is True (default)\n                    overflowing_tokens: list[List[int]] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: List[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[List[int]] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. 
In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        input_ids = []\n        for ids_or_pair_ids in batch_text_or_text_pairs:\n            if isinstance(ids_or_pair_ids, (list, tuple)) and len(ids_or_pair_ids) == 2 and not is_pretokenized:\n                ids, pair_ids = ids_or_pair_ids\n            else:\n                ids, pair_ids = ids_or_pair_ids, None\n\n            first_ids = get_input_ids(ids)\n            second_ids = get_input_ids(pair_ids) if pair_ids is not None else None\n            input_ids.append((first_ids, second_ids))\n\n        if max_length is None and pad_to_max_length:\n\n            def total_sequence_length(input_pairs):\n                first_ids, second_ids = input_pairs\n                return len(first_ids) + (\n                    self.num_special_tokens_to_add()\n                    if second_ids is None\n                    else (len(second_ids) + self.num_special_tokens_to_add(pair=True))\n                )\n\n            max_length = max([total_sequence_length(ids) for ids in input_ids])\n\n        batch_outputs = {}\n        for first_ids, second_ids in input_ids:\n            # Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by\n            # the model. 
It adds special tokens, truncates sequences if overflowing while taking into account\n            # the special tokens and manages a window stride for overflowing tokens\n            outputs = self.prepare_for_model(\n                first_ids,\n                pair_ids=second_ids,\n                max_length=max_length,\n                pad_to_max_length=pad_to_max_length,\n                add_special_tokens=add_special_tokens,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_attention_mask=return_attention_masks,\n                return_token_type_ids=return_token_type_ids,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_masks,\n                return_lengths=return_lengths,\n                return_tensors=None,  # We will convert the whole batch to tensors at the end\n            )\n\n            for key, value in outputs.items():\n                if key not in batch_outputs:\n                    batch_outputs[key] = []\n                batch_outputs[key].append(value)\n\n        if return_tensors is not None:\n\n            self.convert_to_tensors_(batch_outputs, return_tensors)\n        return BatchEncoding(batch_outputs)\n\n    def convert_to_tensors_(self, batch_outputs: dict, return_tensors: str) -> None:\n        # Do the tensor conversion in batch\n        for key, value in batch_outputs.items():\n            if return_tensors == \"tf\" and is_tf_available():\n                try:\n                    batch_outputs[key] = tf.constant(value)\n                except ValueError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n            elif return_tensors == \"pt\" and is_torch_available():\n                try:\n                    batch_outputs[key] = torch.tensor(value)\n                except ValueError:\n                    raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n                except RuntimeError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise\n\n            elif return_tensors is not None:\n                logger.warning(\n                    \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                        return_tensors\n                    )\n                )\n\n    def prepare_for_model(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        max_length: Optional[int] = None,\n        add_special_tokens: bool = True,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_lengths: bool = False,\n    ) -> BatchEncoding:\n        \"\"\" Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.\n        It adds special tokens, truncates sequences if 
overflowing while taking into account the special tokens and\n        manages a moving window (with user defined stride) for overflowing tokens\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            max_length: maximum length of the returned list. Will truncate by taking into account the special tokens.\n            add_special_tokens: if set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            stride: window stride for overflowing tokens. Can be useful to remove edge effect when using sequential\n                list of inputs. The overflowing token will contains a part of the previous window of tokens.\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length: if set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the model's max length.\n                The tokenizer padding sides are handled by the following strings:\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant\n                or PyTorch torch.Tensor instead of a list of python integers.\n            return_token_type_ids: (optional) Set to False to avoid returning token_type_ids (default: set to model specifics).\n            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)\n            return_overflowing_tokens: (optional) Set to True to return overflowing token information (default False).\n            return_special_tokens_mask: (optional) Set to True to return special tokens mask information (default False).\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                    length: int if 
return_lengths is True\n                }\n\n            With the fields:\n                - ``input_ids``: list of token ids to be fed to a model\n                - ``token_type_ids``: list of token type ids to be fed to a model\n\n                - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n                - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n                - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n                    tokens and 1 specifying sequence tokens.\n                - ``length``: this is the length of ``input_ids``\n        \"\"\"\n        pair = bool(pair_ids is not None)\n        len_ids = len(ids)\n        len_pair_ids = len(pair_ids) if pair else 0\n\n        # Load from model defaults\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        encoded_inputs = {}\n\n        # Truncation: Handle max sequence length\n        total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)\n        if max_length and total_len > max_length:\n            ids, pair_ids, overflowing_tokens = self.truncate_sequences(\n                ids,\n                pair_ids=pair_ids,\n                num_tokens_to_remove=total_len - max_length,\n                truncation_strategy=truncation_strategy,\n                stride=stride,\n            )\n            if return_overflowing_tokens:\n                encoded_inputs[\"overflowing_tokens\"] = overflowing_tokens\n                encoded_inputs[\"num_truncated_tokens\"] = total_len - max_length\n\n        # Add special tokens\n        if add_special_tokens:\n            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)\n            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)\n        else:\n            sequence = ids + pair_ids if pair else ids\n            token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])\n\n        # Build output dictionnary\n        encoded_inputs[\"input_ids\"] = sequence\n        if return_token_type_ids:\n            encoded_inputs[\"token_type_ids\"] = token_type_ids\n        if return_special_tokens_mask:\n            if add_special_tokens:\n                encoded_inputs[\"special_tokens_mask\"] = self.get_special_tokens_mask(ids, pair_ids)\n            else:\n                encoded_inputs[\"special_tokens_mask\"] = [0] * len(sequence)\n\n        # Check lengths\n        assert max_length is None or len(encoded_inputs[\"input_ids\"]) <= max_length\n        if max_length is None and len(encoded_inputs[\"input_ids\"]) > self.model_max_length:\n            logger.warning(\n                \"Token indices sequence length is longer than the specified maximum sequence length \"\n                \"for this model ({} > {}). 
Running this sequence through the model will result in \"\n                \"indexing errors\".format(len(ids), self.model_max_length)\n            )\n\n        # Padding\n        needs_to_be_padded = pad_to_max_length and (\n            max_length\n            and len(encoded_inputs[\"input_ids\"]) < max_length\n            or max_length is None\n            and len(encoded_inputs[\"input_ids\"]) < self.model_max_length\n            and self.model_max_length <= LARGE_INTEGER\n        )\n\n        if pad_to_max_length and max_length is None and self.model_max_length > LARGE_INTEGER:\n            logger.warning(\n                \"Sequence can't be padded as no maximum length is specified and the model maximum length is too high.\"\n            )\n\n        if needs_to_be_padded:\n            difference = (max_length if max_length is not None else self.model_max_length) - len(\n                encoded_inputs[\"input_ids\"]\n            )\n            if self.padding_side == \"right\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"]) + [0] * difference\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = (\n                        encoded_inputs[\"token_type_ids\"] + [self.pad_token_type_id] * difference\n                    )\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = encoded_inputs[\"special_tokens_mask\"] + [1] * difference\n                encoded_inputs[\"input_ids\"] = encoded_inputs[\"input_ids\"] + [self.pad_token_id] * difference\n            elif self.padding_side == \"left\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [0] * difference + [1] * len(encoded_inputs[\"input_ids\"])\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = [self.pad_token_type_id] * difference + encoded_inputs[\n                        \"token_type_ids\"\n                    ]\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = [1] * difference + encoded_inputs[\"special_tokens_mask\"]\n                encoded_inputs[\"input_ids\"] = [self.pad_token_id] * difference + encoded_inputs[\"input_ids\"]\n            else:\n                raise ValueError(\"Invalid padding strategy:\" + str(self.padding_side))\n        else:\n            if return_attention_mask:\n                encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"])\n\n        if return_lengths:\n            encoded_inputs[\"length\"] = len(encoded_inputs[\"input_ids\"])\n\n        # Prepare model inputs as tensors if asked\n        if return_tensors == \"tf\" and is_tf_available():\n            encoded_inputs[\"input_ids\"] = tf.constant([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = tf.constant([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = tf.constant([encoded_inputs[\"attention_mask\"]])\n\n        elif return_tensors == \"pt\" and is_torch_available():\n            encoded_inputs[\"input_ids\"] = torch.tensor([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = 
torch.tensor([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = torch.tensor([encoded_inputs[\"attention_mask\"]])\n        elif return_tensors is not None:\n            logger.warning(\n                \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                    return_tensors\n                )\n            )\n\n        return BatchEncoding(encoded_inputs)\n\n    def prepare_for_tokenization(self, text: str, **kwargs) -> str:\n        \"\"\" Performs any necessary transformations before tokenization \"\"\"\n        return text\n\n    def truncate_sequences(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        num_tokens_to_remove: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        stride: int = 0,\n    ) -> Tuple[List[int], List[int], List[int]]:\n        \"\"\" Truncates a sequence pair in place to the maximum length.\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):\n                number of tokens to remove using the truncation strategy\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences).\n                    Overflowing tokens only contains overflow from the first sequence.\n                - 'only_first': Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. 
The value of this argument defines the number of additional tokens.\n        \"\"\"\n        if num_tokens_to_remove <= 0:\n            return ids, pair_ids, []\n\n        if truncation_strategy == \"longest_first\":\n            overflowing_tokens = []\n            for _ in range(num_tokens_to_remove):\n                if pair_ids is None or len(ids) > len(pair_ids):\n                    overflowing_tokens = [ids[-1]] + overflowing_tokens\n                    ids = ids[:-1]\n                else:\n                    pair_ids = pair_ids[:-1]\n            window_len = min(len(ids), stride)\n            if window_len > 0:\n                overflowing_tokens = ids[-window_len:] + overflowing_tokens\n        elif truncation_strategy == \"only_first\":\n            assert len(ids) > num_tokens_to_remove\n            window_len = min(len(ids), stride + num_tokens_to_remove)\n            overflowing_tokens = ids[-window_len:]\n            ids = ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"only_second\":\n            assert pair_ids is not None and len(pair_ids) > num_tokens_to_remove\n            window_len = min(len(pair_ids), stride + num_tokens_to_remove)\n            overflowing_tokens = pair_ids[-window_len:]\n            pair_ids = pair_ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"do_not_truncate\":\n            raise ValueError(\"Input sequence are too long for max_length. Please select a truncation strategy.\")\n        else:\n            raise ValueError(\n                \"Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']\"\n            )\n        return (ids, pair_ids, overflowing_tokens)\n\n    def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:\n        if token_ids_1 is None:\n            return len(token_ids_0) * [0]\n        return [0] * len(token_ids_0) + [1] * len(token_ids_1)\n\n    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens. This implementation does not add special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return token_ids_0\n        return token_ids_0 + token_ids_1\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0: list of ids (must not contain special tokens)\n            token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids\n                for sequence pairs\n            already_has_special_tokens: (default False) Set to True if the token list is already formated with\n                special tokens for the model\n\n        Returns:\n            A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))\n\n    def convert_ids_to_tokens(\n        self, ids: Union[int, List[int]], skip_special_tokens: bool = False\n    ) -> Union[int, List[int]]:\n        \"\"\" Converts a single index or a sequence of indices (integers) in a token \"\n            (resp.) a sequence of tokens (str), using the vocabulary and added tokens.\n\n            Args:\n                skip_special_tokens: Don't decode special tokens (self.all_special_tokens). Default: False\n        \"\"\"\n        if isinstance(ids, int):\n            if ids in self.added_tokens_decoder:\n                return self.added_tokens_decoder[ids]\n            else:\n                return self._convert_id_to_token(ids)\n        tokens = []\n        for index in ids:\n            index = int(index)\n            if skip_special_tokens and index in self.all_special_ids:\n                continue\n            if index in self.added_tokens_decoder:\n                tokens.append(self.added_tokens_decoder[index])\n            else:\n                tokens.append(self._convert_id_to_token(index))\n        return tokens\n\n    def _convert_id_to_token(self, index: int) -> str:\n        raise NotImplementedError\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\" Converts a sequence of tokens (string) in a single string.\n            The most simple way to do it is ' '.join(self.convert_ids_to_tokens(token_ids))\n            but we often want to remove sub-word tokenization artifacts at the same time.\n        \"\"\"\n        return \" \".join(self.convert_ids_to_tokens(tokens))\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        \"\"\"\n        Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary\n        with options to remove special tokens and clean up tokenization spaces.\n        Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.\n\n        Args:\n            token_ids: list of tokenized input ids. Can be obtained using the `encode` or `encode_plus` methods.\n            skip_special_tokens: if set to True, will replace special tokens.\n            clean_up_tokenization_spaces: if set to True, will clean up the tokenization spaces.\n        \"\"\"\n        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)\n\n        # To avoid mixing byte-level and unicode for byte-level BPT\n        # we need to build string separatly for added tokens and byte-level tokens\n        # cf. 
https://github.com/huggingface/transformers/issues/1133\n        sub_texts = []\n        current_sub_text = []\n        for token in filtered_tokens:\n            if skip_special_tokens and token in self.all_special_ids:\n                continue\n            if token in self.added_tokens_encoder:\n                if current_sub_text:\n                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n                    current_sub_text = []\n                sub_texts.append(token)\n            else:\n                current_sub_text.append(token)\n        if current_sub_text:\n            sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n        text = \" \".join(sub_texts)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:\n        return [self.decode(seq, **kwargs) for seq in sequences]\n\n    @staticmethod\n    def clean_up_tokenization(out_string: str) -> str:\n        \"\"\" Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.\n        \"\"\"\n        out_string = (\n            out_string.replace(\" .\", \".\")\n            .replace(\" ?\", \"?\")\n            .replace(\" !\", \"!\")\n            .replace(\" ,\", \",\")\n            .replace(\" ' \", \"'\")\n            .replace(\" n't\", \"n't\")\n            .replace(\" 'm\", \"'m\")\n            .replace(\" 's\", \"'s\")\n            .replace(\" 've\", \"'ve\")\n            .replace(\" 're\", \"'re\")\n        )\n        return out_string\n\n\nclass PreTrainedTokenizerFast(PreTrainedTokenizer):\n    \"\"\" Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).\n\n    Inherit from PreTrainedTokenizer.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as associated values, a dictionnary of specific arguments to pass to the\n            
``__init__``method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``tokenizer`` (`BaseTokenizerFast`): A Fast tokenizer from the HuggingFace tokenizer library (in low level Rust language)\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, will default to VERY_LARGE_INTEGER (`int(1e30)`).\n            no associated max_length can be found in ``max_model_input_sizes``.\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). 
Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensure they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    def __init__(self, tokenizer: BaseTokenizerFast, **kwargs):\n        if not isinstance(tokenizer, BaseTokenizerFast):\n            raise ValueError(\n                \"Tokenizer should be an instance of a Tokenizer \" \"provided by HuggingFace tokenizers library.\"\n            )\n        self._tokenizer: BaseTokenizerFast = tokenizer\n\n        # Initialize all the rest of the kwargs\n        super().__init__(**kwargs)\n\n    @property\n    def backend_tokenizer(self) -> BaseTokenizerFast:\n        return self._tokenizer\n\n    @property\n    def decoder(self) -> DecoderFast:\n        return self._tokenizer._tokenizer.decoder\n\n    @property\n    def is_fast(self) -> bool:\n        return True\n\n    @property\n    def vocab_size(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=False)\n\n    def __len__(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=True)\n\n    def _maybe_update_backend(self, value):\n        \"\"\" Update the backend fast tokenizer.\n            Override method from base class SpecialTokensMixin \"\"\"\n        self._tokenizer.add_special_tokens(value)\n\n    def _convert_encoding(\n        self,\n        encoding: EncodingFast,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n    ) -> Dict[str, Any]:\n        \"\"\" Convert the encoding representation (from low-level HuggingFace tokenizer output) to a python Dict.\n\n            Overflowing tokens are converted to additional examples (like batches) so the output values of\n            the dict are lists (overflows) of lists (tokens).\n\n            If return_tensors is not None, these lists of lists are converted to 2-D tensors\n            for input_ids, token_type_ids and attention_mask.\n            Output shape: (overflows, sequence length)\n        \"\"\"\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        if return_overflowing_tokens and encoding.overflowing is not None:\n            encodings = [encoding] + encoding.overflowing\n        else:\n            encodings = [encoding]\n\n        encoding_dict = defaultdict(list)\n        for e in encodings:\n            encoding_dict[\"input_ids\"].append(e.ids)\n\n            if return_token_type_ids:\n                encoding_dict[\"token_type_ids\"].append(e.type_ids)\n            if return_attention_mask:\n                encoding_dict[\"attention_mask\"].append(e.attention_mask)\n            if return_special_tokens_mask:\n                encoding_dict[\"special_tokens_mask\"].append(e.special_tokens_mask)\n            if return_offsets_mapping:\n                encoding_dict[\"offset_mapping\"].append(e.offsets)\n\n        if return_tensors is not 
None:\n            for key, value in encoding_dict.items():\n                if return_tensors == \"tf\" and is_tf_available():\n                    encoding_dict[key] = tf.constant(value)\n                elif return_tensors == \"pt\" and is_torch_available():\n                    encoding_dict[key] = torch.tensor(value)\n                elif return_tensors is not None:\n                    logger.warning(\n                        \"Unable to convert output to tensors format {}, \"\n                        \"PyTorch or TensorFlow is not available.\".format(return_tensors)\n                    )\n\n        return encoding_dict\n\n    def _convert_token_to_id_with_added_voc(self, token: int) -> str:\n        index = self._tokenizer.token_to_id(token)\n        if index is None:\n            return self.unk_token_id\n        return index\n\n    def _convert_id_to_token(self, index: int) -> Optional[str]:\n        return self._tokenizer.id_to_token(int(index))\n\n    def get_vocab(self):\n        return self._tokenizer.get_vocab(True)\n\n    def convert_tokens_to_string(self, tokens: List[int], skip_special_tokens: bool = False) -> str:\n        return self._tokenizer.decode(tokens, skip_special_tokens)\n\n    def add_tokens(self, new_tokens: List[Union[str, AddedTokenFast]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string or AddedTokenFast. Each string is a token to add.\n            Tokens are only added if they are not already in the vocabulary. AddedTokenFast wrap a string token to let you personnalize it's behavior (Whether this token should only match against single word, whether this token should strip all potential whitespaces on the left side, Whether this token should strip all potential whitespaces on the right side...).\n            See details for AddedToken in HuggingFace tokenizers library.\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n        \"\"\"\n        if isinstance(new_tokens, str):\n            new_tokens = [new_tokens]\n        return self._tokenizer.add_tokens(new_tokens)\n\n    def add_special_tokens(self, special_tokens_dict: dict) -> int:\n        # Map special tokens to class attributes (self.pad_token...)\n        super().add_special_tokens(special_tokens_dict)\n\n        # If the backend tokenizer the only specificities of special tokens are that\n        #    - they will never be processed by the model, and\n        #    - they will be removed while decoding.\n        # But they are not mapped to special attributes in the backend so we can just\n        # send a list.\n        tokens = []\n        for token in special_tokens_dict.values():\n            if isinstance(token, list):\n                tokens += token\n            else:\n                tokens += [token]\n        num_added_tokens = self._tokenizer.add_special_tokens(tokens)\n\n        return num_added_tokens\n\n    def num_special_tokens_to_add(self, pair: bool = False) -> int:\n        return self._tokenizer.num_special_tokens_to_add(pair)\n\n    def tokenize(\n        self, text: TextInput, pair: Optional[TextInput] = None, add_special_tokens: bool = False\n    ) -> List[str]:\n        return self._tokenizer.encode(text, pair, add_special_tokens).tokens\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        if not isinstance(batch_text_or_text_pairs, list):\n            raise ValueError(\n                \"batch_text_or_text_pairs has to be a list (got {})\".format(type(batch_text_or_text_pairs))\n            )\n\n        # Needed if we have to return a tensor\n        pad_to_max_length = pad_to_max_length or (return_tensors is not None and len(batch_text_or_text_pairs) > 1)\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\"Unable to set proper padding strategy as the tokenizer does not have a padding token\")\n\n        # Set the truncation and padding strategy and restore the initial configuration\n        with truncate_and_pad(\n            tokenizer=self._tokenizer,\n            max_length=max_length,\n            stride=stride,\n            strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            padding_side=self.padding_side,\n            pad_token_id=self.pad_token_id,\n            pad_token_type_id=self.pad_token_type_id,\n            pad_token=self._pad_token,\n        ):\n\n            # Check for the pretokenized path\n            if is_pretokenized:\n                encodings = []\n\n                # Iterate over each sample (we don't know yet if they are pairs or 
simple input\n                for i, sample in enumerate(batch_text_or_text_pairs):\n\n                    if not isinstance(sample, (list, tuple)):\n                        raise TypeError(\n                            \"batch_encode_plus(..., is_pretokenized=True) requires batch_text_or_text_pairs \"\n                            \"to be either List[List[str]] or List[Tuple[List[str], List[str]]] but sample at \"\n                            \"index {} is of type {}\".format(i, type(sample))\n                        )\n\n                    # Test if we have a pair of sentences by checking the depth of nesting\n                    is_pair = bool(len(sample) > 0 and isinstance(sample[0], (list, tuple)))\n\n                    # Take care of the first sequence - we multi-thread over the words\n                    encodings_text = EncodingFast.merge(\n                        self._tokenizer.encode_batch(sample[0] if is_pair else sample, add_special_tokens=False),\n                        growing_offsets=True,\n                    )\n\n                    # Take care of the second sequence if we have a pair\n                    if is_pair:\n                        encodings_pair = EncodingFast.merge(\n                            self._tokenizer.encode_batch([(\"\", s) for s in sample[1]], add_special_tokens=False),\n                            growing_offsets=True,\n                        )\n                    else:\n                        encodings_pair = None\n\n                    # Post-process - truncate/pad and add special tokens\n                    encoding = self._tokenizer.post_process(encodings_text, encodings_pair, add_special_tokens)\n                    encodings.append(encoding)\n\n            # Classical path with strings input\n            else:\n                # Avoid thread overhead if only one example.\n                if len(batch_text_or_text_pairs) == 1:\n                    if isinstance(batch_text_or_text_pairs[0], (tuple, list)):\n                        encodings = self._tokenizer.encode(\n                            *batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    else:\n                        encodings = self._tokenizer.encode(\n                            batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    encodings = [encodings]\n                else:\n                    encodings = self._tokenizer.encode_batch(\n                        batch_text_or_text_pairs, add_special_tokens=add_special_tokens\n                    )\n\n        # Convert encoding to dict\n        # `Tokens` has type: List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]]\n        # with nested dimensions corresponding to batch, overflows, sequence length\n        tokens = [\n            self._convert_encoding(\n                encoding=encoding,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n            )\n            for encoding in encodings\n        ]\n\n        # Sanitize the output to have dict[list] from list[dict]\n        sanitized = {}\n        for key in tokens[0].keys():\n            # To List[List[List[int]]] of 
shape (batch, overflows, sequence length)\n            stack = [e for item in tokens for e in item[key]]\n            if return_tensors == \"tf\":\n                stack = tf.stack(stack, axis=0)\n            elif return_tensors == \"pt\":\n                stack = torch.stack(stack, dim=0)\n            # elif not return_tensors and len(stack) == 1:\n            #     stack = stack[0]\n\n            sanitized[key] = stack\n\n        # If returning overflowing tokens, we need to return a mapping\n        # from the batch idx to the original sample\n        if return_overflowing_tokens:\n            overflow_to_sample_mapping = flatten([[i] * len(enc[\"input_ids\"]) for i, enc in enumerate(tokens)])\n            sanitized[\"overflow_to_sample_mapping\"] = overflow_to_sample_mapping\n\n        return BatchEncoding(sanitized, encodings)\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = False,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        is_pretokenized: bool = False,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        # Check for pretokenized path (ie [token1, token2, ..., tokenN] -> [id1, id2, ..., idN]\n        if is_pretokenized:\n            if isinstance(text, list) and len(text) > 0:\n\n                # Encode through encode_batch with sequence of only one word which will be merged after hand\n                encoding = self._tokenizer.encode_batch(text, add_special_tokens=False)\n                encoding = EncodingFast.merge(encoding, growing_offsets=True)\n\n                # Let's do the same for pairs if provided\n                if isinstance(text_pair, list):\n                    # We prepend empty string before each word so that encoding is aware content is a pair\n                    encoding_pair = self._tokenizer.encode_batch(\n                        [(\"\", p) for p in text_pair], add_special_tokens=False\n                    )\n                    encoding_pair = EncodingFast.merge(encoding_pair, growing_offsets=True)\n                elif text_pair is None:\n                    encoding_pair = None\n                else:\n                    raise TypeError(\n                        \"encode_plus(..., is_pretokenized=True) requires text and text_pair to be List[str] \"\n                        \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                    )\n\n                # Post process and if asked to do so, insert special tokens where needed\n                encoding = self._tokenizer.post_process(encoding, encoding_pair, add_special_tokens)\n\n                batched_output = BatchEncoding(\n                    self._convert_encoding(\n                        encoding,\n                        return_tensors=return_tensors,\n                        return_token_type_ids=return_token_type_ids,\n                        return_attention_mask=return_attention_mask,\n                        return_overflowing_tokens=return_overflowing_tokens,\n                       
 return_special_tokens_mask=return_special_tokens_mask,\n                        return_offsets_mapping=return_offsets_mapping,\n                    ),\n                    encoding,\n                )\n            else:\n                raise TypeError(\n                    \"encode_plus(..., is_pretokenized=True) requires text to be List[str] \"\n                    \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                )\n        else:\n            batched_input = [(text, text_pair)] if text_pair else [text]\n            batched_output = self.batch_encode_plus(\n                batched_input,\n                add_special_tokens=add_special_tokens,\n                max_length=max_length,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n                pad_to_max_length=pad_to_max_length,\n                **kwargs,\n            )\n\n        # Return tensor is None, then we can remove the leading batch axis\n        if not return_tensors:\n            batched_output = BatchEncoding(\n                {\n                    key: value[0] if len(value) > 0 and isinstance(value[0], list) else value\n                    for key, value in batched_output.items()\n                },\n                batched_output.encodings,\n            )\n\n        return batched_output\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        text = self._tokenizer.decode(token_ids, skip_special_tokens)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        if os.path.isdir(save_directory):\n            files = self._tokenizer.save(save_directory)\n        else:\n            folder, file = os.path.split(os.path.abspath(save_directory))\n            files = self._tokenizer.save(folder, name=file)\n\n        return tuple(files)\n\n\ndef trim_batch(\n    input_ids, pad_token_id, attention_mask=None,\n):\n    \"\"\"Remove columns that are populated exclusively by pad_token_id\"\"\"\n    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)\n    if attention_mask is None:\n        return input_ids[:, keep_column_mask]\n    else:\n        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for XLM.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\nimport sys\nimport unicodedata\nfrom typing import List, Optional\n\nimport sacremoses as sm\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-vocab.json\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-vocab.json\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-vocab.json\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-vocab.json\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-vocab.json\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-vocab.json\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-vocab.json\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-vocab.json\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-vocab.json\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-vocab.json\",\n    },\n    \"merges_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-merges.txt\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-merges.txt\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-merges.txt\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-merges.txt\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-merges.txt\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-merges.txt\",\n    
},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-mlm-en-2048\": 512,\n    \"xlm-mlm-ende-1024\": 512,\n    \"xlm-mlm-enfr-1024\": 512,\n    \"xlm-mlm-enro-1024\": 512,\n    \"xlm-mlm-tlm-xnli15-1024\": 512,\n    \"xlm-mlm-xnli15-1024\": 512,\n    \"xlm-clm-enfr-1024\": 512,\n    \"xlm-clm-ende-1024\": 512,\n    \"xlm-mlm-17-1280\": 512,\n    \"xlm-mlm-100-1280\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"xlm-mlm-en-2048\": {\"do_lowercase_and_remove_accent\": True},\n    \"xlm-mlm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-mlm-enro-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"ro\"},\n        \"lang2id\": {\"en\": 0, \"ro\": 1},\n    },\n    \"xlm-mlm-tlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-mlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-clm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-clm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-17-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"de\",\n            \"2\": \"en\",\n            
\"3\": \"es\",\n            \"4\": \"fr\",\n            \"5\": \"hi\",\n            \"6\": \"it\",\n            \"7\": \"ja\",\n            \"8\": \"ko\",\n            \"9\": \"nl\",\n            \"10\": \"pl\",\n            \"11\": \"pt\",\n            \"12\": \"ru\",\n            \"13\": \"sv\",\n            \"14\": \"tr\",\n            \"15\": \"vi\",\n            \"16\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"de\": 1,\n            \"en\": 2,\n            \"es\": 3,\n            \"fr\": 4,\n            \"hi\": 5,\n            \"it\": 6,\n            \"ja\": 7,\n            \"ko\": 8,\n            \"nl\": 9,\n            \"pl\": 10,\n            \"pt\": 11,\n            \"ru\": 12,\n            \"sv\": 13,\n            \"tr\": 14,\n            \"vi\": 15,\n            \"zh\": 16,\n        },\n    },\n    \"xlm-mlm-100-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"af\",\n            \"1\": \"als\",\n            \"2\": \"am\",\n            \"3\": \"an\",\n            \"4\": \"ang\",\n            \"5\": \"ar\",\n            \"6\": \"arz\",\n            \"7\": \"ast\",\n            \"8\": \"az\",\n            \"9\": \"bar\",\n            \"10\": \"be\",\n            \"11\": \"bg\",\n            \"12\": \"bn\",\n            \"13\": \"br\",\n            \"14\": \"bs\",\n            \"15\": \"ca\",\n            \"16\": \"ceb\",\n            \"17\": \"ckb\",\n            \"18\": \"cs\",\n            \"19\": \"cy\",\n            \"20\": \"da\",\n            \"21\": \"de\",\n            \"22\": \"el\",\n            \"23\": \"en\",\n            \"24\": \"eo\",\n            \"25\": \"es\",\n            \"26\": \"et\",\n            \"27\": \"eu\",\n            \"28\": \"fa\",\n            \"29\": \"fi\",\n            \"30\": \"fr\",\n            \"31\": \"fy\",\n            \"32\": \"ga\",\n            \"33\": \"gan\",\n            \"34\": \"gl\",\n            \"35\": \"gu\",\n            \"36\": \"he\",\n            \"37\": \"hi\",\n            \"38\": \"hr\",\n            \"39\": \"hu\",\n            \"40\": \"hy\",\n            \"41\": \"ia\",\n            \"42\": \"id\",\n            \"43\": \"is\",\n            \"44\": \"it\",\n            \"45\": \"ja\",\n            \"46\": \"jv\",\n            \"47\": \"ka\",\n            \"48\": \"kk\",\n            \"49\": \"kn\",\n            \"50\": \"ko\",\n            \"51\": \"ku\",\n            \"52\": \"la\",\n            \"53\": \"lb\",\n            \"54\": \"lt\",\n            \"55\": \"lv\",\n            \"56\": \"mk\",\n            \"57\": \"ml\",\n            \"58\": \"mn\",\n            \"59\": \"mr\",\n            \"60\": \"ms\",\n            \"61\": \"my\",\n            \"62\": \"nds\",\n            \"63\": \"ne\",\n            \"64\": \"nl\",\n            \"65\": \"nn\",\n            \"66\": \"no\",\n            \"67\": \"oc\",\n            \"68\": \"pl\",\n            \"69\": \"pt\",\n            \"70\": \"ro\",\n            \"71\": \"ru\",\n            \"72\": \"scn\",\n            \"73\": \"sco\",\n            \"74\": \"sh\",\n            \"75\": \"si\",\n            \"76\": \"simple\",\n            \"77\": \"sk\",\n            \"78\": \"sl\",\n            \"79\": \"sq\",\n            \"80\": \"sr\",\n            \"81\": \"sv\",\n            \"82\": \"sw\",\n            \"83\": \"ta\",\n            \"84\": \"te\",\n            \"85\": \"th\",\n            \"86\": \"tl\",\n            \"87\": \"tr\",\n            \"88\": \"tt\",\n      
      \"89\": \"uk\",\n            \"90\": \"ur\",\n            \"91\": \"uz\",\n            \"92\": \"vi\",\n            \"93\": \"war\",\n            \"94\": \"wuu\",\n            \"95\": \"yi\",\n            \"96\": \"zh\",\n            \"97\": \"zh_classical\",\n            \"98\": \"zh_min_nan\",\n            \"99\": \"zh_yue\",\n        },\n        \"lang2id\": {\n            \"af\": 0,\n            \"als\": 1,\n            \"am\": 2,\n            \"an\": 3,\n            \"ang\": 4,\n            \"ar\": 5,\n            \"arz\": 6,\n            \"ast\": 7,\n            \"az\": 8,\n            \"bar\": 9,\n            \"be\": 10,\n            \"bg\": 11,\n            \"bn\": 12,\n            \"br\": 13,\n            \"bs\": 14,\n            \"ca\": 15,\n            \"ceb\": 16,\n            \"ckb\": 17,\n            \"cs\": 18,\n            \"cy\": 19,\n            \"da\": 20,\n            \"de\": 21,\n            \"el\": 22,\n            \"en\": 23,\n            \"eo\": 24,\n            \"es\": 25,\n            \"et\": 26,\n            \"eu\": 27,\n            \"fa\": 28,\n            \"fi\": 29,\n            \"fr\": 30,\n            \"fy\": 31,\n            \"ga\": 32,\n            \"gan\": 33,\n            \"gl\": 34,\n            \"gu\": 35,\n            \"he\": 36,\n            \"hi\": 37,\n            \"hr\": 38,\n            \"hu\": 39,\n            \"hy\": 40,\n            \"ia\": 41,\n            \"id\": 42,\n            \"is\": 43,\n            \"it\": 44,\n            \"ja\": 45,\n            \"jv\": 46,\n            \"ka\": 47,\n            \"kk\": 48,\n            \"kn\": 49,\n            \"ko\": 50,\n            \"ku\": 51,\n            \"la\": 52,\n            \"lb\": 53,\n            \"lt\": 54,\n            \"lv\": 55,\n            \"mk\": 56,\n            \"ml\": 57,\n            \"mn\": 58,\n            \"mr\": 59,\n            \"ms\": 60,\n            \"my\": 61,\n            \"nds\": 62,\n            \"ne\": 63,\n            \"nl\": 64,\n            \"nn\": 65,\n            \"no\": 66,\n            \"oc\": 67,\n            \"pl\": 68,\n            \"pt\": 69,\n            \"ro\": 70,\n            \"ru\": 71,\n            \"scn\": 72,\n            \"sco\": 73,\n            \"sh\": 74,\n            \"si\": 75,\n            \"simple\": 76,\n            \"sk\": 77,\n            \"sl\": 78,\n            \"sq\": 79,\n            \"sr\": 80,\n            \"sv\": 81,\n            \"sw\": 82,\n            \"ta\": 83,\n            \"te\": 84,\n            \"th\": 85,\n            \"tl\": 86,\n            \"tr\": 87,\n            \"tt\": 88,\n            \"uk\": 89,\n            \"ur\": 90,\n            \"uz\": 91,\n            \"vi\": 92,\n            \"war\": 93,\n            \"wuu\": 94,\n            \"yi\": 95,\n            \"zh\": 96,\n            \"zh_classical\": 97,\n            \"zh_min_nan\": 98,\n            \"zh_yue\": 99,\n        },\n    },\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef lowercase_and_remove_accent(text):\n    \"\"\"\n    Lowercase and strips accents from a piece of text based on\n    https://github.com/facebookresearch/XLM/blob/master/tools/lowercase_and_remove_accent.py\n    \"\"\"\n    text = \" \".join(text)\n    text = text.lower()\n    text = 
unicodedata.normalize(\"NFD\", text)\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat == \"Mn\":\n            continue\n        output.append(char)\n    return \"\".join(output).lower().split(\" \")\n\n\ndef replace_unicode_punct(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/replace-unicode-punctuation.perl\n    \"\"\"\n    text = text.replace(\"，\", \",\")\n    text = re.sub(r\"。\\s*\", \". \", text)\n    text = text.replace(\"、\", \",\")\n    text = text.replace(\"”\", '\"')\n    text = text.replace(\"“\", '\"')\n    text = text.replace(\"∶\", \":\")\n    text = text.replace(\"：\", \":\")\n    text = text.replace(\"？\", \"?\")\n    text = text.replace(\"《\", '\"')\n    text = text.replace(\"》\", '\"')\n    text = text.replace(\"）\", \")\")\n    text = text.replace(\"！\", \"!\")\n    text = text.replace(\"（\", \"(\")\n    text = text.replace(\"；\", \";\")\n    text = text.replace(\"１\", \"1\")\n    text = text.replace(\"」\", '\"')\n    text = text.replace(\"「\", '\"')\n    text = text.replace(\"０\", \"0\")\n    text = text.replace(\"３\", \"3\")\n    text = text.replace(\"２\", \"2\")\n    text = text.replace(\"５\", \"5\")\n    text = text.replace(\"６\", \"6\")\n    text = text.replace(\"９\", \"9\")\n    text = text.replace(\"７\", \"7\")\n    text = text.replace(\"８\", \"8\")\n    text = text.replace(\"４\", \"4\")\n    text = re.sub(r\"．\\s*\", \". \", text)\n    text = text.replace(\"～\", \"~\")\n    text = text.replace(\"’\", \"'\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"━\", \"-\")\n    text = text.replace(\"〈\", \"<\")\n    text = text.replace(\"〉\", \">\")\n    text = text.replace(\"【\", \"[\")\n    text = text.replace(\"】\", \"]\")\n    text = text.replace(\"％\", \"%\")\n    return text\n\n\ndef remove_non_printing_char(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/remove-non-printing-char.perl\n    \"\"\"\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat.startswith(\"C\"):\n            continue\n        output.append(char)\n    return \"\".join(output)\n\n\ndef romanian_preprocessing(text):\n    \"\"\"Sennrich's WMT16 scripts for Romanian preprocessing, used by model `xlm-mlm-enro-1024`\"\"\"\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/normalise-romanian.py\n    text = text.replace(\"\\u015e\", \"\\u0218\").replace(\"\\u015f\", \"\\u0219\")\n    text = text.replace(\"\\u0162\", \"\\u021a\").replace(\"\\u0163\", \"\\u021b\")\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/remove-diacritics.py\n    text = text.replace(\"\\u0218\", \"S\").replace(\"\\u0219\", \"s\")  # s-comma\n    text = text.replace(\"\\u021a\", \"T\").replace(\"\\u021b\", \"t\")  # t-comma\n    text = text.replace(\"\\u0102\", \"A\").replace(\"\\u0103\", \"a\")\n    text = text.replace(\"\\u00C2\", \"A\").replace(\"\\u00E2\", \"a\")\n    text = text.replace(\"\\u00CE\", \"I\").replace(\"\\u00EE\", \"i\")\n    return text\n\n\nclass XLMTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer for XLM\n\n    - Moses preprocessing & tokenization for most supported languages\n    - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP)\n    - (optionally) lower case & normalize all inputs text\n    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add 
additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `lang2id` attribute maps the languages supported by the model with their ids if provided (automatically set for pretrained vocabularies)\n    - `id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            Vocabulary file.\n        merges_file (:obj:`string`):\n            Merges file.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<special1>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<special0>\",\"<special1>\",\"<special2>\",\"<special3>\",\"<special4>\",\"<special5>\",\"<special6>\",\"<special7>\",\"<special8>\",\"<special9>\"]`):\n            List of additional special tokens.\n        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping languages string identifiers to their IDs.\n        id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping language IDs to their string identifiers.\n        do_lowercase_and_remove_accent (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase and remove accents when tokenizing.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<unk>\",\n        bos_token=\"<s>\",\n        sep_token=\"</s>\",\n        pad_token=\"<pad>\",\n        cls_token=\"</s>\",\n        mask_token=\"<special1>\",\n        additional_special_tokens=[\n            \"<special0>\",\n            \"<special1>\",\n            \"<special2>\",\n            \"<special3>\",\n            \"<special4>\",\n            \"<special5>\",\n            \"<special6>\",\n            \"<special7>\",\n            \"<special8>\",\n            \"<special9>\",\n        ],\n        lang2id=None,\n        id2lang=None,\n        do_lowercase_and_remove_accent=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            bos_token=bos_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        # cache of sm.MosesPunctNormalizer instance\n        self.cache_moses_punct_normalizer = dict()\n        # cache of sm.MosesTokenizer instance\n        self.cache_moses_tokenizer = dict()\n        self.lang_with_custom_tokenizer = set([\"zh\", \"th\", \"ja\"])\n        # True for current supported model (v1.2.0), False for XLM-17 & 100\n        self.do_lowercase_and_remove_accent = do_lowercase_and_remove_accent\n        self.lang2id = lang2id\n        self.id2lang = id2lang\n        if lang2id is not None and id2lang is not None:\n            assert len(lang2id) == len(id2lang)\n\n        self.ja_word_tokenizer = None\n        self.zh_word_tokenizer = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[:-1]\n        merges = [tuple(merge.split()[:2]) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    def moses_punct_norm(self, text, lang):\n        if lang not in self.cache_moses_punct_normalizer:\n            punct_normalizer = sm.MosesPunctNormalizer(lang=lang)\n            self.cache_moses_punct_normalizer[lang] = punct_normalizer\n        else:\n            punct_normalizer 
= self.cache_moses_punct_normalizer[lang]\n        return punct_normalizer.normalize(text)\n\n    def moses_tokenize(self, text, lang):\n        if lang not in self.cache_moses_tokenizer:\n            moses_tokenizer = sm.MosesTokenizer(lang=lang)\n            self.cache_moses_tokenizer[lang] = moses_tokenizer\n        else:\n            moses_tokenizer = self.cache_moses_tokenizer[lang]\n        return moses_tokenizer.tokenize(text, return_str=False, escape=False)\n\n    def moses_pipeline(self, text, lang):\n        text = replace_unicode_punct(text)\n        text = self.moses_punct_norm(text, lang)\n        text = remove_non_printing_char(text)\n        return text\n\n    def ja_tokenize(self, text):\n        if self.ja_word_tokenizer is None:\n            try:\n                import Mykytea\n\n                self.ja_word_tokenizer = Mykytea.Mykytea(\n                    \"-model %s/local/share/kytea/model.bin\" % os.path.expanduser(\"~\")\n                )\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install KyTea (https://github.com/neubig/kytea) and it's python wrapper (https://github.com/chezou/Mykytea-python) with the following steps\"\n                )\n                logger.error(\"1. git clone git@github.com:neubig/kytea.git && cd kytea\")\n                logger.error(\"2. autoreconf -i\")\n                logger.error(\"3. ./configure --prefix=$HOME/local\")\n                logger.error(\"4. make && make install\")\n                logger.error(\"5. pip install kytea\")\n                raise\n        return list(self.ja_word_tokenizer.getWS(text))\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text, lang=\"en\", bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code. For Chinese, Japanese and Thai, we use a language specific tokenizerself. 
Otherwise, we use Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n        - [pythainlp](https://github.com/PyThaiNLP/pythainlp): Thai tokenizer\n            - Install with `pip install pythainlp`\n        - [kytea](https://github.com/chezou/Mykytea-python): Japanese tokenizer, wrapper of [KyTea](https://github.com/neubig/kytea)\n            - Install with the following steps:\n            ```\n            git clone git@github.com:neubig/kytea.git && cd kytea\n            autoreconf -i\n            ./configure --prefix=$HOME/local\n            make && make install\n            pip install kytea\n            ```\n        - [jieba](https://github.com/fxsjy/jieba): Chinese tokenizer (*)\n            - Install with `pip install jieba`\n\n        (*) The original XLM used [Stanford Segmenter](https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip).\n        However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead, and it will be deprecated.\n        Jieba is a lot faster and pip-installable. Note there is some mismatch with the Stanford Segmenter. It should be fine\n        if you fine-tune the model with Chinese supervisionself. If you want the same exact behaviour, use the original XLM\n        [preprocessing script](https://github.com/facebookresearch/XLM/tree/master/tools) to tokenize the sentence externally,\n        and set `bypass_tokenizer=True` to bypass the tokenizer.\n\n        Args:\n            - lang: ISO language code (default = 'en') (string). Languages should belong of the model supported languages. However, we don't enforce it.\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n        if bypass_tokenizer:\n            text = text.split()\n        elif lang not in self.lang_with_custom_tokenizer:\n            text = self.moses_pipeline(text, lang=lang)\n            # TODO: make sure we are using `xlm-mlm-enro-1024`, since XLM-100 doesn't have this step\n            if lang == \"ro\":\n                text = romanian_preprocessing(text)\n            text = self.moses_tokenize(text, lang=lang)\n        elif lang == \"th\":\n            text = self.moses_pipeline(text, lang=lang)\n            try:\n                if \"pythainlp\" not in sys.modules:\n                    from pythainlp.tokenize import word_tokenize as th_word_tokenize\n                else:\n                    th_word_tokenize = sys.modules[\"pythainlp\"].word_tokenize\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install PyThaiNLP (https://github.com/PyThaiNLP/pythainlp) with the following steps\"\n                )\n                logger.error(\"1. 
pip install pythainlp\")\n                raise\n            text = th_word_tokenize(text)\n        elif lang == \"zh\":\n            try:\n                if \"jieba\" not in sys.modules:\n                    import jieba\n                else:\n                    jieba = sys.modules[\"jieba\"]\n            except (AttributeError, ImportError):\n                logger.error(\"Make sure you install Jieba (https://github.com/fxsjy/jieba) with the following steps\")\n                logger.error(\"1. pip install jieba\")\n                raise\n            text = \" \".join(jieba.cut(text))\n            text = self.moses_pipeline(text, lang=lang)\n            text = text.split()\n        elif lang == \"ja\":\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.ja_tokenize(text)\n        else:\n            raise ValueError(\"It should not reach here\")\n\n        if self.do_lowercase_and_remove_accent and not bypass_tokenizer:\n            text = lowercase_and_remove_accent(text)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n\n        \"\"\"\n        bos = [self.bos_token_id]\n        sep = [self.sep_token_id]\n\n        if token_ids_1 is None:\n            return bos + token_ids_0 + sep\n        return bos + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0,))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLM sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != 
token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for XLM-RoBERTa model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-sentencepiece.bpe.model\",\n        \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-roberta-base\": 512,\n    \"xlm-roberta-large\": 512,\n    \"xlm-roberta-large-finetuned-conll02-dutch\": 512,\n    \"xlm-roberta-large-finetuned-conll02-spanish\": 512,\n    \"xlm-roberta-large-finetuned-conll03-english\": 512,\n    \"xlm-roberta-large-finetuned-conll03-german\": 512,\n}\n\n\nclass XLMRobertaTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. 
note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n\n        # Original fairseq vocab and spm vocab must be \"aligned\":\n        # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9\n        # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----\n        # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' 
| '▁' | 's'   | '▁de' | '-'\n        # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'\n\n        # Mimic fairseq token-to-id alignment for the first 4 token\n        self.fairseq_tokens_to_ids = {\"<s>\": 0, \"<pad>\": 1, \"</s>\": 2, \"<unk>\": 3}\n\n        # The first \"real\" token \",\" has position 4 in the original fairseq vocab and position 3 in the spm vocab\n        self.fairseq_offset = 1\n\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + self.fairseq_offset\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM-R sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        XLM-R does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        spm_id = self.sp_model.PieceToId(token)\n\n        # Need to return unknown token if the SP model returned 0\n        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/tokenization_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for XLNet model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model\",\n        \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlnet-base-cased\": None,\n    \"xlnet-large-cased\": None,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n# Segments (not really needed)\nSEG_ID_A = 0\nSEG_ID_B = 1\nSEG_ID_CLS = 2\nSEG_ID_SEP = 3\nSEG_ID_PAD = 4\n\n\nclass XLNetTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"<sep>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"<cls>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<eop>\", \"<eod>\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    padding_side = \"left\"\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        sep_token=\"<sep>\",\n        pad_token=\"<pad>\",\n        cls_token=\"<cls>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<eop>\", \"<eod>\"],\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        self._pad_token_type_id = 3\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = 
self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. \"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An XLNet sequence has the following format:\n\n        - single sequence: ``X <sep> <cls>``\n        - pair of sequences: ``A <sep> B <sep> <cls>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return token_ids_0 + sep + cls\n        return token_ids_0 + sep + token_ids_1 + sep + cls\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1, 1]\n        return ([0] * len(token_ids_0)) + [1, 1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLNet sequence pair mask has the following format:\n        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2\n        | first sequence    | second sequence     | CLS segment ID\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls_segment_id = [2]\n\n        if token_ids_1 is None:\n            return len(token_ids_0 + sep) * [0] + cls_segment_id\n        return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/trainer.py",
    "content": "import json\nimport logging\nimport math\nimport os\nimport random\nimport re\nimport shutil\nfrom contextlib import contextmanager\nfrom pathlib import Path\nfrom typing import Callable, Dict, List, Optional, Tuple\nimport time\nimport numpy as np\nimport torch\nfrom packaging import version\nfrom torch import nn\nfrom torch.utils.data.dataloader import DataLoader\nfrom torch.utils.data.dataset import Dataset\nfrom torch.utils.data.distributed import DistributedSampler\nfrom torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler\nfrom tqdm.auto import tqdm, trange\n\nfrom .data.data_collator import DataCollator, DefaultDataCollator\nfrom transformers.modeling_utils import PreTrainedModel\nfrom .optimization import AdamW\nfrom transformers import get_polynomial_decay_schedule_with_warmup#需要新版才有\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput\nfrom .training_args import TrainingArguments, is_tpu_available\n\n\ntry:\n    from apex import amp\n\n    _has_apex = True\nexcept ImportError:\n    _has_apex = False\n\n\ndef is_apex_available():\n    return _has_apex\n\n\nif is_tpu_available():\n    import torch_xla.core.xla_model as xm\n    import torch_xla.debug.metrics as met\n    import torch_xla.distributed.parallel_loader as pl\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\n\n    _has_tensorboard = True\nexcept ImportError:\n    try:\n        from tensorboardX import SummaryWriter\n\n        _has_tensorboard = True\n    except ImportError:\n        _has_tensorboard = False\n\n\ndef is_tensorboard_available():\n    return _has_tensorboard\n\n\ntry:\n    import wandb\n\n    wandb.ensure_configured()\n    if wandb.api.api_key is None:\n        _has_wandb = False\n        wandb.termwarn(\"W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.\")\n    else:\n        _has_wandb = False if os.getenv(\"WANDB_DISABLED\") else True\nexcept ImportError:\n    _has_wandb = False\n\n\ndef is_wandb_available():\n    return _has_wandb\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef set_seed(seed: int):\n    random.seed(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    torch.cuda.manual_seed_all(seed)\n    # ^^ safe to call this function even if cuda is not available\n\n\n@contextmanager\ndef torch_distributed_zero_first(local_rank: int):\n    \"\"\"\n    Decorator to make all processes in distributed training wait for each local_master to do something.\n    \"\"\"\n    if local_rank not in [-1, 0]:\n        torch.distributed.barrier()\n    yield\n    if local_rank == 0:\n        torch.distributed.barrier()\n\n\nclass SequentialDistributedSampler(Sampler):\n    \"\"\"\n    Distributed Sampler that subsamples indicies sequentially,\n    making it easier to collate all results at the end.\n\n    Even though we only use this sampler for eval and predict (no training),\n    which means that the model params won't have to be synced (i.e. 
will not hang\n    for synchronization even if varied number of forward passes), we still add extra\n    samples to the sampler to make it evenly divisible (like in `DistributedSampler`)\n    to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.\n    \"\"\"\n\n    def __init__(self, dataset, num_replicas=None, rank=None):\n        if num_replicas is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            num_replicas = torch.distributed.get_world_size()\n        if rank is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            rank = torch.distributed.get_rank()\n        self.dataset = dataset\n        self.num_replicas = num_replicas\n        self.rank = rank\n        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))\n        self.total_size = self.num_samples * self.num_replicas\n\n    def __iter__(self):\n        indices = list(range(len(self.dataset)))\n\n        # add extra samples to make it evenly divisible\n        indices += indices[: (self.total_size - len(indices))]\n        assert len(indices) == self.total_size\n\n        # subsample\n        indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]\n        assert len(indices) == self.num_samples\n\n        return iter(indices)\n\n    def __len__(self):\n        return self.num_samples\n\n\ndef get_tpu_sampler(dataset: Dataset):\n    if xm.xrt_world_size() <= 1:\n        return RandomSampler(dataset)\n    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())\n\n\nclass Trainer:\n    \"\"\"\n    Trainer is a simple but feature-complete training and eval loop for PyTorch,\n    optimized for Transformers.\n    \"\"\"\n\n    model: PreTrainedModel\n    args: TrainingArguments\n    train_dataset: Optional[Dataset]\n    eval_dataset: Optional[Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n    tb_writer: Optional[\"SummaryWriter\"] = None\n    optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None\n    global_step: Optional[int] = None\n    epoch: Optional[float] = None\n\n    def __init__(\n        self,\n        model: PreTrainedModel,\n        args: TrainingArguments,\n        train_dataLoader: Optional[DataLoader] = None,\n        eval_dataLoader: Optional[DataLoader] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n        tb_writer: Optional[\"SummaryWriter\"] = None,\n        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,\n    ):\n        \"\"\"\n        Trainer is a simple but feature-complete training and eval loop for PyTorch,\n        optimized for Transformers.\n\n        Args:\n            prediction_loss_only:\n                (Optional) in evaluation and prediction, only return the loss\n        \"\"\"\n        self.model = model.to(args.device)\n        self.args = args\n\n        self.train_dataLoader = train_dataLoader\n        self.eval_dataLoader = eval_dataLoader\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.optimizers = optimizers\n        if tb_writer is not None:\n            self.tb_writer = tb_writer\n        
elif is_tensorboard_available() and self.is_world_master():\n            self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)\n        if not is_tensorboard_available():\n            logger.warning(\n                \"You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.\"\n            )\n        if is_wandb_available():\n            self._setup_wandb()\n        else:\n            logger.info(\n                \"You are instantiating a Trainer but W&B is not installed. To use wandb logging, \"\n                \"run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.\"\n            )\n        set_seed(self.args.seed)\n        # Create output directory if needed\n        if self.is_world_master():\n            os.makedirs(self.args.output_dir, exist_ok=True)\n        if is_tpu_available():\n            # Set an xla_device flag on the model's config.\n            # We'll find a more elegant and not need to do this in the future.\n            self.model.config.xla_device = True\n\n    def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:\n        # We use the same batch_size as for eval.\n        if is_tpu_available():\n            sampler = SequentialDistributedSampler(\n                test_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()\n            )\n        elif self.args.local_rank != -1:\n            sampler = SequentialDistributedSampler(test_dataset)\n        else:\n            sampler = SequentialSampler(test_dataset)\n\n        data_loader = DataLoader(\n            test_dataset,\n            sampler=sampler,\n            batch_size=self.args.eval_batch_size,\n\n        )\n\n        return data_loader\n\n    def get_optimizers(\n        self, num_training_steps: int\n    ) -> Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]:\n        \"\"\"\n        Setup the optimizer and the learning rate scheduler.\n\n        We provide a reasonable default that works well.\n        If you want to use something else, you can pass a tuple in the Trainer's init,\n        or override this method in a subclass.\n        \"\"\"\n        if self.optimizers is not None:\n            return self.optimizers\n        # Prepare optimizer and schedule (linear warmup and decay)\n        no_decay = [\"bias\", \"LayerNorm.weight\"]\n        optimizer_grouped_parameters = [\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],\n                \"weight_decay\": self.args.weight_decay,\n            },\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],\n                \"weight_decay\": 0.0,\n            },\n        ]\n\n        optimizer = AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon)\n        scheduler = get_polynomial_decay_schedule_with_warmup(\n            optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=num_training_steps,lr_end=self.args.lr_end\n        )\n        return optimizer, scheduler\n\n    def _setup_wandb(self):\n        \"\"\"\n        Setup the optional Weights & Biases (`wandb`) integration.\n\n        One can override this method to customize the setup if needed.  
Find more information at https://docs.wandb.com/huggingface\n        You can also override the following environment variables:\n\n        Environment:\n            WANDB_WATCH:\n                (Optional, [\"gradients\", \"all\", \"false\"]) \"gradients\" by default, set to \"false\" to disable gradient logging\n                or \"all\" to log gradients and parameters\n            WANDB_PROJECT:\n                (Optional): str - \"huggingface\" by default, set this to a custom string to store results in a different project\n            WANDB_DISABLED:\n                (Optional): boolean - defaults to false, set to \"true\" to disable wandb entirely\n        \"\"\"\n        logger.info('Automatic Weights & Biases logging enabled, to disable set os.environ[\"WANDB_DISABLED\"] = \"true\"')\n        wandb.init(project=os.getenv(\"WANDB_PROJECT\", \"huggingface\"), config=vars(self.args))\n        # keep track of model topology and gradients\n        if os.getenv(\"WANDB_WATCH\") != \"false\":\n            wandb.watch(\n                self.model, log=os.getenv(\"WANDB_WATCH\", \"gradients\"), log_freq=max(100, self.args.logging_steps)\n            )\n\n    def num_examples(self, dataloader: DataLoader) -> int:\n        \"\"\"\n        Helper to get num of examples from a DataLoader, by accessing its Dataset.\n        \"\"\"\n        return len(dataloader.dataset)\n\n    def train(self, model_path: Optional[str] = None):\n        \"\"\"\n        Main training entry point.\n\n        Args:\n            model_path:\n                (Optional) Local path to model if model to train has been instantiated from a local path\n                If present, we will try reloading the optimizer/scheduler states from there.\n        \"\"\"\n        train_dataloader = self.train_dataLoader\n        if self.args.max_steps > 0:\n            t_total = self.args.max_steps\n            num_train_epochs = (\n                self.args.max_steps // (len(train_dataloader) // self.args.gradient_accumulation_steps) + 1\n            )\n        else:\n            t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)\n            num_train_epochs = self.args.num_train_epochs\n\n        optimizer, scheduler = self.get_optimizers(num_training_steps=t_total)\n\n        # Check if saved optimizer or scheduler states exist\n        if (\n            model_path is not None\n            and os.path.isfile(os.path.join(model_path, \"optimizer.pt\"))\n            and os.path.isfile(os.path.join(model_path, \"scheduler.pt\"))\n        ):\n            # Load in optimizer and scheduler states\n            optimizer.load_state_dict(\n                torch.load(os.path.join(model_path, \"optimizer.pt\"), map_location=self.args.device)\n            )\n            scheduler.load_state_dict(torch.load(os.path.join(model_path, \"scheduler.pt\")))\n\n        model = self.model\n        if self.args.fp16:\n            if not is_apex_available():\n                raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n            model, optimizer = amp.initialize(model, optimizer, opt_level=self.args.fp16_opt_level)\n\n        # multi-gpu training (should be after apex fp16 initialization)\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n\n        # Distributed training (should be after apex fp16 initialization)\n        if self.args.local_rank != -1:\n            model = 
torch.nn.parallel.DistributedDataParallel(\n                model,\n                device_ids=[self.args.local_rank],\n                output_device=self.args.local_rank,\n                find_unused_parameters=True,\n            )\n\n        if self.tb_writer is not None:\n            self.tb_writer.add_text(\"args\", self.args.to_json_string())\n            self.tb_writer.add_hparams(self.args.to_sanitized_dict(), metric_dict={})\n\n        # Train!\n        if is_tpu_available():\n            total_train_batch_size = self.args.train_batch_size * xm.xrt_world_size()\n        else:\n            total_train_batch_size = (\n                self.args.train_batch_size\n                * self.args.gradient_accumulation_steps\n                * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)\n            )\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_examples(train_dataloader))\n        logger.info(\"  Num Epochs = %d\", num_train_epochs)\n        logger.info(\"  Instantaneous batch size per device = %d\", self.args.per_device_train_batch_size)\n        logger.info(\"  Total train batch size (w. parallel, distributed & accumulation) = %d\", total_train_batch_size)\n        logger.info(\"  Gradient Accumulation steps = %d\", self.args.gradient_accumulation_steps)\n        logger.info(\"  Total optimization steps = %d\", t_total)\n\n        self.global_step = 0\n        self.epoch = 0\n        epochs_trained = 0\n        steps_trained_in_current_epoch = 0\n        # Check if continuing training from a checkpoint\n        if model_path is not None:\n            # set global_step to global_step of last saved checkpoint from model path\n            try:\n                self.global_step = int(model_path.split(\"-\")[-1].split(\"/\")[0])\n                epochs_trained = self.global_step // (len(train_dataloader) // self.args.gradient_accumulation_steps)\n                steps_trained_in_current_epoch = self.global_step % (\n                    len(train_dataloader) // self.args.gradient_accumulation_steps\n                )\n\n                logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n                logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n                logger.info(\"  Continuing training from global step %d\", self.global_step)\n                logger.info(\"  Will skip the first %d steps in the first epoch\", steps_trained_in_current_epoch)\n            except ValueError:\n                self.global_step = 0\n                logger.info(\"  Starting fine-tuning.\")\n\n        tr_loss = 0.0\n        logging_loss = 0.0\n        tqdmLoss=0#进度条的loss用滑动平均显示\n        beta_exp=1\n        model.zero_grad()\n        train_iterator = trange(\n            epochs_trained, int(num_train_epochs), desc=\"Epoch\", disable=True\n        )\n        for epoch in train_iterator:\n            last=time.time()\n            if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):\n                train_dataloader.sampler.set_epoch(epoch)\n\n            if is_tpu_available():\n                parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(\n                    self.args.device\n                )\n                epoch_iterator = tqdm(parallel_loader, desc=\"Iteration\", disable=not self.is_local_master())\n            else:\n                epoch_iterator = 
tqdm(train_dataloader, desc=\"Iteration\", disable=True,ncols=70)#固定下长度，不然要换行\n\n            for step, inputs in enumerate(epoch_iterator):\n\n                # Skip past any already trained steps if resuming training\n                if steps_trained_in_current_epoch > 0:\n                    steps_trained_in_current_epoch -= 1\n                    continue\n                now_loss=self._training_step(model, inputs, optimizer)\n                tr_loss += now_loss\n                #丰富进度条\n                tqdmLoss=tqdmLoss*0.99+(1-0.99)*now_loss#滑动平均下\n                beta_exp*=0.99#校正\n\n                epoch_iterator.set_description_str(f\"epoch：{epoch+1}\")\n                epoch_iterator.set_postfix_str(f\"loss：{round(tqdmLoss/(1-beta_exp),4)}\")\n                if (step + 1) % self.args.gradient_accumulation_steps == 0 or (\n                    # last step in epoch but step is always smaller than gradient_accumulation_steps\n                    len(epoch_iterator) <= self.args.gradient_accumulation_steps\n                    and (step + 1) == len(epoch_iterator)\n                ):\n                    if self.args.fp16:\n                        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), self.args.max_grad_norm)\n                    else:\n                        torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)\n\n                    if is_tpu_available():\n                        xm.optimizer_step(optimizer)\n                    else:\n                        optimizer.step()\n\n                    scheduler.step()\n                    model.zero_grad()\n                    self.global_step += 1\n                    self.epoch = epoch + (step + 1) / len(epoch_iterator)\n\n                    if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (\n                        self.global_step == 1 and self.args.logging_first_step\n                    ):\n                        logs: Dict[str, float] = {}\n                        logs[\"loss\"] = (tr_loss - logging_loss) / self.args.logging_steps\n                        # backward compatibility for pytorch schedulers\n                        logs[\"learning_rate\"] = (\n                            scheduler.get_last_lr()[0]\n                            if version.parse(torch.__version__) >= version.parse(\"1.4\")\n                            else scheduler.get_lr()[0]\n                        )\n                        logging_loss = tr_loss\n                        print()#log前要换行，不然和进度条挤在一起\n                        self._log(logs)\n                        print()\n                        if self.args.evaluate_during_training:\n                            self.evaluate()\n\n                    if self.args.save_steps > 0 and self.global_step % self.args.save_steps==0:\n                        # In all cases (even distributed/parallel), self.model is always a reference\n                        # to the model we want to save.\n                        if hasattr(model, \"module\"):\n                            assert model.module is self.model\n                        else:\n                            assert model is self.model\n                        # Save model checkpoint\n                        output_dir = os.path.join(self.args.output_dir, f\"{PREFIX_CHECKPOINT_DIR}-{self.global_step}-epoch-{int(self.epoch)}\")\n\n                        self.save_model(output_dir)\n\n                        if self.is_world_master():\n                            
self._rotate_checkpoints()\n\n                        if is_tpu_available():\n                            xm.rendezvous(\"saving_optimizer_states\")\n                            xm.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            xm.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n                        elif self.is_world_master():\n                            torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n\n                if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                    epoch_iterator.close()\n                    break\n            print(f\"预训练第{epoch}轮耗时：\",time.time()-last)\n            if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                train_iterator.close()\n                break\n            if self.args.tpu_metrics_debug:\n                # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n                xm.master_print(met.metrics_report())\n        if self.tb_writer:\n            self.tb_writer.close()\n\n        logger.info(\"\\n\\nTraining completed. Do not forget to share your model on huggingface.co/models =)\\n\\n\")\n        return TrainOutput(self.global_step, tr_loss / self.global_step)\n\n    def _log(self, logs: Dict[str, float], iterator: Optional[tqdm] = None) -> None:\n        if self.epoch is not None:\n            logs[\"epoch\"] = self.epoch\n        if self.tb_writer:\n            for k, v in logs.items():\n                self.tb_writer.add_scalar(k, v, self.global_step)\n        if is_wandb_available():\n            wandb.log(logs, step=self.global_step)\n        output = json.dumps({**logs, **{\"step\": self.global_step}})\n        if iterator is not None:\n            iterator.write(output)\n        else:\n            print(output)\n\n    def _training_step(\n        self, model: nn.Module, inputs: Dict[str, torch.Tensor], optimizer: torch.optim.Optimizer\n    ) -> float:\n        model.train()\n        for k, v in inputs.items():\n            inputs[k] = v.to(self.args.device)\n\n        outputs = model(**inputs)\n        loss = outputs[0]  # model outputs are always tuple in transformers1 (see doc)\n\n        if self.args.n_gpu > 1:\n            loss = loss.mean()  # mean() to average on multi-gpu parallel training\n        if self.args.gradient_accumulation_steps > 1:\n            loss = loss / self.args.gradient_accumulation_steps\n\n        if self.args.fp16:\n            with amp.scale_loss(loss, optimizer) as scaled_loss:\n                scaled_loss.backward()\n        else:\n            loss.backward()\n\n        return loss.item()\n\n    def is_local_master(self) -> bool:\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=True)\n        else:\n            return self.args.local_rank in [-1, 0]\n\n    def is_world_master(self) -> bool:\n        \"\"\"\n        This will be True only in one process, even in distributed mode,\n        even when training on multiple machines.\n        \"\"\"\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=False)\n        else:\n            return self.args.local_rank == -1 or torch.distributed.get_rank() == 0\n\n    def save_model(self, output_dir: Optional[str] = None):\n        \"\"\"\n        Saving best-practices: if you use 
default names for the model,\n        you can reload it using from_pretrained().\n\n        Will only save from the world_master process (unless in TPUs).\n        \"\"\"\n\n        if is_tpu_available():\n            self._save_tpu(output_dir)\n        elif self.is_world_master():\n            self._save(output_dir)\n\n    def _save_tpu(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n        if xm.is_master_ordinal():\n            os.makedirs(output_dir, exist_ok=True)\n            torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n\n        xm.rendezvous(\"saving_checkpoint\")\n        self.model.save_pretrained(output_dir)\n\n    def _save(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        os.makedirs(output_dir, exist_ok=True)\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n        self.model.save_pretrained(output_dir)\n\n        # Good practice: save your training arguments together with the trained model\n        torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n    def _sorted_checkpoints(self, checkpoint_prefix=PREFIX_CHECKPOINT_DIR, use_mtime=False) -> List[str]:\n        ordering_and_checkpoint_path = []\n\n        glob_checkpoints = [str(x) for x in Path(self.args.output_dir).glob(f\"{checkpoint_prefix}-*\")]\n\n        for path in glob_checkpoints:\n            if use_mtime:\n                ordering_and_checkpoint_path.append((os.path.getmtime(path), path))\n            else:\n                regex_match = re.match(f\".*{checkpoint_prefix}-([0-9]+)\", path)\n                if regex_match and regex_match.groups():\n                    ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))\n\n        checkpoints_sorted = sorted(ordering_and_checkpoint_path)\n        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]\n        return checkpoints_sorted\n\n    def _rotate_checkpoints(self, use_mtime=False) -> None:\n        if self.args.save_total_limit is None or self.args.save_total_limit <= 0:\n            return\n\n        # Check if we should delete older checkpoint(s)\n        checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime)\n        if len(checkpoints_sorted) <= self.args.save_total_limit:\n            return\n\n        number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - self.args.save_total_limit)\n        checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]\n        for checkpoint in checkpoints_to_be_deleted:\n            curEpoch = checkpoint.split('-')[-1]\n            print(checkpoint,curEpoch)\n            if int(curEpoch) % 50 == 0:\n                continue\n            logger.info(\"Deleting older checkpoint [{}] 
due to args.save_total_limit\".format(checkpoint))\n            shutil.rmtree(checkpoint)\n\n    def evaluate(\n        self, eval_dataset: Optional[Dataset] = None, prediction_loss_only: Optional[bool] = None,\n    ) -> Dict[str, float]:\n        \"\"\"\n        Run evaluation and return metrics.\n\n        The calling script will be responsible for providing a method to compute metrics, as they are\n        task-dependent.\n\n        Args:\n            eval_dataset: (Optional) Pass a dataset if you wish to override\n            the one on the instance.\n        Returns:\n            A dict containing:\n                - the eval loss\n                - the potential metrics computed from the predictions\n        \"\"\"\n        eval_dataloader = self.eval_dataLoader\n\n        output = self._prediction_loop(eval_dataloader, description=\"Evaluation\")\n\n        self._log(output.metrics)\n\n        if self.args.tpu_metrics_debug:\n            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n            xm.master_print(met.metrics_report())\n\n        return output.metrics\n\n    def predict(self, test_dataset: Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        \"\"\"\n        test_dataloader = self.get_test_dataloader(test_dataset)\n\n        return self._prediction_loop(test_dataloader, description=\"Prediction\")\n\n    def _prediction_loop(\n        self, dataloader: DataLoader, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n\n        Works both with or without labels.\n        \"\"\"\n\n        prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else self.prediction_loss_only\n\n        model = self.model\n        # multi-gpu eval\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n        else:\n            model = self.model\n        # Note: in torch.distributed mode, there's no point in wrapping the model\n        # inside a DistributedDataParallel as we'll be under `no_grad` anyways.\n\n        batch_size = dataloader.batch_size\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Num examples = %d\", self.num_examples(dataloader))\n        logger.info(\"  Batch size = %d\", batch_size)\n        eval_losses: List[float] = []\n        preds: torch.Tensor = None\n        label_ids: torch.Tensor = None\n        model.eval()\n\n        if is_tpu_available():\n            dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)\n\n        for inputs in tqdm(dataloader, desc=description):\n            has_labels = any(inputs.get(k) is not None for k in [\"labels\", \"lm_labels\", \"masked_lm_labels\"])\n\n            for k, v in inputs.items():\n                inputs[k] = v.to(self.args.device)\n\n            with torch.no_grad():\n                outputs = model(**inputs)\n                if has_labels:\n                    step_eval_loss, logits = outputs[:2]\n                    eval_losses += [step_eval_loss.mean().item()]\n                else:\n                    logits = outputs[0]\n\n            if not 
prediction_loss_only:\n                if preds is None:\n                    preds = logits.detach()\n                else:\n                    preds = torch.cat((preds, logits.detach()), dim=0)\n                if inputs.get(\"labels\") is not None:\n                    if label_ids is None:\n                        label_ids = inputs[\"labels\"].detach()\n                    else:\n                        label_ids = torch.cat((label_ids, inputs[\"labels\"].detach()), dim=0)\n\n        if self.args.local_rank != -1:\n            # In distributed mode, concatenate all results from all nodes:\n            if preds is not None:\n                preds = self.distributed_concat(preds, num_total_examples=self.num_examples(dataloader))\n            if label_ids is not None:\n                label_ids = self.distributed_concat(label_ids, num_total_examples=self.num_examples(dataloader))\n        elif is_tpu_available():\n            # tpu-comment: Get all predictions and labels from all worker shards of eval dataset\n            if preds is not None:\n                preds = xm.mesh_reduce(\"eval_preds\", preds, torch.cat)\n            if label_ids is not None:\n                label_ids = xm.mesh_reduce(\"eval_label_ids\", label_ids, torch.cat)\n\n        # Finally, turn the aggregated tensors into numpy arrays.\n        if preds is not None:\n            preds = preds.cpu().numpy()\n        if label_ids is not None:\n            label_ids = label_ids.cpu().numpy()\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n        if len(eval_losses) > 0:\n            metrics[\"eval_loss\"] = np.mean(eval_losses)\n\n        # Prefix all keys with eval_\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def distributed_concat(self, tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:\n        assert self.args.local_rank != -1\n\n        output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]\n        torch.distributed.all_gather(output_tensors, tensor)\n\n        concat = torch.cat(output_tensors, dim=0)\n\n        # truncate the dummy elements added by SequentialDistributedSampler\n        output = concat[:num_total_examples]\n        return output\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/trainer_tf.py",
    "content": "\"\"\"Tensorflow trainer class.\"\"\"\n\nimport logging\nimport math\nimport os\nfrom typing import Callable, Dict, Optional\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import TFPreTrainedModel, shape_list\nfrom .optimization_tf import GradientAccumulator, create_optimizer\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFTrainer:\n    model: TFPreTrainedModel\n    args: TFTrainingArguments\n    # something similar to a PT Dataset.\n    # This is just temporary before to have\n    # a framework-agnostic approach for datasets.\n    train_dataset: Optional[tf.data.Dataset]\n    eval_dataset: Optional[tf.data.Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n\n    def __init__(\n        self,\n        model: TFPreTrainedModel,\n        args: TFTrainingArguments,\n        train_dataset: Optional[tf.data.Dataset] = None,\n        eval_dataset: Optional[tf.data.Dataset] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n    ):\n        self.model = model\n        self.args = args\n        self.train_dataset = train_dataset\n        self.eval_dataset = eval_dataset\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.gradient_accumulator = GradientAccumulator()\n\n        self._setup_training()\n\n    def _setup_training(self) -> None:\n        \"\"\"\n        Setup the different steps to train a model:\n          - check if all the data are given\n          - create the proper strategy\n          - create the features\n          - prepare the model settings\n        \"\"\"\n        self._prepare_dataset()\n\n        with self.args.strategy.scope():\n            self._create_optimizer()\n            _ = self.optimizer.iterations\n            self._set_loss_and_metric()\n            self._create_checkpoint_manager()\n            self._create_summary_writer()\n\n    def _set_loss_and_metric(self) -> None:\n        \"\"\"\n        Create the training loss and metric with their name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        try:\n            self.loss = tf.keras.losses.get(\n                {\n                    \"class_name\": self.args.loss_name,\n                    \"config\": {\"from_logits\": True, \"reduction\": tf.keras.losses.Reduction.NONE},\n                }\n            )\n        except TypeError:\n            self.loss = tf.keras.losses.get(\n                {\"class_name\": self.args.loss_name, \"config\": {\"reduction\": tf.keras.losses.Reduction.NONE}}\n            )\n\n    def _create_summary_writer(self) -> None:\n        \"\"\"\n        Create a summary writer to be able to read the logs in Tensorboard.\n        \"\"\"\n        self.writer = tf.summary.create_file_writer(self.args.logging_dir)\n\n    def _prepare_dataset(self) -> None:\n        \"\"\"\n        Prepare the training, validation and test data.\n        \"\"\"\n        if self.train_dataset is not None:\n            self.num_train_examples = self.train_dataset.reduce(tf.constant(0), lambda x, _: x + 1).numpy()\n\n            if self.args.max_steps > 0:\n                self.train_steps = self.args.max_steps\n            else:\n                self.train_steps: int = math.ceil(self.num_train_examples / self.args.train_batch_size)\n\n            self.train_dataset = (\n                self.train_dataset.cache()\n                .shuffle(self.num_train_examples)\n                .batch(self.args.train_batch_size)\n                .prefetch(tf.data.experimental.AUTOTUNE)\n            )\n\n            if self.args.max_steps > 0:\n                self.train_dataset = self.train_dataset.repeat(-1)\n\n            self.train_dataset = self.args.strategy.experimental_distribute_dataset(self.train_dataset)\n        else:\n            self.train_steps = 0\n\n        if self.eval_dataset is not None:\n            self.eval_dataset = (\n                self.eval_dataset.batch(self.args.eval_batch_size).cache().prefetch(tf.data.experimental.AUTOTUNE)\n            )\n            self.eval_dataset = self.args.strategy.experimental_distribute_dataset(self.eval_dataset)\n\n    def _create_optimizer(self) -> None:\n        \"\"\"\n        Create the training optimizer with its name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        if self.args.optimizer_name == \"adamw\":\n            self.optimizer = create_optimizer(\n                self.args.learning_rate, self.train_steps, self.args.warmup_steps, self.args.end_lr\n            )\n        else:\n            try:\n                self.optimizer = tf.keras.optimizers.get(\n                    {\n                        \"class_name\": self.args.optimizer_name,\n                        \"config\": {\"learning_rate\": self.args.learning_rate, \"epsilon\": self.args.adam_epsilon},\n                    }\n                )\n            except TypeError:\n                # This is for the case where the optimizer is not Adam-like such as SGD\n                self.optimizer = tf.keras.optimizers.get(\n                    {\"class_name\": self.args.optimizer_name, \"config\": {\"learning_rate\": self.args.learning_rate}}\n                )\n        logger.info(\"Created an/a {} optimizer\".format(self.args.optimizer_name))\n\n    def _create_checkpoint_manager(self, max_to_keep: int = 5, load_model: bool = True) -> None:\n        \"\"\"\n        Create a checkpoint manager in order to be able to make the training\n        fault-tolerant.\n        Args:\n          max_to_keep: the maximum number of checkpoints to keep in the checkpoint path.\n          load_model: if we want to start the training from the latest checkpoint.\n        \"\"\"\n        ckpt = tf.train.Checkpoint(optimizer=self.optimizer, model=self.model)\n\n        self.model.ckpt_manager = tf.train.CheckpointManager(ckpt, PREFIX_CHECKPOINT_DIR, max_to_keep=max_to_keep)\n\n        if load_model:\n            ckpt.restore(self.model.ckpt_manager.latest_checkpoint).expect_partial()\n\n    @tf.function\n    def _evaluate_steps(self, per_replica_features, per_replica_labels):\n        \"\"\"\n        One step evaluation across replica.\n        Args:\n          per_replica_features: the batched features.\n          per_replica_labels: the batched labels.\n        Returns:\n          The loss corresponding to the given batch.\n        \"\"\"\n        per_replica_loss, per_replica_logits = self.args.strategy.experimental_run_v2(\n            self._run_model, args=(per_replica_features, per_replica_labels, False)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss, per_replica_logits\n\n    def _prediction_loop(\n        self, dataset: tf.data.Dataset, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Batch size = %d\", self.args.eval_batch_size)\n\n        label_ids: np.ndarray = None\n        preds: np.ndarray = None\n\n        step: int = 1\n\n        for features, labels in dataset:\n            step = tf.convert_to_tensor(step, dtype=tf.int64)\n            loss, logits = self._evaluate_steps(features, labels)\n            loss = tf.reduce_mean(loss)\n\n            if not prediction_loss_only:\n                if self.args.n_gpu > 1:\n                    for val in logits.values:\n                        if preds is None:\n                            preds = val.numpy()\n                        
else:\n                            preds = np.append(preds, val.numpy(), axis=0)\n\n                    for val in labels.values:\n                        if label_ids is None:\n                            label_ids = val.numpy()\n                        else:\n                            label_ids = np.append(label_ids, val.numpy(), axis=0)\n                else:\n                    if preds is None:\n                        preds = logits.numpy()\n                    else:\n                        preds = np.append(preds, logits.numpy(), axis=0)\n\n                    if label_ids is None:\n                        label_ids = labels.numpy()\n                    else:\n                        label_ids = np.append(label_ids, labels.numpy(), axis=0)\n\n            step += 1\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n\n        metrics[\"eval_loss\"] = loss.numpy()\n\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def evaluate(\n        self, eval_dataset: Optional[tf.data.Dataset] = None, prediction_loss_only: Optional[bool] = None\n    ) -> Dict[str, float]:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n        \"\"\"\n        if eval_dataset is None:\n            eval_dataset = self.eval_dataset\n\n        output = self._prediction_loop(eval_dataset, description=\"Evaluation\")\n\n        return output.metrics\n\n    def train(self) -> None:\n        \"\"\"\n        Train method to train the model.\n        \"\"\"\n        if self.args.debug:\n            tf.summary.trace_on(graph=True, profiler=True)\n\n        self.gradient_accumulator.reset()\n\n        iterations = self.optimizer.iterations\n\n        if iterations.numpy() > 0:\n            logger.info(\"Start the training from the last checkpoint\")\n            start_epoch = (iterations.numpy() // self.train_steps) + 1\n        else:\n            start_epoch = 1\n\n        tf.summary.experimental.set_step(iterations)\n\n        epochs = 1 if self.args.max_steps > 0 else self.args.num_train_epochs\n\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_train_examples)\n        logger.info(\"  Num Epochs = %d\", epochs)\n        logger.info(\"  Total optimization steps = %d\", self.train_steps)\n\n        for epoch in range(start_epoch, int(epochs + 1)):\n            for training_loss in self._training_steps():\n                step = iterations.numpy()\n\n                if self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.scalar(\"loss\", training_loss, step=step)\n\n                if step == 1 and self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.trace_export(name=\"training\", step=step, profiler_outdir=self.args.logging_dir)\n\n                if self.args.evaluate_during_training and step % self.args.eval_steps == 0:\n                    logs = {}\n                    results = self.evaluate()\n\n                    for key, value in results.items():\n                        eval_key = \"eval_{}\".format(key)\n                   
     logs[eval_key] = value\n\n                    if callable(self.optimizer.learning_rate):\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate(step).numpy()\n                    else:\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate.numpy()\n\n                    logger.info(\"Epoch {} Step {} Validation Metrics {}\".format(epoch, step, logs))\n\n                    with self.writer.as_default():\n                        for k, v in logs.items():\n                            tf.summary.scalar(k, v, step=step)\n\n                if step % self.args.logging_steps == 0:\n                    logger.info(\"Epoch {} Step {} Train Loss {:.4f}\".format(epoch, step, training_loss.numpy()))\n\n                if step % self.args.save_steps == 0:\n                    ckpt_save_path = self.model.ckpt_manager.save()\n                    logger.info(\"Saving checkpoint for step {} at {}\".format(step, ckpt_save_path))\n\n                if step % self.train_steps == 0:\n                    break\n\n    def _training_steps(self):\n        \"\"\"\n        Returns a generator over training steps (i.e. parameters update).\n        \"\"\"\n        for i, loss in enumerate(self._accumulate_next_gradients()):\n            if i % self.args.gradient_accumulation_steps == 0:\n                self._apply_gradients()\n                yield loss\n\n    @tf.function\n    def _apply_gradients(self):\n        \"\"\"Applies the gradients (cross-replica).\"\"\"\n        self.args.strategy.experimental_run_v2(self._step)\n\n    def _step(self):\n        \"\"\"Applies gradients and resets accumulation.\"\"\"\n        gradient_scale = self.gradient_accumulator.step * self.args.strategy.num_replicas_in_sync\n        gradients = [\n            gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients\n        ]\n        gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]\n\n        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))\n        self.gradient_accumulator.reset()\n\n    def _accumulate_next_gradients(self):\n        \"\"\"Accumulates the gradients from the next element in dataset.\"\"\"\n        iterator = iter(self.train_dataset)\n\n        @tf.function\n        def _accumulate_next():\n            per_replica_features, per_replica_labels = next(iterator)\n\n            return self._accumulate_gradients(per_replica_features, per_replica_labels)\n\n        while True:\n            try:\n                yield _accumulate_next()\n            except tf.errors.OutOfRangeError:\n                break\n\n    def _accumulate_gradients(self, per_replica_features, per_replica_labels):\n        \"\"\"Accumulates the gradients across all the replica.\"\"\"\n        per_replica_loss = self.args.strategy.experimental_run_v2(\n            self._forward, args=(per_replica_features, per_replica_labels)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss\n\n    def _forward(self, features, labels):\n        \"\"\"Forwards a training example and accumulates the gradients.\"\"\"\n        per_example_loss, _ = self._run_model(features, labels, True)\n        gradients = 
tf.gradients(per_example_loss, self.model.trainable_variables)\n        gradients = [\n            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)\n        ]\n\n        self.gradient_accumulator(gradients)\n\n        return per_example_loss\n\n    def _run_model(self, features, labels, training):\n        \"\"\"\n        Computes the loss of the given features and labels pair.\n        Args:\n          features: the batched features.\n          labels: the batched labels.\n          training: run the model in training mode or not\n        \"\"\"\n        if self.args.mode == \"text-classification\" or self.args.mode == \"token-classification\":\n            logits = self.model(features, training=training)[0]\n        else:\n            logits = self.model(features, training=training)\n\n        if self.args.mode == \"token-classification\":\n            active_loss = tf.reshape(labels, (-1,)) != -1\n            reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)\n            labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active_loss)\n            loss = self.loss(labels, reduced_logits)\n        elif self.args.mode == \"question-answering\":\n            start_loss = self.loss(labels[\"start_position\"], logits[0])\n            end_loss = self.loss(labels[\"end_position\"], logits[1])\n            loss = (start_loss + end_loss) / 2.0\n        else:\n            loss = self.loss(labels, logits)\n\n        loss += sum(self.model.losses) * (1.0 / self.args.n_gpu)\n\n        return loss, logits\n\n    def predict(self, test_dataset: tf.data.Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        Args:\n          test_dataset: something similar to a PT Dataset. This is just\n            temporary before to have a framework-agnostic approach for datasets.\n        \"\"\"\n        test_dataset = test_dataset.batch(self.args.eval_batch_size)\n        test_dataset = self.args.strategy.experimental_distribute_dataset(test_dataset)\n\n        return self._prediction_loop(test_dataset, description=\"Prediction\")\n\n    def save_model(self) -> None:\n        \"\"\"\n        Save the pretrained model and create a Tensorflow saved model.\n        \"\"\"\n        logger.info(\"Saving model in {}\".format(self.args.output_dir))\n\n        path = os.path.join(self.args.output_dir, \"saved_model\")\n\n        logger.info(\"Saving model in {}\".format(path))\n        os.makedirs(path, exist_ok=True)\n        self.model.save_pretrained(self.args.output_dir)\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/trainer_utils.py",
    "content": "from typing import Dict, NamedTuple, Optional\n\nimport numpy as np\n\n\nclass EvalPrediction(NamedTuple):\n    \"\"\"\n    Evaluation output (always contains labels), to be used\n    to compute metrics.\n    \"\"\"\n\n    predictions: np.ndarray\n    label_ids: np.ndarray\n\n\nclass PredictionOutput(NamedTuple):\n    predictions: np.ndarray\n    label_ids: Optional[np.ndarray]\n    metrics: Optional[Dict[str, float]]\n\n\nclass TrainOutput(NamedTuple):\n    global_step: int\n    training_loss: float\n\n\nPREFIX_CHECKPOINT_DIR = \"checkpoint\"\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/training_args.py",
    "content": "import dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Any, Dict, Optional, Tuple\n\nfrom .file_utils import cached_property, is_torch_available, torch_required\n\n\nif is_torch_available():\n    import torch\n\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass TrainingArguments:\n    \"\"\"\n    TrainingArguments is the subset of the arguments we use in our example scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    output_dir: str = field(\n        metadata={\"help\": \"The output directory where the model predictions and checkpoints will be written.\"}\n    )\n    overwrite_output_dir: bool = field(\n        default=False,\n        metadata={\n            \"help\": (\n                \"Overwrite the content of the output directory.\"\n                \"Use this to continue training if output_dir points to a checkpoint directory.\"\n            )\n        },\n    )\n\n    do_train: bool = field(default=False, metadata={\"help\": \"Whether to run training.\"})\n    do_eval: bool = field(default=False, metadata={\"help\": \"Whether to run eval on the dev set.\"})\n    do_predict: bool = field(default=False, metadata={\"help\": \"Whether to run predictions on the test set.\"})\n    evaluate_during_training: bool = field(\n        default=False, metadata={\"help\": \"Run evaluation during training at each logging step.\"},\n    )\n\n    per_device_train_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for training.\"}\n    )\n    per_device_eval_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for evaluation.\"}\n    )\n\n    per_gpu_train_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_train_batch_size` is preferred. 
\"\n            \"Batch size per GPU/TPU core/CPU for training.\"\n        },\n    )\n    per_gpu_eval_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_eval_batch_size` is preferred.\"\n            \"Batch size per GPU/TPU core/CPU for evaluation.\"\n        },\n    )\n\n    gradient_accumulation_steps: int = field(\n        default=1,\n        metadata={\"help\": \"Number of updates steps to accumulate before performing a backward/update pass.\"},\n    )\n\n    learning_rate: float = field(default=5e-5, metadata={\"help\": \"The initial learning rate for Adam.\"})\n    lr_end: float = field(default=1e-5, metadata={\"help\": \"学习率最后衰减到多少.\"})\n    weight_decay: float = field(default=0.0, metadata={\"help\": \"Weight decay if we apply some.\"})\n    adam_epsilon: float = field(default=1e-8, metadata={\"help\": \"Epsilon for Adam optimizer.\"})\n    max_grad_norm: float = field(default=1.0, metadata={\"help\": \"Max gradient norm.\"})\n\n    num_train_epochs: float = field(default=3.0, metadata={\"help\": \"Total number of training epochs to perform.\"})\n    max_steps: int = field(\n        default=-1,\n        metadata={\"help\": \"If > 0: set total number of training steps to perform. Override num_train_epochs.\"},\n    )\n    warmup_steps: int = field(default=0, metadata={\"help\": \"Linear warmup over warmup_steps.\"})\n\n    logging_dir: Optional[str] = field(default=None, metadata={\"help\": \"Tensorboard log dir.\"})\n    logging_first_step: bool = field(default=False, metadata={\"help\": \"Log and eval the first global_step\"})\n    logging_steps: int = field(default=500, metadata={\"help\": \"Log every X updates steps.\"})\n    save_steps: int = field(default=500, metadata={\"help\": \"Save checkpoint every X updates steps.\"})\n    save_total_limit: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": (\n                \"Limit the total amount of checkpoints.\"\n                \"Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints\"\n            )\n        },\n    )\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Do not use CUDA even when it is available\"})\n    seed: int = field(default=42, metadata={\"help\": \"random seed for initialization\"})\n\n    fp16: bool = field(\n        default=False,\n        metadata={\"help\": \"Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit\"},\n    )\n    fp16_opt_level: str = field(\n        default=\"O1\",\n        metadata={\n            \"help\": (\n                \"For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3'].\"\n                \"See details at https://nvidia.github.io/apex/amp.html\"\n            )\n        },\n    )\n    local_rank: int = field(default=-1, metadata={\"help\": \"For distributed training: local_rank\"})\n\n    tpu_num_cores: Optional[int] = field(\n        default=None, metadata={\"help\": \"TPU: Number of TPU cores (automatically passed by launcher script)\"}\n    )\n    tpu_metrics_debug: bool = field(default=False, metadata={\"help\": \"TPU: Whether to print debug metrics\"})\n\n    @property\n    def train_batch_size(self) -> int:\n        if self.per_gpu_train_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future \"\n                \"version. 
Using `--per_device_train_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @property\n    def eval_batch_size(self) -> int:\n        if self.per_gpu_eval_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future \"\n                \"version. Using `--per_device_eval_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        elif self.local_rank == -1:\n            # if n_gpu is > 1 we'll use nn.DataParallel.\n            # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        else:\n            # Here, we'll use torch.distributed.\n            # Initializes the distributed backend which will take care of sychronizing nodes/GPUs\n            torch.distributed.init_process_group(backend=\"nccl\")\n            device = torch.device(\"cuda\", self.local_rank)\n            n_gpu = 1\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    def to_sanitized_dict(self) -> Dict[str, Any]:\n        \"\"\"\n        Sanitized serialization to use with TensorBoard’s hparams\n        \"\"\"\n        d = dataclasses.asdict(self)\n        valid_types = [bool, int, float, str]\n        if is_torch_available():\n            valid_types.append(torch.Tensor)\n        return {k: v if type(v) in valid_types else str(v) for k, v in d.items()}\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/training_args_tf.py",
    "content": "import logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom .file_utils import cached_property, is_tf_available, tf_required\nfrom .training_args import TrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\nif is_tf_available():\n    import tensorflow as tf\n\n\n@dataclass\nclass TFTrainingArguments(TrainingArguments):\n    optimizer_name: str = field(\n        default=\"adam\",\n        metadata={\n            \"help\": 'Name of a Tensorflow optimizer among \"adadelta, adagrad, adam, adamax, ftrl, nadam, rmsprop, sgd, adamw\"'\n        },\n    )\n    mode: str = field(\n        default=\"text-classification\",\n        metadata={\"help\": 'Type of task, one of \"text-classification\", \"token-classification\", \"question-answering\"'},\n    )\n    loss_name: str = field(\n        default=\"SparseCategoricalCrossentropy\",\n        metadata={\n            \"help\": \"Name of a Tensorflow loss. For the list see: https://www.tensorflow.org/api_docs/python/tf/keras/losses\"\n        },\n    )\n    tpu_name: str = field(\n        default=None, metadata={\"help\": \"Name of TPU\"},\n    )\n    end_lr: float = field(\n        default=0, metadata={\"help\": \"End learning rate for optimizer\"},\n    )\n    eval_steps: int = field(default=1000, metadata={\"help\": \"Run an evaluation every X steps.\"})\n    debug: bool = field(\n        default=False, metadata={\"help\": \"Activate the trace to record computation graphs and profiling information\"}\n    )\n\n    @cached_property\n    @tf_required\n    def _setup_strategy(self) -> Tuple[\"tf.distribute.Strategy\", int]:\n        logger.info(\"Tensorflow: setting up strategy\")\n        gpus = tf.config.list_physical_devices(\"GPU\")\n\n        if self.no_cuda:\n            strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n        else:\n            try:\n                if self.tpu_name:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(self.tpu_name)\n                else:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()\n            except ValueError:\n                tpu = None\n\n            if tpu:\n                tf.config.experimental_connect_to_cluster(tpu)\n                tf.tpu.experimental.initialize_tpu_system(tpu)\n\n                strategy = tf.distribute.experimental.TPUStrategy(tpu)\n            elif len(gpus) == 0:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n            elif len(gpus) == 1:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/gpu:0\")\n            elif len(gpus) > 1:\n                # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n                strategy = tf.distribute.MirroredStrategy()\n            else:\n                raise ValueError(\"Cannot find the proper strategy please check your environment properties.\")\n\n        return strategy\n\n    @property\n    @tf_required\n    def strategy(self) -> \"tf.distribute.Strategy\":\n        return self._setup_strategy\n\n    @property\n    @tf_required\n    def n_gpu(self) -> int:\n        return self._setup_strategy.num_replicas_in_sync\n"
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/try.py",
    "content": "from transformers import TFAlbertForMaskedLM, TFAlbertModel, TFAlbertForSequenceClassification, AlbertForMaskedLM\nimport os\n\ncheckpoint = \"albert-base-v1\"\n\nmodel = AlbertForMaskedLM.from_pretrained(checkpoint)\n\nif not os.path.exists(\"~/saved/\" + checkpoint):\n    os.makedirs(\"~/saved/\" + checkpoint)\n    \n\nmodel.save_pretrained(\"~/saved/\" + checkpoint)\nmodel = TFAlbertForMaskedLM.from_pretrained('~/saved/' + checkpoint, from_pt=True)\nmodel.save_pretrained(\"~/saved/\" + checkpoint)\nmodel = TFAlbertModel.from_pretrained('~/saved/' + checkpoint)\nmodel = TFAlbertForMaskedLM.from_pretrained('~/saved/' + checkpoint)\nmodel = TFAlbertForSequenceClassification.from_pretrained('~/saved/' + checkpoint)\n\n\nprint(\"nice model\") "
  },
  {
    "path": "code/bert-base-count5/pretrain/transformers1/utils_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\ndef prepare_encoder_decoder_model_kwargs(**kwargs):\n    \"\"\" Prepare the encoder and decoder's keyword arguments.\n\n    Keyword arguments come in 3 flavors:\n    - encoder-specific (prefixed by `encoder_`)\n    - decoder-specific (prefixed by `decoder_`)\n    - those that apply to the model as whole.\n\n    We let the specific kwargs override the common ones in case of\n    conflict.\n    \"\"\"\n\n    kwargs_common = {\n        argument: value\n        for argument, value in kwargs.items()\n        if not argument.startswith(\"encoder_\") and not argument.startswith(\"decoder_\")\n    }\n    if \"input_ids\" in kwargs_common:\n        kwargs[\"encoder_input_ids\"] = kwargs_common.pop(\"input_ids\")\n\n    decoder_kwargs = kwargs_common.copy()\n    encoder_kwargs = kwargs_common.copy()\n    encoder_kwargs.update(\n        {argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")}\n    )\n    decoder_kwargs.update(\n        {argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")}\n    )\n    decoder_kwargs[\"encoder_attention_mask\"] = encoder_kwargs.get(\"attention_mask\", None)\n    return encoder_kwargs, decoder_kwargs\n"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/.ipynb_checkpoints/PyTorch_Bert-Squad_OnnxRuntime_GPU-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright (c) Microsoft Corporation. All rights reserved.  \\n\",\n    \"Licensed under the MIT License.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Inference PyTorch Bert Model with ONNX Runtime on GPU\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime and NVIDIA GPU. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.\\n\",\n    \"\\n\",\n    \"This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Prerequisites ##\\n\",\n    \"It requires your machine to have a GPU, and a python environment with [PyTorch](https://pytorch.org/) installed before running this notebook.\\n\",\n    \"\\n\",\n    \"#### GPU Environment Setup using AnaConda\\n\",\n    \"\\n\",\n    \"First, we install [AnaConda](https://www.anaconda.com/distribution/) in a target machine and open an AnaConda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 1.5.0 and OnnxRuntime 1.3.0.\\n\",\n    \"\\n\",\n    \"```console\\n\",\n    \"conda create -n gpu_env python=3.7\\n\",\n    \"conda activate gpu_env\\n\",\n    \"conda install pytorch torchvision cudatoolkit=10.1 -c pytorch\\n\",\n    \"conda install -c anaconda ipykernel\\n\",\n    \"conda install -c conda-forge ipywidgets\\n\",\n    \"python -m ipykernel install --user --name=gpu_env_py37\\n\",\n    \"jupyter notebook\\n\",\n    \"```\\n\",\n    \"Finally, launch Jupyter Notebook and you can choose gpu_env_py37 as kernel to run this notebook.\\n\",\n    \"\\n\",\n    \"Onnxruntime-gpu need specified version of CUDA and cuDNN. You can find the corresponding version in [requirements](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements). 
If the version is different from above cudatoolkit version, you have to install them separately, and add their bin directories to PATH environment variable (See [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARNING: Skipping onnxruntime-gpu as it is not installed.\\u001b[0m\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import sys\\n\",\n    \"!{sys.executable} -m pip uninstall --quiet --yes onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade transformers\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxconverter_common\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools\\n\",\n    \"!{sys.executable} -m pip install --quiet wget netron pandas\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Load Pretrained Bert model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We begin by downloading the SQuAD data file and store them in the specified location. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"cache_dir = \\\"./squad\\\"\\n\",\n    \"if not os.path.exists(cache_dir):\\n\",\n    \"    os.makedirs(cache_dir)\\n\",\n    \"\\n\",\n    \"predict_file_url = \\\"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\\\"\\n\",\n    \"predict_file = os.path.join(cache_dir, \\\"dev-v1.1.json\\\")\\n\",\n    \"if not os.path.exists(predict_file):\\n\",\n    \"    import wget\\n\",\n    \"    print(\\\"Start downloading predict file.\\\")\\n\",\n    \"    wget.download(predict_file_url, predict_file)\\n\",\n    \"    print(\\\"Predict file downloaded.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's first define some constant variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Whether allow overwriting existing ONNX model and download the latest script from GitHub\\n\",\n    \"enable_overwrite = True\\n\",\n    \"\\n\",\n    \"# Total samples to inference, so that we can get average latency\\n\",\n    \"total_samples = 1000\\n\",\n    \"\\n\",\n    \"# ONNX opset version\\n\",\n    \"opset_version=11\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Specify some model configuration variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# For fine-tuned large model, the model name is \\\"bert-large-uncased-whole-word-masking-finetuned-squad\\\". Here we use bert-base for demo.\\n\",\n    \"model_name_or_path = \\\"bert-base-cased\\\"\\n\",\n    \"max_seq_length = 128\\n\",\n    \"doc_stride = 128\\n\",\n    \"max_query_length = 64\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start to load model from pretrained. This step could take a few minutes. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 48/48 [00:04<00:00, 11.28it/s]\\n\",\n      \"convert squad examples to features: 100%|██████████| 1000/1000 [00:09<00:00, 102.15it/s]\\n\",\n      \"add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 161306.98it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# The following code is adapted from HuggingFace transformers\\n\",\n    \"# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\\n\",\n    \"\\n\",\n    \"from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"\\n\",\n    \"# Load pretrained model and tokenizer\\n\",\n    \"config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\\n\",\n    \"tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\\n\",\n    \"model = model_class.from_pretrained(model_name_or_path,\\n\",\n    \"                                    from_tf=False,\\n\",\n    \"                                    config=config,\\n\",\n    \"                                    cache_dir=cache_dir)\\n\",\n    \"# load some examples\\n\",\n    \"from transformers.data.processors.squad import SquadV1Processor\\n\",\n    \"\\n\",\n    \"processor = SquadV1Processor()\\n\",\n    \"examples = processor.get_dev_examples(None, filename=predict_file)\\n\",\n    \"\\n\",\n    \"from transformers import squad_convert_examples_to_features\\n\",\n    \"features, dataset = squad_convert_examples_to_features( \\n\",\n    \"            examples=examples[:total_samples], # convert enough examples for this notebook\\n\",\n    \"            tokenizer=tokenizer,\\n\",\n    \"            max_seq_length=max_seq_length,\\n\",\n    \"            doc_stride=doc_stride,\\n\",\n    \"            max_query_length=max_query_length,\\n\",\n    \"            is_training=False,\\n\",\n    \"            return_dataset='pt'\\n\",\n    \"        )\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. 
Export the loaded model ##\\n\",\n    \"Once the model is loaded, we can export the loaded PyTorch model to ONNX.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Model exported at  ./onnx/bert-base-cased-squad_opset11.onnx\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"output_dir = \\\"./onnx\\\"\\n\",\n    \"if not os.path.exists(output_dir):\\n\",\n    \"    os.makedirs(output_dir)   \\n\",\n    \"export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))\\n\",\n    \"\\n\",\n    \"import torch\\n\",\n    \"use_gpu = torch.cuda.is_available()\\n\",\n    \"device = torch.device(\\\"cuda\\\" if use_gpu else \\\"cpu\\\")\\n\",\n    \"\\n\",\n    \"# Get the first example data to run the model and export it to ONNX\\n\",\n    \"data = dataset[0]\\n\",\n    \"inputs = {\\n\",\n    \"    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"# Set model to inference mode, which is required before exporting the model because some operators behave differently in \\n\",\n    \"# inference and training mode.\\n\",\n    \"model.eval()\\n\",\n    \"model.to(device)\\n\",\n    \"\\n\",\n    \"if enable_overwrite or not os.path.exists(export_model_path):\\n\",\n    \"    with torch.no_grad():\\n\",\n    \"        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\\n\",\n    \"        torch.onnx.export(model,                                            # model being run\\n\",\n    \"                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\\n\",\n    \"                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\\n\",\n    \"                          opset_version=opset_version,                      # the ONNX version to export the model to\\n\",\n    \"                          do_constant_folding=True,                         # whether to execute constant folding for optimization\\n\",\n    \"                          input_names=['input_ids',                         # the model's input names\\n\",\n    \"                                       'input_mask', \\n\",\n    \"                                       'segment_ids'],\\n\",\n    \"                          output_names=['start', 'end'],                    # the model's output names\\n\",\n    \"                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\\n\",\n    \"                                        'input_mask' : symbolic_names,\\n\",\n    \"                                        'segment_ids' : symbolic_names,\\n\",\n    \"                                        'start' : symbolic_names,\\n\",\n    \"                                        'end' : symbolic_names})\\n\",\n    \"        print(\\\"Model exported at \\\", export_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. 
PyTorch Inference ##\\n\",\n    \"Use PyTorch to evaluate an example input for comparison purpose.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"PyTorch cuda Inference time = 16.57 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\\n\",\n    \"latency = []\\n\",\n    \"with torch.no_grad():\\n\",\n    \"    for i in range(total_samples):\\n\",\n    \"        data = dataset[i]\\n\",\n    \"        inputs = {\\n\",\n    \"            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"        }\\n\",\n    \"        start = time.time()\\n\",\n    \"        outputs = model(**inputs)\\n\",\n    \"        latency.append(time.time() - start)\\n\",\n    \"print(\\\"PyTorch {} Inference time = {} ms\\\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Inference ONNX Model with ONNX Runtime ##\\n\",\n    \"\\n\",\n    \"### CUDA and cuDNN Path\\n\",\n    \"onnxruntime-gpu has dependency on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn):\\n\",\n    \"\\n\",\n    \"* [onnxruntime-gpu v1.3.0](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"* [onnxruntime-gpu v1.2.0](https://github.com/microsoft/onnxruntime/releases/tag/v1.2.0) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"\\n\",\n    \"During installing PyTorch 1.5, we installed cudatoolkit 10.1.243 in this conda environment. That shall be good for onnxruntime-gpu 1.3.0 in Jupyter Notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Change to True when onnxruntime (like onnxruntime-gpu 1.0.0 ~ 1.1.2) cannot be imported.\\n\",\n    \"add_cuda_path = False\\n\",\n    \"\\n\",\n    \"if add_cuda_path:\\n\",\n    \"    # Add path of CUDA 10.0 and CUDNN 7.6 for onnxruntime-gpu 1.0.0 ~ 1.1.2\\n\",\n    \"    cuda_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    cudnn_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):\\n\",\n    \"        raise ValueError(\\\"Please specify correct path for CUDA and cuDNN. Otherwise onnxruntime cannot be imported.\\\")\\n\",\n    \"    else:\\n\",\n    \"        if cuda_dir == cudnn_dir:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + os.environ[\\\"PATH\\\"]\\n\",\n    \"        else:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + cudnn_dir + ';' + os.environ[\\\"PATH\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### OpenMP Environment Variable\\n\",\n    \"\\n\",\n    \"OpenMP environment variables are optional for GPU inference of standard Bert model. It has little performance impact on Bert model since most nodes are executed in GPU. 
\\n\",\n    \"\\n\",\n    \"You can find the best setting based on [Performance Test Tool](#Performance-Test-Tool) result in later part of this notebook.\\n\",\n    \"\\n\",\n    \"**Attention: Setting environment variables shall be done before importing onnxruntime**. Otherwise, they might not take effect.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Optional. You can change them according to Performance Test Tool result.\\n\",\n    \"#os.environ[\\\"OMP_NUM_THREADS\\\"] = '1'\\n\",\n    \"#os.environ[\\\"OMP_WAIT_POLICY\\\"] = 'PASSIVE'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are ready to inference the model with ONNX Runtime.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"OnnxRuntime gpu Inference time = 4.43 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import psutil\\n\",\n    \"import onnxruntime\\n\",\n    \"import numpy\\n\",\n    \"\\n\",\n    \"assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\\n\",\n    \"device_name = 'gpu'\\n\",\n    \"\\n\",\n    \"sess_options = onnxruntime.SessionOptions()\\n\",\n    \"\\n\",\n    \"# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\\n\",\n    \"# Note that this will increase session creation time so enable it for debugging only.\\n\",\n    \"sess_options.optimized_model_filepath = os.path.join(output_dir, \\\"optimized_model_{}.onnx\\\".format(device_name))\\n\",\n    \"\\n\",\n    \"# Please change the value according to best setting in Performance Test Tool result.\\n\",\n    \"sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)\\n\",\n    \"\\n\",\n    \"session = onnxruntime.InferenceSession(export_model_path, sess_options)\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # TODO: use IO Binding (see https://github.com/microsoft/onnxruntime/pull/4206) to improve performance.\\n\",\n    \"    ort_inputs = {\\n\",\n    \"        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    ort_outputs = session.run(None, ort_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"    \\n\",\n    \"print(\\\"OnnxRuntime {} Inference time = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare the output of PyTorch and ONNX Runtime. We can see some results are not close. It is because ONNX Runtime uses some approximation in CUDA optimization. 
Based on our evaluation on SQuAD data set, F1 score is on par for models before and after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Verifying correctness *****\\n\",\n      \"PyTorch and ONNX Runtime output 0 are close: True\\n\",\n      \"maximum_diff=9.499490261077881e-07 average_diff=1.4225952327251434e-07\\n\",\n      \"PyTorch and ONNX Runtime output 1 are close: True\\n\",\n      \"maximum_diff=6.92903995513916e-07 average_diff=1.2441887520253658e-07\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Verifying correctness *****\\\")\\n\",\n    \"for i in range(2):    \\n\",\n    \"    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))\\n\",\n    \"    diff = ort_outputs[i] - outputs[i].cpu().numpy()\\n\",\n    \"    max_diff = numpy.max(numpy.abs(diff))\\n\",\n    \"    avg_diff = numpy.average(numpy.abs(diff))\\n\",\n    \"    print(f'maximum_diff={max_diff} average_diff={avg_diff}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Inference with Actual Sequence Length\\n\",\n    \"Note that ONNX model is exported using dynamic length axis. It is recommended to use actual sequence input without padding instead of fixed length input for best performance. Let's see how it can be applied to this model.\\n\",\n    \"\\n\",\n    \"From an example input below, we can see zero padding at the end of each sequence.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'input_ids': tensor([[  101,  1293,  1242,  2557,  1127,  1226,  1104,  1103,  3613, 16429,\\n\",\n       \"           5235,   136,   102,  3613, 16429,  5988,   170,   107,  1353,  1671,\\n\",\n       \"           1992,  1342,   107,  5235,   117,  1107,  1134,  1473,  3683,  3538,\\n\",\n       \"           1125,   170,  1476,   118,  1248,  2595,  4086,  1714,  1104,  2965,\\n\",\n       \"          15897,  1104,  3613, 16429,   119,  1473,  3683,  3538,  3222,  1149,\\n\",\n       \"           2551,  1168, 23759,  1116,  1121,  1506,  1103, 10280,  2231,  1111,\\n\",\n       \"           1103,  1714, 16355,   119,   102,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0]],\\n\",\n       \"        device='cuda:0'),\\n\",\n       \" 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),\\n\",\n       \" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# An example input (we can see padding). From attention_mask, we can deduce the actual length.\\n\",\n    \"inputs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The original sequence length is 128. After removing paddings, the sequence length is reduced. Input with smaller sequence length need less computation, thus we can see there is improvement on inference latency. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Average length 101\\n\",\n      \"OnnxRuntime gpu Inference time with actual sequence length = 4.23 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import statistics\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"lengths = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.\\n\",\n    \"    actual_sequence_length = sum(data[1].numpy())\\n\",\n    \"    lengths.append(actual_sequence_length)\\n\",\n    \"    opt_inputs = {\\n\",\n    \"        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    opt_outputs = session.run(None, opt_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"print(\\\"Average length\\\", statistics.mean(lengths))\\n\",\n    \"print(\\\"OnnxRuntime {} Inference time with actual sequence length = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's compare the output and see whether the results are close.\\n\",\n    \"\\n\",\n    \"**Note**: Need end-to-end evaluation on performance and accuracy if you use this strategy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Comparing results with/without paddings *****\\n\",\n      \"Output 0 are close: True\\n\",\n      
\"Output 1 are close: True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Comparing results with/without paddings *****\\\")\\n\",\n    \"for i in range(2):\\n\",\n    \"    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Offline Optimization and Test Tools\\n\",\n    \"\\n\",\n    \"It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Transformer Optimizer\\n\",\n    \"\\n\",\n    \"Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\\n\",\n    \"* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \\n\",\n    \"* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\\n\",\n    \"* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\\n\",\n    \"\\n\",\n    \"We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\\n\",\n    \"\\n\",\n    \"In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\\n\",\n    \"\\n\",\n    \"It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\\n\",\n    \"\\n\",\n    \"Example Usage:\\n\",\n    \"```\\n\",\n    \"from onnxruntime_tools import optimizer\\n\",\n    \"optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\\n\",\n    \"optimized_model.save_model_to_file(optimized_model_path)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You can also use optimizer_cli like the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Float32 Model\\n\",\n    \"Let us optimize the ONNX model using the script. The first example will output model with float32 to store weights. 
This is the choice for most GPUs without Tensor Core.\\n\",\n    \"\\n\",\n    \"If your GPU (like V100 or T4) has Tensor Core, jump to [Float16 Model](#6.-Model-Optimization-with-Float16) section since that will give you better performance than Float32 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp32_model_path\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Optimized Graph\\n\",\n    \"We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.\\n\",\n    \"\\n\",\n    \"The graph is like the following:\\n\",\n    \"<img src='images/optimized_bert_gpu.png'>\\n\",\n    \"\\n\",\n    \"Sometime, optimized graph is slightly different. 
For example, FastGelu is replaced by BiasGelu for CPU inference; When the option --input_int32 is used, Cast nodes for inputs are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import netron\\n\",\n    \"\\n\",\n    \"# change it to True if want to view the optimized model in browser\\n\",\n    \"enable_netron = False\\n\",\n    \"if enable_netron:\\n\",\n    \"    # If you encounter error \\\"access a socket in a way forbidden by its access permissions\\\", install Netron as standalone application instead.\\n\",\n    \"    netron.start(optimized_fp32_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Performance Test Tool\\n\",\n    \"\\n\",\n    \"The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.\\n\",\n    \"\\n\",\n    \"Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.\\n\",\n    \"\\n\",\n    \"**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Attional Info](#7.-Additional-Info) for more info.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.92 ms, Throughput = 203.24 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.90 ms, Throughput = 203.88 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 5.07 ms, Throughput = 197.16 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.82 ms, Throughput = 207.33 QPS\\n\",\n      \"skip duplicated test: 
model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.93 ms, Throughput = 202.92 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.91 ms, Throughput = 203.55 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.88 ms, Throughput = 204.90 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's load the summary file and take a look. 
Note that blank value in OMP_NUM_THREADS or OMP_WAIT_POLICY means the environment variable does not exist.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.82</td>\\n\",\n       \"      <td>4.53</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>5.15</td>\\n\",\n       \"      <td>7.25</td>\\n\",\n       \"      <td>8.75</td>\\n\",\n       \"      <td>207.33</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.88</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.58</td>\\n\",\n       \"      <td>6.47</td>\\n\",\n       \"      <td>7.13</td>\\n\",\n       \"      <td>8.68</td>\\n\",\n       \"      <td>204.90</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.90</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>6.16</td>\\n\",\n       \"      <td>7.64</td>\\n\",\n       \"      <td>8.82</td>\\n\",\n       \"      <td>203.88</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>4.91</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.70</td>\\n\",\n       \"      <td>7.43</td>\\n\",\n       \"      <td>8.78</td>\\n\",\n      
 \"      <td>203.55</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4.92</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>4.60</td>\\n\",\n       \"      <td>6.50</td>\\n\",\n       \"      <td>7.82</td>\\n\",\n       \"      <td>8.90</td>\\n\",\n       \"      <td>203.24</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>4.93</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.57</td>\\n\",\n       \"      <td>8.80</td>\\n\",\n       \"      <td>202.92</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>5.07</td>\\n\",\n       \"      <td>4.56</td>\\n\",\n       \"      <td>4.61</td>\\n\",\n       \"      <td>7.19</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>9.01</td>\\n\",\n       \"      <td>197.16</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         4.82         4.53         4.57         5.15         7.25   \\n\",\n       \"1         4.88         4.54         4.58         6.47         7.13   \\n\",\n       \"2         4.90         4.54         4.57         6.16         7.64   \\n\",\n       \"3         4.91         4.55         4.59         6.70         7.43   \\n\",\n       \"4         4.92         4.57         4.60         6.50         7.82   \\n\",\n       \"5         4.93         4.55         4.59         6.66         7.57   \\n\",\n       \"6         5.07         4.56         4.61         7.19         8.11   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         8.75           207.33                     1              12   \\n\",\n       \"1         8.68           204.90                    12              12   \\n\",\n       \"2         8.82           203.88                     1              12   \\n\",\n       \"3         8.78           203.55                    12              12   \\n\",\n       \"4         8.90           203.24                     0                   \\n\",\n       \"5         8.80           202.92                    12               1   \\n\",\n       \"6         9.01           197.16                    12               1   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1         PASSIVE       None    True  \\n\",\n       \"2         PASSIVE       None    True  \\n\",\n       \"3     
     ACTIVE       None    True  \\n\",\n       \"4                       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6          ACTIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"From above result, we can see that latency is very close for different settings. The default setting (intra_op_num_threads=0, OMP_NUM_THREADS and OMP_WAIT_POLICY does not exist) performs the best. \\n\",\n    \"\\n\",\n    \"### Model Results Comparison Tool\\n\",\n    \"\\n\",\n    \"When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare the inference outputs of the original and optimized models. If outputs are all close, it is safe to use the optimized model.\\n\",\n    \"\\n\",\n    \"For GPU inference, the absolute or relative difference is larger than those numbers of CPU inference. Note that slight difference in output will not impact final result. We did end-to-end evaluation using SQuAD data set using a fine-tuned squad model, and F1 score is almost the same before/after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).\\r\\n\",\n      \"maximum absolute difference=1.9222497940063477e-06\\r\\n\",\n      \"maximum relative difference=0.05027933046221733\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!python -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Model Optimization with Float16\\n\",\n    \"\\n\",\n    \"The optimizer.py script have an option **--float16** to convert model to use float16 to store weights. After the conversion, it could be faster to run in GPU with tensor cores like V100 or T4.\\n\",\n    \"\\n\",\n    \"Let's run tools to measure the performance on V100. 
The results show significant performance improvement: latency is about 3.4 ms for float32 model, and 1.8 ms for float16 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp16_model_path --float16\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.90 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.12 ms, Throughput = 320.00 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.02 ms, Throughput = 
331.39 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 332.53 QPS\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 328.67 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.72 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 329.32 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      
<th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>5.08</td>\\n\",\n       \"      <td>7.16</td>\\n\",\n       \"      <td>332.53</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.88</td>\\n\",\n       \"      <td>4.52</td>\\n\",\n       \"      <td>7.05</td>\\n\",\n       \"      <td>331.90</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.78</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>5.01</td>\\n\",\n       \"      <td>7.02</td>\\n\",\n       \"      <td>331.72</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3.02</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.85</td>\\n\",\n       \"      <td>6.34</td>\\n\",\n       \"      <td>7.04</td>\\n\",\n       \"      <td>331.39</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.93</td>\\n\",\n       \"      <td>5.56</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>329.32</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>6.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>328.67</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>6</th>\\n\",\n       \"      <td>3.12</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.96</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.20</td>\\n\",\n       \"      <td>320.00</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.01         2.79         2.81         2.86         5.08   \\n\",\n       \"1         3.01         2.80         2.81         2.88         4.52   \\n\",\n       \"2         3.01         2.78         2.80         2.92         5.01   \\n\",\n       \"3         3.02         2.79         2.80         2.85         6.34   \\n\",\n       \"4         3.04         2.80         2.82         2.93         5.56   \\n\",\n       \"5         3.04         2.79         2.81         2.92         6.37   \\n\",\n       \"6         3.12         2.79         2.82         2.96         6.66   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         7.16           332.53                     1              12   \\n\",\n       \"1         7.05           331.90                     0                   \\n\",\n       \"2         7.02           331.72                    12              12   \\n\",\n       \"3         7.04           331.39                    12               1   \\n\",\n       \"4         7.08           329.32                    12              12   \\n\",\n       \"5         7.08           328.67                    12               1   \\n\",\n       \"6         7.20           320.00                     1              12   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1                       None    True  \\n\",\n       \"2          ACTIVE       None    True  \\n\",\n       \"3          ACTIVE       None    True  \\n\",\n       \"4         PASSIVE       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6         PASSIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Throughput Tuning\\n\",\n    \"\\n\",\n    \"Some application need best throughput under some constraint on latency. 
This can be done by testing performance of different batch sizes. The tool could help on this.\\n\",\n    \"\\n\",\n    \"Here is an example that check the performance of multiple batch sizes (1, 2, 4, 8, 16, 32 and 64) using default settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=32, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=32 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 16.17 ms, Throughput = 1979.41 QPS\\n\",\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.00 ms, Throughput = 333.83 QPS\\n\",\n      \"test setting TestSetting(batch_size=2, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=2 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.59 ms, Throughput = 557.32 QPS\\n\",\n      \"test setting TestSetting(batch_size=64, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=64 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=64,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 29.26 ms, Throughput = 2187.15 QPS\\n\",\n      \"test setting TestSetting(batch_size=4, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      
\"Generating 1000 samples for batch_size=4 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=4,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.32 ms, Throughput = 926.92 QPS\\n\",\n      \"test setting TestSetting(batch_size=8, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=8 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=8,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 6.32 ms, Throughput = 1266.63 QPS\\n\",\n      \"test setting TestSetting(batch_size=16, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=16 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=16,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 9.60 ms, Throughput = 1666.05 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 64 --sequence_length 128 --samples 1000 --test_times 1 --inclusive $THREAD_SETTING $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float16 model summary from ./onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      
<th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>batch_size</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.00</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>4.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>333.83</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.59</td>\\n\",\n       \"      <td>3.33</td>\\n\",\n       \"      <td>3.35</td>\\n\",\n       \"      <td>3.42</td>\\n\",\n       \"      <td>6.60</td>\\n\",\n       \"      <td>7.54</td>\\n\",\n       \"      <td>557.32</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.32</td>\\n\",\n       \"      <td>3.98</td>\\n\",\n       \"      <td>4.01</td>\\n\",\n       \"      <td>4.64</td>\\n\",\n       \"      <td>7.23</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>926.92</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>6.32</td>\\n\",\n       \"      <td>5.94</td>\\n\",\n       \"      <td>5.97</td>\\n\",\n       \"      <td>7.61</td>\\n\",\n       \"      <td>8.96</td>\\n\",\n       \"      <td>10.12</td>\\n\",\n       \"      <td>1266.63</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>9.60</td>\\n\",\n       \"      <td>9.22</td>\\n\",\n       \"      <td>9.25</td>\\n\",\n       \"      <td>11.32</td>\\n\",\n       \"      <td>12.33</td>\\n\",\n       \"      <td>13.34</td>\\n\",\n       \"      <td>1666.05</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>16.17</td>\\n\",\n       \"      <td>15.80</td>\\n\",\n       \"      <td>15.90</td>\\n\",\n       \"      <td>17.38</td>\\n\",\n       \"      <td>18.80</td>\\n\",\n       \"      <td>19.93</td>\\n\",\n       \"      <td>1979.41</td>\\n\",\n       \"      <td>32</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>29.26</td>\\n\",\n       \"      <td>28.89</td>\\n\",\n       \"      <td>29.01</td>\\n\",\n       \"      <td>30.63</td>\\n\",\n       \"      <td>32.53</td>\\n\",\n       \"      <td>33.28</td>\\n\",\n       \"      <td>2187.15</td>\\n\",\n       \"      <td>64</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.00         2.79         2.81         2.86         4.37   \\n\",\n       \"1         3.59         3.33         3.35         3.42         6.60   \\n\",\n       \"2         4.32         3.98         4.01         4.64         7.23   \\n\",\n       \"3         6.32         5.94         5.97         7.61         8.96   \\n\",\n       \"4         9.60         9.22         9.25        11.32        
12.33   \\n\",\n       \"5        16.17        15.80        15.90        17.38        18.80   \\n\",\n       \"6        29.26        28.89        29.01        30.63        32.53   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  batch_size  \\n\",\n       \"0         7.08           333.83           1  \\n\",\n       \"1         7.54           557.32           2  \\n\",\n       \"2         8.11           926.92           4  \\n\",\n       \"3        10.12          1266.63           8  \\n\",\n       \"4        13.34          1666.05          16  \\n\",\n       \"5        19.93          1979.41          32  \\n\",\n       \"6        33.28          2187.15          64  \"\n      ]\n     },\n     \"execution_count\": 26,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float16 model summary from\\\", latest_result_file)\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'test_cases', 'test_times', 'use_gpu', 'warmup', 'sequence_length']\\n\",\n    \"columns_to_remove.extend(['intra_op_num_threads', 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', 'contiguous'])\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Additional Info\\n\",\n    \"\\n\",\n    \"Note that running Jupyter Notebook has significant impact on performance result. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate performance numbers.\\n\",\n    \"\\n\",\n    \"We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it measure inference speed of OnnxRuntime.\\n\",\n    \"\\n\",\n    \"[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\\n\",\n    \"\\n\",\n    \"Here is the machine configuration that generated the above results. 
You might get slower or faster result according to your hardware.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\r\\n\",\n      \"  \\\"gpu\\\": {\\r\\n\",\n      \"    \\\"driver_version\\\": \\\"440.64.00\\\",\\r\\n\",\n      \"    \\\"devices\\\": [\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 14110883840,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      },\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 16932601856,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      }\\r\\n\",\n      \"    ]\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"cpu\\\": {\\r\\n\",\n      \"    \\\"brand\\\": \\\"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\\\",\\r\\n\",\n      \"    \\\"cores\\\": 12,\\r\\n\",\n      \"    \\\"logical_cores\\\": 12,\\r\\n\",\n      \"    \\\"hz\\\": \\\"2.5940 GHz\\\",\\r\\n\",\n      \"    \\\"l2_cache\\\": \\\"256 KB\\\",\\r\\n\",\n      \"    \\\"l3_cache\\\": \\\"35840 KB\\\",\\r\\n\",\n      \"    \\\"processor\\\": \\\"x86_64\\\"\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"memory\\\": {\\r\\n\",\n      \"    \\\"total\\\": 236645588992,\\r\\n\",\n      \"    \\\"available\\\": 222567559168\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"python\\\": \\\"3.7.7.final.0 (64 bit)\\\",\\r\\n\",\n      \"  \\\"os\\\": \\\"Linux-4.15.0-1089-azure-x86_64-with-debian-stretch-sid\\\",\\r\\n\",\n      \"  \\\"onnxruntime\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.3.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"pytorch\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.5.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"tensorflow\\\": null\\r\\n\",\n      \"}\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"PyCharm (ccks_ner-master)\",\n   \"language\": \"python\",\n   \"name\": \"pycharm-de4c0941\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport logging\nimport torch\n\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_nezha import NeZhaConfig\nfrom transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward\nfrom transformers.modeling_utils import PreTrainedModel, prune_linear_layer\nfrom transformers.models.bert.modeling_bert import (\n    BertOutput,\n    BertPooler,\n    BertSelfOutput,\n    BertIntermediate,\n    BertOnlyMLMHead,\n    BertOnlyNSPHead,\n    BertPreTrainingHeads,\n    BERT_START_DOCSTRING,\n    BERT_INPUTS_DOCSTRING,\n)\n\nlogger = logging.getLogger(__name__)\n\n_CONFIG_FOR_DOC = \"NeZhaConfig\"\n_TOKENIZER_FOR_DOC = \"NeZhaTokenizer\"\n\nNEZHA_PRETRAINED_MODEL_ARCHIVE_LIST = []\nNEZHA_PRETRAINED_MODEL_ARCHIVE_MAP = {}\n\n\ndef load_tf_weights_in_nezha(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        # logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n                n in [\"adam_v\", \"adam_m\", \"lamb_m\", \"lamb_v\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\",\n                      \"global_step\", \"good_steps\", \"loss_scale\", 'bad_steps']\n                for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer 
= getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                    pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass NeZhaEmbeddings(nn.Module):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.use_relative_position = config.use_relative_position\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=127):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\n\nclass 
NeZhaSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=self.attention_head_size,\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n        relations_keys = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_keys.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = 
key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        relations_values = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_values)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass NeZhaAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = NeZhaSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = 
self.pruned_heads.union(heads)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass NeZhaLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = NeZhaAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = NeZhaAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([NeZhaLayer(config) for _ in range(config.num_hidden_layers)])\n\n\n    def forward(\n            self,\n            hidden_states,\n            attention_mask=None,\n            head_mask=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass NeZhaPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n    config_class = NeZhaConfig\n    pretrained_model_archive_map = NEZHA_PRETRAINED_MODEL_ARCHIVE_MAP\n    load_tf_weights = load_tf_weights_in_nezha\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(NeZhaPreTrainedModel):\n    \"\"\"\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as a decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        self.embeddings = NeZhaEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n        self.pooler = BertPooler(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(\n            attention_mask, input_shape, self.device\n        )\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            
encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n                                                      1:\n                                                      ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForPreTraining(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n        # add hidden states and attention if they are here\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[2:]\n\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, seq_relationship_score, (hidden_states), 
(attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyMLMHead(config)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n            labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers import BertTokenizer, 
BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        masked_lm_labels = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass NeZhaForNextSentencePrediction(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        seq_relationship_scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.cls(pooled_output)\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n        
    loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForSequenceClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            position_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForMultipleChoice(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForTokenClassification(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            position_ids=None,\n            inputs_embeds=None,\n            labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep 
active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaForQuestionAnswering(NeZhaPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.bert = NeZhaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    def forward(\n            self,\n            input_ids=None,\n            attention_mask=None,\n            token_type_ids=None,\n            head_mask=None,\n            inputs_embeds=None,\n            position_ids=None,\n            start_positions=None,\n            end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the 
output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        if self.isDropout:\n            output = self.dropout(pooler_output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dense = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][:, 0]  # [CLS] representation of each layer\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)  # learned scalar weight per layer\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, 
self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dense(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dense = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)  # mean-pooled representation of each layer\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)  # learned scalar weight per layer\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dense(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        #self.bert_model = MODELS[config.model](config=self.bert_config)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = 
self.classifier(concat_out)\n        return logit\n\n\n"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/models/gitkeep",
    "content": ""
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/multi_gpu_QA.py",
    "content": "from tqdm import tqdm, trange\nimport numpy as np\nimport pandas as pd\nimport logging\nimport torch\nimport random\nimport os\nfrom torch import nn, optim\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nfrom transformers.optimization import get_linear_schedule_with_warmup\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.metrics import mean_absolute_error, accuracy_score, f1_score, roc_auc_score\nfrom model import *\nfrom utils import *\nimport time\nimport logging\nlogging.basicConfig(level=logging.DEBUG, filename=\"train.log\",filemode='a')\n\n\nfrom NEZHA.modeling_nezha import *\n\nMODEL_CLASSES = {\n    'BertForClass': BertForClass,\n    'BertLastCls': BertLastCls,\n    'BertLastTwoCls': BertLastTwoCls,\n    'BertLastTwoClsPooler': BertLastTwoClsPooler,\n    'BertLastTwoEmbeddings': BertLastTwoEmbeddings,\n    'BertLastTwoEmbeddingsPooler': BertLastTwoEmbeddingsPooler,\n    'BertLastFourCls': BertLastFourCls,\n    'BertLastFourClsPooler': BertLastFourClsPooler,\n    'BertLastFourEmbeddings': BertLastFourEmbeddings,\n    'BertLastFourEmbeddingsPooler': BertLastFourEmbeddingsPooler,\n    'BertDynCls': BertDynCls,\n    'BertDynEmbeddings': BertDynEmbeddings,\n    'BertRNN': BertRNN,\n    'BertCNN': BertCNN,\n    'BertRCNN': BertRCNN,\n    'XLNet': XLNet,\n    'Electra': Electra,\n    'NEZHA': NEZHA,\n\n}\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"BertLastFourCls\"\n        self.Stratification = False\n        self.model_path = '../../bert-base-count5/pretrain/bert_model/'\n\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 32\n        self.epoch = 3\n        self.learn_rate = 4e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 32\n        self.k_fold = 10\n        self.seed = 42\n\n        self.device = torch.device('cuda')\n        # self.device = torch.device('cpu')\n\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n\nconfig = Config()\nos.environ['PYTHONHASHSEED']='0'#消除hash算法的随机性\nrandom.seed(config.seed)\nnp.random.seed(config.seed)\ntorch.manual_seed(config.seed)\ntorch.cuda.manual_seed_all(config.seed)\n\n\nfile_path = './log/'\n# 创建一个logger\nlogger = logging.getLogger('mylogger')\nlogger.setLevel(logging.DEBUG)\n\n\ntrain = pd.read_csv('/tcdata/gaiic_track3_round1_train_20210228.tsv',sep='\\t',header=None)\nsemi = pd.read_csv('/tcdata/gaiic_track3_round2_train_20210407.tsv',sep='\\t',header=None)\ntrain = pd.concat([train, semi], sort=False)\ntrain.columns=['q1','q2','label']\n\n\ntrain_query1 = train['q1'].values.astype(str)\ntrain_query2 = train['q2'].values.astype(str)\ntrain_label = train['label'].values.astype(int)\n\n\noof_train = np.zeros((len(train), config.num_class), dtype=np.float32)\n\n\n#kf = StratifiedKFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\nkf = KFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\n\nfor fold, (train_index, valid_index) in enumerate(kf.split(train_query1, train_label)):\n\n    print('\\n\\n------------fold:{}------------\\n'.format(fold))\n\n    '''\n    q1 = train_query1[train_index]\n    q2 = train_query2[train_index]\n    y = train_label[train_index]\n    '''\n    q1 = train_query1\n    q2 = train_query2\n    y = train_label\n\n\n    val_q1 = train_query1[valid_index]\n    val_q2 = 
train_query2[valid_index]\n    val_y = train_label[valid_index]\n\n    train_D = data_generator([q1, q2, y], config, shuffle=True)\n    val_D = data_generator([val_q1, val_q2, val_y], config)\n\n    model = MODEL_CLASSES[config.model](config).to(config.device)\n\n    if torch.cuda.device_count() > 1:\n        print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n        model = torch.nn.DataParallel(model)\n\n\n    if config.pgd:\n        pgd = PGD(model)\n        K = 3\n\n    elif config.fgm:\n        fgm = FGM(model)\n\n    if config.focalloss:\n        loss_fn = FocalLoss(config.num_class)\n    else:\n        loss_fn = nn.CrossEntropyLoss()  # BCEWithLogitsLoss就是把Sigmoid-BCELoss合成一步\n\n\n    num_train_steps = int(len(train) / config.batch_size * config.epoch)\n    param_optimizer = list(model.named_parameters())\n\n    no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n\n    if config.Stratification:\n        bert_params = [x for x in param_optimizer if 'bert' in x[0]]\n        normal_params = [p for n, p in param_optimizer if 'bert' not in n]\n        optimizer_parameters = [\n            {'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n            {'params': normal_params, 'lr': config.normal_lr},\n        ]\n    else:\n        optimizer_parameters = [\n            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n        ]\n\n    optimizer = AdamW(optimizer_parameters, lr=config.learn_rate) # lr为全局学习率\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer,\n        num_warmup_steps=int(len(train) / config.batch_size / 2),\n        num_training_steps=num_train_steps\n    )\n\n    best_auc = 0\n    PATH = './models/bert_{}.pth'.format(fold)\n    save_model_path = './models/'\n    if not os.path.exists(save_model_path):\n        os.makedirs(save_model_path)\n\n    for e in range(config.epoch):\n        print('\\n------------epoch:{}------------'.format(e))\n        model.train()\n        acc = 0\n        train_len = 0\n        loss_num = 0\n        tq = tqdm(train_D,ncols=70,disable=True)\n        last=time.time()\n        for input_ids, input_masks, segment_ids, labels in tq:\n            label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n            y_pred = model(input_ids, input_masks, segment_ids)\n\n            loss = loss_fn(y_pred, label_t)\n            loss = loss.mean()\n            loss.backward()\n\n            if config.pgd:\n                pgd.backup_grad()\n                # 对抗训练\n                for t in range(K):\n                    pgd.attack(is_first_attack=(t == 0))  # 在embedding上添加对抗扰动, first attack时备份param.data\n                    if t != K - 1:\n                        model.zero_grad()\n                    else:\n                        pgd.restore_grad()\n                    y_pred = model(input_ids, input_masks, segment_ids)\n\n                    loss_adv = loss_fn(y_pred, label_t)\n                    loss_adv = loss_adv.mean()\n                    loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                pgd.restore()  # 恢复embedding参数\n\n            elif config.fgm:\n                # 对抗训练\n                fgm.attack()  # 在embedding上添加对抗扰动\n                y_pred = 
model(input_ids, input_masks, segment_ids)\n                loss_adv = loss_fn(y_pred, label_t)\n                loss_adv = loss_adv.mean()\n                loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                fgm.restore()  # 恢复embedding参数\n\n\n            # 梯度下降，更新参数\n            optimizer.step()\n            scheduler.step()  # Update learning rate schedule\n            model.zero_grad()\n\n            y_pred = np.argmax(y_pred.detach().to(\"cpu\").numpy(), axis=1)\n            acc += sum(y_pred == labels)\n            loss_num += loss.item()\n            train_len += len(labels)\n            tq.set_postfix(fold=fold, epoch=e, loss=loss_num / train_len, acc=acc / train_len)\n        print(f\"微调第{e}轮耗时：{time.time()-last}\")\n        model.eval()\n        with torch.no_grad():\n            y_p = []\n            y_l = []\n            train_logit = None\n            for input_ids, input_masks, segment_ids, labels in tqdm(val_D,disable=True):\n                label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n                y_pred = model(input_ids, input_masks, segment_ids)\n                y_pred = F.softmax(y_pred)\n                y_pred = y_pred.detach().to(\"cpu\").numpy()\n                if train_logit is None:\n                    train_logit = y_pred\n                else:\n                    train_logit = np.vstack((train_logit, y_pred))\n\n                y_p += list(y_pred[:,1])\n\n                y_pred = np.argmax(y_pred, axis=1)\n                y_l += list(y_pred)\n\n\n            f1 = f1_score(val_y, y_l, average=\"macro\")\n            auc_score = roc_auc_score(val_y, y_p)\n            print(\"best_auc:{}  auc_score:{}  f1:{}\\n\".format(best_auc, auc_score, f1))\n            if auc_score >= best_auc:\n                best_auc = auc_score\n                oof_train[valid_index] = np.array(train_logit)\n                #torch.save(model.module.state_dict() if hasattr(model, \"module\") else model.state_dict(), PATH)\n                torch.save(model.module if hasattr(model, \"module\") else model, PATH)\n\n    optimizer.zero_grad()\n\n    del model\n    torch.cuda.empty_cache()\n\n    break\n\n"
  },
  {
    "path": "code/bert-base-count5-len32/finetuning/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\nclass data_generator:\n    def __init__(self, data, config, shuffle=False):\n        self.data = data\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n        self.steps = len(self.data[0]) // self.batch_size\n        if len(self.data[0]) % self.batch_size != 0:\n            self.steps += 1\n\n    def __len__(self):\n        return self.steps\n\n    def __iter__(self):\n        q1, q2, y = self.data\n        idxs = list(range(len(self.data[0])))\n        if self.shuffle:\n            np.random.shuffle(idxs)\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        for index, i in enumerate(idxs):\n\n            text = q1[i]\n            text_pair = q2[i]\n            '''\n            # text = self.tokenizer(text, text_pair, padding='max_length', truncation=True, max_length=self.max_length)\n            text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n            input_ids.append(text['input_ids'])\n            segment_ids.append(text['token_type_ids'])\n            input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n            yield input_ids, input_masks, segment_ids, labels\n            input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n            '''\n            tkRes = self.tokenizer(text, text_pair, max_length=self.max_length, truncation='longest_first',\n                                   return_attention_mask=False)\n            input_id = tkRes['input_ids']\n            segment_id = tkRes['token_type_ids']\n            assert len(segment_id) == len(input_id)\n            input_ids.append(input_id)\n            segment_ids.append(segment_id)\n            labels.append(y[i])\n\n            if len(input_ids) == self.batch_size or i == idxs[-1]:\n                input_ids = paddingList(input_ids, 0, returnTensor=True)  # 动态padding\n                segment_ids = paddingList(segment_ids, 0, returnTensor=True)\n           
     input_masks = (input_ids != 0)\n                yield input_ids, input_masks, segment_ids, labels\n                input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=0.3, alpha=0.1, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\n\nclass FGM():\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.25, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# 支持多分类和二分类\nclass FocalLoss(nn.Module):\n    \"\"\"\n    This is a implementation of Focal Loss with smooth label cross entropy supported which is proposed in\n    'Focal Loss for Dense Object Detection. 
(https://arxiv.org/abs/1708.02002)'\n    Focal_Loss= -1*alpha*(1-pt)^gamma*log(pt)\n    :param num_class:\n    :param alpha: (tensor) 3D or 4D the scalar factor for this criterion\n    :param gamma: (float,double) gamma > 0 reduces the relative loss\n    for well-classified examples (p>0.5) putting more\n    focus on hard misclassified example\n    :param smooth: (float,double) smooth value when cross entropy\n    :param balance_index: (int) balance class index,\n    should be specific when alpha is float\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n    def __init__(self, num_class, alpha=None, gamma=2,\n                smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Not support alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        # N = input.size(0)\n        # alpha = torch.ones(N, self.num_class)\n        # alpha = alpha * (1 - self.alpha)\n        # alpha = alpha.scatter_(1, target.long(), self.alpha)\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        alpha = alpha[idx]\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true,y_pred):\n    acc = sum(y_pred & y_true) / (sum(y_pred))\n    rec = sum(y_pred & y_true) / (sum(y_true))\n\n    return 2 * acc * rec /(acc + rec)"
  },
  {
    "path": "code/build_vocab.py",
    "content": "from collections import Counter\ndef loadData(path):\n    allData=[]\n    with open(path,\"r\") as f:\n        for i in f:\n            i=i.strip().split('\\t')\n            if len(i)==0:#防止空行\n                break\n            if len(i)==3:#训练集\n                a,b,label=i\n            else:#测试集，直接转为id形式\n                a,b,label=i[0],i[1],-1\n            a,b=[int(i) for i in a.split()],[int(i) for i in b.split()]\n            allData.append([a,b])\n    return allData\n\nallData=loadData('/tcdata/gaiic_track3_round1_train_20210228.tsv')+loadData('/tcdata/gaiic_track3_round2_train_20210407.tsv')\ntest_data = loadData('/tcdata/gaiic_track3_round1_testA_20210228.tsv')+loadData('/tcdata/gaiic_track3_round1_testB_20210317.tsv')\n\nmodel_lists = [\"nezha-base-count3\", \"nezha-base-count5\", \"bert-base-count3\",\n               \"bert-base-count3-len100\", \"bert-base-count5\", \"bert-base-count5-len32\"]\nchildPath_lists=[\n    ['/pretrain/nezha_model/','/finetuning/models/'],\n    ['/pretrain/nezha_model/','/finetuning/models/'],\n    ['/pretrain/bert_model/','/finetuning/models/'],\n\n    ['/finetuning/models/'],\n    ['/pretrain/bert_model/','/finetuning/models/'],\n    ['/finetuning/models/'],\n           ]\ncounts=[3,5,3,3,5,5]\n\ntoken2count=Counter()\nfor i,j in allData+test_data:\n    token2count.update(i+j)\n\nfor modelPath,childPath,ct in zip(model_lists,childPath_lists,counts):\n    pre=['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]',]\n    tail=[]\n    for k,v in token2count.items():\n        if v>=ct:\n            tail.append(k)\n    tail.sort()\n    vocab=pre+tail\n    print(f\"模型{modelPath}，词频：{ct}，词表大小：{len(vocab)}\")\n    for ch in childPath:\n        with open(modelPath+ch+'vocab.txt', \"w\", encoding=\"utf-8\") as f:\n            for i in vocab:\n                f.write(str(i)+'\\n')\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "code/docker_build.sh",
    "content": "#!/bin/bash\nif [ -z $1 ]\nthen\n   echo \"s1:version\"\n   exit\nfi\n\nif [ -z $DOCKER_REGISTRY ]\nthen\n   echo \"not find $DOCKER_REGISTRY\"\n   exit\nfi\n\nVERSION=$1\n\n##1.create docker regsit\ndocker build -t $DOCKER_REGISTRY/tianchi-submit:$VERSION .\n\n##2.PUSH\ndocker push $DOCKER_REGISTRY/tianchi-submit:$VERSION \n\n##3.echo submit info\necho \"now you can go to tianchi.aliyun.com sunmit this docker url: $DOCKER_REGISTRY/tianchi-submit:$VERSION \n\n"
  },
  {
    "path": "code/main_fusion_thread.py",
    "content": "import logging\nimport traceback\nfrom flask import Flask, request\nfrom utils import *\nfrom queue import Queue\nimport threading\nopset_version = 11\n# 此处示例，需要根据模型类型重写\ndef init_model(model_path, export_model_path, optimized_model_path, length=32):\n    model = torch.load(model_path).to(torch.device(\"cuda\"))\n    model.eval()\n\n    if length == 32:\n        data = [[[2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 20, 3,\n                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0]],\n                [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0]],\n                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0]]]\n\n    else:\n        data = [[[2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20,\n                  3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130,\n                  5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16,\n                  2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16,\n                  36, 130, 5605, 458]],\n                [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,\n                  1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n                  1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ]],\n                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]]\n\n\n    inputs = {\n        'input_ids': torch.tensor(data[0]).to(config.device),\n        'input_masks': torch.tensor(data[1]).to(config.device),\n        'segment_ids': torch.tensor(data[2]).to(config.device)\n    }\n\n    if True or not os.path.exists(export_model_path):\n        with torch.no_grad():\n            symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\n            torch.onnx.export(model,  # model being run\n                              args=tuple(inputs.values()),  # model input (or a tuple for multiple inputs)\n                              f=export_model_path,  # where to save the model (can be a file or file-like object)\n                              opset_version=opset_version,  # the ONNX version to export the model to\n                              do_constant_folding=True,  # whether to execute constant folding for optimization\n                              input_names=['input_ids',  # the model's input names\n                                           'input_masks',\n                                           'segment_ids'],\n                              output_names=['predict'],  # the model's output names\n                              dynamic_axes={'input_ids': symbolic_names,  # variable length axes\n                                            'input_masks': symbolic_names,\n                                            'segment_ids': symbolic_names,\n                                            'predict': 
symbolic_names})\n            print(\"Model exported at \", export_model_path)\n\n    from onnxruntime_tools import optimizer\n    from onnxruntime_tools.transformers.onnx_model_bert import BertOptimizationOptions\n    opt_options = BertOptimizationOptions('bert')\n    opt_options.enable_embed_layer_norm = False\n\n    opt_model = optimizer.optimize_model(\n        export_model_path,\n        'bert',\n        num_heads=12,\n        hidden_size=768,\n        optimization_options=opt_options)\n    opt_model.save_model_to_file(optimized_model_path)\n\n    del model\n    torch.cuda.empty_cache()\n\n    import psutil\n    import onnxruntime\n\n    assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\n\n    sess_options = onnxruntime.SessionOptions()\n    sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)\n    session = onnxruntime.InferenceSession(optimized_model_path, sess_options)\n    ort_inputs = {\n        'input_ids': [[0]*32],\n        'input_masks': [[0]*32],\n        'segment_ids': [[0]*32]\n        }\n    session.run(None, ort_inputs)#预先启动一下\n    return session\n\ndef infer(session,config,inp:Queue,res:Queue):\n    data_gen = data_generator(config)\n    while True:\n        query_A, query_B=inp.get()#不断从自己队列中取\n        input_ids, input_masks, segment_ids = data_gen.generate((query_A, query_B))\n        ort_inputs = {\n        'input_ids': input_ids,\n        'input_masks': input_masks,\n        'segment_ids': segment_ids\n        }\n        y_pred = session.run(None, ort_inputs)\n        res.put(y_pred[0])#结果放入队列\n\ndef softmax(x, axis=1):\n    # 计算每行的最大值\n    row_max = x.max(axis=axis)\n\n    # 每行元素都需要减去对应的最大值，否则求exp(x)会溢出，导致inf情况\n    row_max = row_max.reshape(-1, 1)\n    x = x - row_max\n\n    # 计算e的指数次幂\n    x_exp = np.exp(x)\n    x_sum = np.sum(x_exp, axis=axis, keepdims=True)\n    s = x_exp / x_sum\n    return s\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"NEZHA\"\n        self.Stratification = False\n\n        self.model_path = 'model0/'\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 32\n        self.epoch = 5\n        self.learn_rate = 2e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 1\n        self.k_fold = 5\n        self.seed = 42\n\n        self.device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n# 允许使用类似Flask的别的服务方式\napp = Flask(__name__)\n\n@app.route(\"/tccapi\", methods=['GET', 'POST'])\ndef tccapi():\n    data = request.get_data()\n    if (data == b\"exit\"):\n        print(\"received exit command, exit now\")\n        os._exit(0)\n    input_list = request.form.getlist(\"input\")\n    index_list = request.form.getlist(\"index\")\n\n    response_batch = {}\n    response_batch[\"results\"] = []\n\n    for i in range(len(index_list)):\n        index_str = index_list[i]\n        response = {}\n        try:\n            input_sample = input_list[i].strip()\n            elems = input_sample.strip().split(\"\\t\")\n            query_A = elems[0].strip()\n            query_B = elems[1].strip()\n\n            for i in runningModelIds[:3]:#先只用上前三个模型\n                assert inps[i].qsize()==0\n                inps[i].put((query_A,query_B))#为子线程提供数据\n            predict_res=[]\n            for i in runningModelIds[:3]:\n                predict_res.append(res.get())#取3次\n            assert res.qsize()==0\n            y_pred = 
np.mean(predict_res, axis=0)\n            y_pred = softmax(np.array(y_pred))\n            if 0.15<float(y_pred[0][1])< 0.85:\n                assert inps[3].qsize()==0\n                inps[3].put((query_A,query_B))#为第四个子线程提供数据\n                predict_res.append(res.get())\n                assert res.qsize()==0\n                y_pred = np.mean(predict_res, axis=0)\n                y_pred = softmax(np.array(y_pred))\n\n\n            response[\"predict\"] = float(y_pred[0][1])\n            response[\"index\"] = index_str\n            response[\"ok\"] = True\n        except Exception as e:\n            response[\"predict\"] = 0\n            response[\"index\"] = index_str\n            response[\"ok\"] = False\n            traceback.print_exc()\n        response_batch[\"results\"].append(response)\n\n    return response_batch\n\n\n\nif __name__ == \"__main__\":\n    # 此处示例，需要根据模型类型重写加载部分\n    output_dir = \"./onnx\"\n    if not os.path.exists(output_dir):\n        os.makedirs(output_dir)\n\n    model_lists = [\"nezha-base-count3\", \"nezha-base-count5\", \"bert-base-count3\", \"bert-base-count5\"]\n    lens=[32,100,32,100]\n    configs=[]\n    sessions=[]\n    for path,length in zip(model_lists,lens):\n        config = Config()\n        export_model_path = os.path.join(output_dir, 'opset{}.onnx'.format(path))\n        optimized_model_path = os.path.join(output_dir, 'optimizer{}.onnx'.format(path))\n        config.model_path = './{}/finetuning/models/'.format(path)\n        config.MAX_LEN = length\n        session = init_model(config.model_path+\"bert_0.pth\", export_model_path, optimized_model_path)\n        sessions.append(session)\n        configs.append(config)\n\n\n    #多线程相关\n    inps=[Queue() for i in range(4)]#存子线程输入\n    res=Queue()#存结果\n    proLs=[]\n    runningModelIds=[0,1,2,3]#控制使用那几个模型，如果时间不够，可以在途中退出一个线程，除去一个模型\n\n    for i in runningModelIds:\n        proLs.append(threading.Thread(target=infer,args=(sessions[i],configs[i],inps[i],res)))\n        proLs[-1].daemon = True#子线程随着主线程结束\n        proLs[-1].start()#一次性全启动\n\n    log = logging.getLogger('werkzeug')#关闭冗长的http 200 log\n    log.disabled = True\n\n    app.run(host=\"127.0.0.1\", port=8080)\n\n"
  },
  {
    "path": "code/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        if self.isDropout:\n            output = self.dropout(pooler_output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][0]\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = hid_avg\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, 
self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = hid_avg\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.bert_model = MODELS[config.model](config=self.bert_config)\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\n\n"
  },
  {
    "path": "code/nezha-base-count3/finetuning/.ipynb_checkpoints/PyTorch_Bert-Squad_OnnxRuntime_GPU-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright (c) Microsoft Corporation. All rights reserved.  \\n\",\n    \"Licensed under the MIT License.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Inference PyTorch Bert Model with ONNX Runtime on GPU\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime and NVIDIA GPU. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.\\n\",\n    \"\\n\",\n    \"This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Prerequisites ##\\n\",\n    \"It requires your machine to have a GPU, and a python environment with [PyTorch](https://pytorch.org/) installed before running this notebook.\\n\",\n    \"\\n\",\n    \"#### GPU Environment Setup using AnaConda\\n\",\n    \"\\n\",\n    \"First, we install [AnaConda](https://www.anaconda.com/distribution/) in a target machine and open an AnaConda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 1.5.0 and OnnxRuntime 1.3.0.\\n\",\n    \"\\n\",\n    \"```console\\n\",\n    \"conda create -n gpu_env python=3.7\\n\",\n    \"conda activate gpu_env\\n\",\n    \"conda install pytorch torchvision cudatoolkit=10.1 -c pytorch\\n\",\n    \"conda install -c anaconda ipykernel\\n\",\n    \"conda install -c conda-forge ipywidgets\\n\",\n    \"python -m ipykernel install --user --name=gpu_env_py37\\n\",\n    \"jupyter notebook\\n\",\n    \"```\\n\",\n    \"Finally, launch Jupyter Notebook and you can choose gpu_env_py37 as kernel to run this notebook.\\n\",\n    \"\\n\",\n    \"Onnxruntime-gpu need specified version of CUDA and cuDNN. You can find the corresponding version in [requirements](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements). 
If the version is different from above cudatoolkit version, you have to install them separately, and add their bin directories to PATH environment variable (See [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARNING: Skipping onnxruntime-gpu as it is not installed.\\u001b[0m\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import sys\\n\",\n    \"!{sys.executable} -m pip uninstall --quiet --yes onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade transformers\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxconverter_common\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools\\n\",\n    \"!{sys.executable} -m pip install --quiet wget netron pandas\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Load Pretrained Bert model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We begin by downloading the SQuAD data file and store them in the specified location. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"cache_dir = \\\"./squad\\\"\\n\",\n    \"if not os.path.exists(cache_dir):\\n\",\n    \"    os.makedirs(cache_dir)\\n\",\n    \"\\n\",\n    \"predict_file_url = \\\"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\\\"\\n\",\n    \"predict_file = os.path.join(cache_dir, \\\"dev-v1.1.json\\\")\\n\",\n    \"if not os.path.exists(predict_file):\\n\",\n    \"    import wget\\n\",\n    \"    print(\\\"Start downloading predict file.\\\")\\n\",\n    \"    wget.download(predict_file_url, predict_file)\\n\",\n    \"    print(\\\"Predict file downloaded.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's first define some constant variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Whether allow overwriting existing ONNX model and download the latest script from GitHub\\n\",\n    \"enable_overwrite = True\\n\",\n    \"\\n\",\n    \"# Total samples to inference, so that we can get average latency\\n\",\n    \"total_samples = 1000\\n\",\n    \"\\n\",\n    \"# ONNX opset version\\n\",\n    \"opset_version=11\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Specify some model configuration variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# For fine-tuned large model, the model name is \\\"bert-large-uncased-whole-word-masking-finetuned-squad\\\". Here we use bert-base for demo.\\n\",\n    \"model_name_or_path = \\\"bert-base-cased\\\"\\n\",\n    \"max_seq_length = 128\\n\",\n    \"doc_stride = 128\\n\",\n    \"max_query_length = 64\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start to load model from pretrained. This step could take a few minutes. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 48/48 [00:04<00:00, 11.28it/s]\\n\",\n      \"convert squad examples to features: 100%|██████████| 1000/1000 [00:09<00:00, 102.15it/s]\\n\",\n      \"add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 161306.98it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# The following code is adapted from HuggingFace transformers\\n\",\n    \"# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\\n\",\n    \"\\n\",\n    \"from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"\\n\",\n    \"# Load pretrained model and tokenizer\\n\",\n    \"config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\\n\",\n    \"tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\\n\",\n    \"model = model_class.from_pretrained(model_name_or_path,\\n\",\n    \"                                    from_tf=False,\\n\",\n    \"                                    config=config,\\n\",\n    \"                                    cache_dir=cache_dir)\\n\",\n    \"# load some examples\\n\",\n    \"from transformers.data.processors.squad import SquadV1Processor\\n\",\n    \"\\n\",\n    \"processor = SquadV1Processor()\\n\",\n    \"examples = processor.get_dev_examples(None, filename=predict_file)\\n\",\n    \"\\n\",\n    \"from transformers import squad_convert_examples_to_features\\n\",\n    \"features, dataset = squad_convert_examples_to_features( \\n\",\n    \"            examples=examples[:total_samples], # convert enough examples for this notebook\\n\",\n    \"            tokenizer=tokenizer,\\n\",\n    \"            max_seq_length=max_seq_length,\\n\",\n    \"            doc_stride=doc_stride,\\n\",\n    \"            max_query_length=max_query_length,\\n\",\n    \"            is_training=False,\\n\",\n    \"            return_dataset='pt'\\n\",\n    \"        )\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. 
Export the loaded model ##\\n\",\n    \"Once the model is loaded, we can export the loaded PyTorch model to ONNX.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Model exported at  ./onnx/bert-base-cased-squad_opset11.onnx\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"output_dir = \\\"./onnx\\\"\\n\",\n    \"if not os.path.exists(output_dir):\\n\",\n    \"    os.makedirs(output_dir)   \\n\",\n    \"export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))\\n\",\n    \"\\n\",\n    \"import torch\\n\",\n    \"use_gpu = torch.cuda.is_available()\\n\",\n    \"device = torch.device(\\\"cuda\\\" if use_gpu else \\\"cpu\\\")\\n\",\n    \"\\n\",\n    \"# Get the first example data to run the model and export it to ONNX\\n\",\n    \"data = dataset[0]\\n\",\n    \"inputs = {\\n\",\n    \"    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"# Set model to inference mode, which is required before exporting the model because some operators behave differently in \\n\",\n    \"# inference and training mode.\\n\",\n    \"model.eval()\\n\",\n    \"model.to(device)\\n\",\n    \"\\n\",\n    \"if enable_overwrite or not os.path.exists(export_model_path):\\n\",\n    \"    with torch.no_grad():\\n\",\n    \"        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\\n\",\n    \"        torch.onnx.export(model,                                            # model being run\\n\",\n    \"                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\\n\",\n    \"                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\\n\",\n    \"                          opset_version=opset_version,                      # the ONNX version to export the model to\\n\",\n    \"                          do_constant_folding=True,                         # whether to execute constant folding for optimization\\n\",\n    \"                          input_names=['input_ids',                         # the model's input names\\n\",\n    \"                                       'input_mask', \\n\",\n    \"                                       'segment_ids'],\\n\",\n    \"                          output_names=['start', 'end'],                    # the model's output names\\n\",\n    \"                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\\n\",\n    \"                                        'input_mask' : symbolic_names,\\n\",\n    \"                                        'segment_ids' : symbolic_names,\\n\",\n    \"                                        'start' : symbolic_names,\\n\",\n    \"                                        'end' : symbolic_names})\\n\",\n    \"        print(\\\"Model exported at \\\", export_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. 
PyTorch Inference ##\\n\",\n    \"Use PyTorch to evaluate an example input for comparison purpose.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"PyTorch cuda Inference time = 16.57 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\\n\",\n    \"latency = []\\n\",\n    \"with torch.no_grad():\\n\",\n    \"    for i in range(total_samples):\\n\",\n    \"        data = dataset[i]\\n\",\n    \"        inputs = {\\n\",\n    \"            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"        }\\n\",\n    \"        start = time.time()\\n\",\n    \"        outputs = model(**inputs)\\n\",\n    \"        latency.append(time.time() - start)\\n\",\n    \"print(\\\"PyTorch {} Inference time = {} ms\\\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Inference ONNX Model with ONNX Runtime ##\\n\",\n    \"\\n\",\n    \"### CUDA and cuDNN Path\\n\",\n    \"onnxruntime-gpu has dependency on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn):\\n\",\n    \"\\n\",\n    \"* [onnxruntime-gpu v1.3.0](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"* [onnxruntime-gpu v1.2.0](https://github.com/microsoft/onnxruntime/releases/tag/v1.2.0) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"\\n\",\n    \"During installing PyTorch 1.5, we installed cudatoolkit 10.1.243 in this conda environment. That shall be good for onnxruntime-gpu 1.3.0 in Jupyter Notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Change to True when onnxruntime (like onnxruntime-gpu 1.0.0 ~ 1.1.2) cannot be imported.\\n\",\n    \"add_cuda_path = False\\n\",\n    \"\\n\",\n    \"if add_cuda_path:\\n\",\n    \"    # Add path of CUDA 10.0 and CUDNN 7.6 for onnxruntime-gpu 1.0.0 ~ 1.1.2\\n\",\n    \"    cuda_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    cudnn_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):\\n\",\n    \"        raise ValueError(\\\"Please specify correct path for CUDA and cuDNN. Otherwise onnxruntime cannot be imported.\\\")\\n\",\n    \"    else:\\n\",\n    \"        if cuda_dir == cudnn_dir:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + os.environ[\\\"PATH\\\"]\\n\",\n    \"        else:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + cudnn_dir + ';' + os.environ[\\\"PATH\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### OpenMP Environment Variable\\n\",\n    \"\\n\",\n    \"OpenMP environment variables are optional for GPU inference of standard Bert model. It has little performance impact on Bert model since most nodes are executed in GPU. 
\\n\",\n    \"\\n\",\n    \"You can find the best setting based on [Performance Test Tool](#Performance-Test-Tool) result in later part of this notebook.\\n\",\n    \"\\n\",\n    \"**Attention: Setting environment variables shall be done before importing onnxruntime**. Otherwise, they might not take effect.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Optional. You can change them according to Performance Test Tool result.\\n\",\n    \"#os.environ[\\\"OMP_NUM_THREADS\\\"] = '1'\\n\",\n    \"#os.environ[\\\"OMP_WAIT_POLICY\\\"] = 'PASSIVE'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are ready to inference the model with ONNX Runtime.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"OnnxRuntime gpu Inference time = 4.43 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import psutil\\n\",\n    \"import onnxruntime\\n\",\n    \"import numpy\\n\",\n    \"\\n\",\n    \"assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\\n\",\n    \"device_name = 'gpu'\\n\",\n    \"\\n\",\n    \"sess_options = onnxruntime.SessionOptions()\\n\",\n    \"\\n\",\n    \"# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\\n\",\n    \"# Note that this will increase session creation time so enable it for debugging only.\\n\",\n    \"sess_options.optimized_model_filepath = os.path.join(output_dir, \\\"optimized_model_{}.onnx\\\".format(device_name))\\n\",\n    \"\\n\",\n    \"# Please change the value according to best setting in Performance Test Tool result.\\n\",\n    \"sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)\\n\",\n    \"\\n\",\n    \"session = onnxruntime.InferenceSession(export_model_path, sess_options)\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # TODO: use IO Binding (see https://github.com/microsoft/onnxruntime/pull/4206) to improve performance.\\n\",\n    \"    ort_inputs = {\\n\",\n    \"        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    ort_outputs = session.run(None, ort_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"    \\n\",\n    \"print(\\\"OnnxRuntime {} Inference time = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare the output of PyTorch and ONNX Runtime. We can see some results are not close. It is because ONNX Runtime uses some approximation in CUDA optimization. 
Based on our evaluation on SQuAD data set, F1 score is on par for models before and after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Verifying correctness *****\\n\",\n      \"PyTorch and ONNX Runtime output 0 are close: True\\n\",\n      \"maximum_diff=9.499490261077881e-07 average_diff=1.4225952327251434e-07\\n\",\n      \"PyTorch and ONNX Runtime output 1 are close: True\\n\",\n      \"maximum_diff=6.92903995513916e-07 average_diff=1.2441887520253658e-07\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Verifying correctness *****\\\")\\n\",\n    \"for i in range(2):    \\n\",\n    \"    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))\\n\",\n    \"    diff = ort_outputs[i] - outputs[i].cpu().numpy()\\n\",\n    \"    max_diff = numpy.max(numpy.abs(diff))\\n\",\n    \"    avg_diff = numpy.average(numpy.abs(diff))\\n\",\n    \"    print(f'maximum_diff={max_diff} average_diff={avg_diff}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Inference with Actual Sequence Length\\n\",\n    \"Note that ONNX model is exported using dynamic length axis. It is recommended to use actual sequence input without padding instead of fixed length input for best performance. Let's see how it can be applied to this model.\\n\",\n    \"\\n\",\n    \"From an example input below, we can see zero padding at the end of each sequence.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'input_ids': tensor([[  101,  1293,  1242,  2557,  1127,  1226,  1104,  1103,  3613, 16429,\\n\",\n       \"           5235,   136,   102,  3613, 16429,  5988,   170,   107,  1353,  1671,\\n\",\n       \"           1992,  1342,   107,  5235,   117,  1107,  1134,  1473,  3683,  3538,\\n\",\n       \"           1125,   170,  1476,   118,  1248,  2595,  4086,  1714,  1104,  2965,\\n\",\n       \"          15897,  1104,  3613, 16429,   119,  1473,  3683,  3538,  3222,  1149,\\n\",\n       \"           2551,  1168, 23759,  1116,  1121,  1506,  1103, 10280,  2231,  1111,\\n\",\n       \"           1103,  1714, 16355,   119,   102,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0]],\\n\",\n       \"        device='cuda:0'),\\n\",\n       \" 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),\\n\",\n       \" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# An example input (we can see padding). From attention_mask, we can deduce the actual length.\\n\",\n    \"inputs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The original sequence length is 128. After removing paddings, the sequence length is reduced. Input with smaller sequence length need less computation, thus we can see there is improvement on inference latency. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Average length 101\\n\",\n      \"OnnxRuntime gpu Inference time with actual sequence length = 4.23 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import statistics\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"lengths = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.\\n\",\n    \"    actual_sequence_length = sum(data[1].numpy())\\n\",\n    \"    lengths.append(actual_sequence_length)\\n\",\n    \"    opt_inputs = {\\n\",\n    \"        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    opt_outputs = session.run(None, opt_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"print(\\\"Average length\\\", statistics.mean(lengths))\\n\",\n    \"print(\\\"OnnxRuntime {} Inference time with actual sequence length = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's compare the output and see whether the results are close.\\n\",\n    \"\\n\",\n    \"**Note**: Need end-to-end evaluation on performance and accuracy if you use this strategy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Comparing results with/without paddings *****\\n\",\n      \"Output 0 are close: True\\n\",\n      
\"Output 1 are close: True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Comparing results with/without paddings *****\\\")\\n\",\n    \"for i in range(2):\\n\",\n    \"    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Offline Optimization and Test Tools\\n\",\n    \"\\n\",\n    \"It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Transformer Optimizer\\n\",\n    \"\\n\",\n    \"Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\\n\",\n    \"* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \\n\",\n    \"* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\\n\",\n    \"* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\\n\",\n    \"\\n\",\n    \"We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\\n\",\n    \"\\n\",\n    \"In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\\n\",\n    \"\\n\",\n    \"It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\\n\",\n    \"\\n\",\n    \"Example Usage:\\n\",\n    \"```\\n\",\n    \"from onnxruntime_tools import optimizer\\n\",\n    \"optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\\n\",\n    \"optimized_model.save_model_to_file(optimized_model_path)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You can also use optimizer_cli like the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Float32 Model\\n\",\n    \"Let us optimize the ONNX model using the script. The first example will output model with float32 to store weights. 
This is the choice for most GPUs without Tensor Core.\\n\",\n    \"\\n\",\n    \"If your GPU (like V100 or T4) has Tensor Core, jump to [Float16 Model](#6.-Model-Optimization-with-Float16) section since that will give you better performance than Float32 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp32_model_path\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Optimized Graph\\n\",\n    \"We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.\\n\",\n    \"\\n\",\n    \"The graph is like the following:\\n\",\n    \"<img src='images/optimized_bert_gpu.png'>\\n\",\n    \"\\n\",\n    \"Sometime, optimized graph is slightly different. 
For example, FastGelu is replaced by BiasGelu for CPU inference; When the option --input_int32 is used, Cast nodes for inputs are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import netron\\n\",\n    \"\\n\",\n    \"# change it to True if want to view the optimized model in browser\\n\",\n    \"enable_netron = False\\n\",\n    \"if enable_netron:\\n\",\n    \"    # If you encounter error \\\"access a socket in a way forbidden by its access permissions\\\", install Netron as standalone application instead.\\n\",\n    \"    netron.start(optimized_fp32_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Performance Test Tool\\n\",\n    \"\\n\",\n    \"The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.\\n\",\n    \"\\n\",\n    \"Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.\\n\",\n    \"\\n\",\n    \"**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Attional Info](#7.-Additional-Info) for more info.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.92 ms, Throughput = 203.24 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.90 ms, Throughput = 203.88 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 5.07 ms, Throughput = 197.16 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.82 ms, Throughput = 207.33 QPS\\n\",\n      \"skip duplicated test: 
model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.93 ms, Throughput = 202.92 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.91 ms, Throughput = 203.55 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.88 ms, Throughput = 204.90 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's load the summary file and take a look. 
Note that blank value in OMP_NUM_THREADS or OMP_WAIT_POLICY means the environment variable does not exist.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.82</td>\\n\",\n       \"      <td>4.53</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>5.15</td>\\n\",\n       \"      <td>7.25</td>\\n\",\n       \"      <td>8.75</td>\\n\",\n       \"      <td>207.33</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.88</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.58</td>\\n\",\n       \"      <td>6.47</td>\\n\",\n       \"      <td>7.13</td>\\n\",\n       \"      <td>8.68</td>\\n\",\n       \"      <td>204.90</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.90</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>6.16</td>\\n\",\n       \"      <td>7.64</td>\\n\",\n       \"      <td>8.82</td>\\n\",\n       \"      <td>203.88</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>4.91</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.70</td>\\n\",\n       \"      <td>7.43</td>\\n\",\n       \"      <td>8.78</td>\\n\",\n      
 \"      <td>203.55</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4.92</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>4.60</td>\\n\",\n       \"      <td>6.50</td>\\n\",\n       \"      <td>7.82</td>\\n\",\n       \"      <td>8.90</td>\\n\",\n       \"      <td>203.24</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>4.93</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.57</td>\\n\",\n       \"      <td>8.80</td>\\n\",\n       \"      <td>202.92</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>5.07</td>\\n\",\n       \"      <td>4.56</td>\\n\",\n       \"      <td>4.61</td>\\n\",\n       \"      <td>7.19</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>9.01</td>\\n\",\n       \"      <td>197.16</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         4.82         4.53         4.57         5.15         7.25   \\n\",\n       \"1         4.88         4.54         4.58         6.47         7.13   \\n\",\n       \"2         4.90         4.54         4.57         6.16         7.64   \\n\",\n       \"3         4.91         4.55         4.59         6.70         7.43   \\n\",\n       \"4         4.92         4.57         4.60         6.50         7.82   \\n\",\n       \"5         4.93         4.55         4.59         6.66         7.57   \\n\",\n       \"6         5.07         4.56         4.61         7.19         8.11   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         8.75           207.33                     1              12   \\n\",\n       \"1         8.68           204.90                    12              12   \\n\",\n       \"2         8.82           203.88                     1              12   \\n\",\n       \"3         8.78           203.55                    12              12   \\n\",\n       \"4         8.90           203.24                     0                   \\n\",\n       \"5         8.80           202.92                    12               1   \\n\",\n       \"6         9.01           197.16                    12               1   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1         PASSIVE       None    True  \\n\",\n       \"2         PASSIVE       None    True  \\n\",\n       \"3     
     ACTIVE       None    True  \\n\",\n       \"4                       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6          ACTIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"From above result, we can see that latency is very close for different settings. The default setting (intra_op_num_threads=0, OMP_NUM_THREADS and OMP_WAIT_POLICY does not exist) performs the best. \\n\",\n    \"\\n\",\n    \"### Model Results Comparison Tool\\n\",\n    \"\\n\",\n    \"When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare the inference outputs of the original and optimized models. If outputs are all close, it is safe to use the optimized model.\\n\",\n    \"\\n\",\n    \"For GPU inference, the absolute or relative difference is larger than those numbers of CPU inference. Note that slight difference in output will not impact final result. We did end-to-end evaluation using SQuAD data set using a fine-tuned squad model, and F1 score is almost the same before/after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).\\r\\n\",\n      \"maximum absolute difference=1.9222497940063477e-06\\r\\n\",\n      \"maximum relative difference=0.05027933046221733\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!python -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Model Optimization with Float16\\n\",\n    \"\\n\",\n    \"The optimizer.py script have an option **--float16** to convert model to use float16 to store weights. After the conversion, it could be faster to run in GPU with tensor cores like V100 or T4.\\n\",\n    \"\\n\",\n    \"Let's run tools to measure the performance on V100. 
The results show significant performance improvement: latency is about 3.4 ms for float32 model, and 1.8 ms for float16 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp16_model_path --float16\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.90 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.12 ms, Throughput = 320.00 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.02 ms, Throughput = 
331.39 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 332.53 QPS\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 328.67 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.72 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 329.32 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      
<th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>5.08</td>\\n\",\n       \"      <td>7.16</td>\\n\",\n       \"      <td>332.53</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.88</td>\\n\",\n       \"      <td>4.52</td>\\n\",\n       \"      <td>7.05</td>\\n\",\n       \"      <td>331.90</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.78</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>5.01</td>\\n\",\n       \"      <td>7.02</td>\\n\",\n       \"      <td>331.72</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3.02</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.85</td>\\n\",\n       \"      <td>6.34</td>\\n\",\n       \"      <td>7.04</td>\\n\",\n       \"      <td>331.39</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.93</td>\\n\",\n       \"      <td>5.56</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>329.32</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>6.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>328.67</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>6</th>\\n\",\n       \"      <td>3.12</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.96</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.20</td>\\n\",\n       \"      <td>320.00</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.01         2.79         2.81         2.86         5.08   \\n\",\n       \"1         3.01         2.80         2.81         2.88         4.52   \\n\",\n       \"2         3.01         2.78         2.80         2.92         5.01   \\n\",\n       \"3         3.02         2.79         2.80         2.85         6.34   \\n\",\n       \"4         3.04         2.80         2.82         2.93         5.56   \\n\",\n       \"5         3.04         2.79         2.81         2.92         6.37   \\n\",\n       \"6         3.12         2.79         2.82         2.96         6.66   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         7.16           332.53                     1              12   \\n\",\n       \"1         7.05           331.90                     0                   \\n\",\n       \"2         7.02           331.72                    12              12   \\n\",\n       \"3         7.04           331.39                    12               1   \\n\",\n       \"4         7.08           329.32                    12              12   \\n\",\n       \"5         7.08           328.67                    12               1   \\n\",\n       \"6         7.20           320.00                     1              12   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1                       None    True  \\n\",\n       \"2          ACTIVE       None    True  \\n\",\n       \"3          ACTIVE       None    True  \\n\",\n       \"4         PASSIVE       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6         PASSIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Throughput Tuning\\n\",\n    \"\\n\",\n    \"Some application need best throughput under some constraint on latency. 
This can be done by testing performance of different batch sizes. The tool could help on this.\\n\",\n    \"\\n\",\n    \"Here is an example that check the performance of multiple batch sizes (1, 2, 4, 8, 16, 32 and 64) using default settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=32, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=32 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 16.17 ms, Throughput = 1979.41 QPS\\n\",\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.00 ms, Throughput = 333.83 QPS\\n\",\n      \"test setting TestSetting(batch_size=2, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=2 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.59 ms, Throughput = 557.32 QPS\\n\",\n      \"test setting TestSetting(batch_size=64, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=64 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=64,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 29.26 ms, Throughput = 2187.15 QPS\\n\",\n      \"test setting TestSetting(batch_size=4, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      
\"Generating 1000 samples for batch_size=4 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=4,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.32 ms, Throughput = 926.92 QPS\\n\",\n      \"test setting TestSetting(batch_size=8, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=8 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=8,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 6.32 ms, Throughput = 1266.63 QPS\\n\",\n      \"test setting TestSetting(batch_size=16, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=16 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=16,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 9.60 ms, Throughput = 1666.05 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 64 --sequence_length 128 --samples 1000 --test_times 1 --inclusive $THREAD_SETTING $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float16 model summary from ./onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      
<th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>batch_size</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.00</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>4.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>333.83</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.59</td>\\n\",\n       \"      <td>3.33</td>\\n\",\n       \"      <td>3.35</td>\\n\",\n       \"      <td>3.42</td>\\n\",\n       \"      <td>6.60</td>\\n\",\n       \"      <td>7.54</td>\\n\",\n       \"      <td>557.32</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.32</td>\\n\",\n       \"      <td>3.98</td>\\n\",\n       \"      <td>4.01</td>\\n\",\n       \"      <td>4.64</td>\\n\",\n       \"      <td>7.23</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>926.92</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>6.32</td>\\n\",\n       \"      <td>5.94</td>\\n\",\n       \"      <td>5.97</td>\\n\",\n       \"      <td>7.61</td>\\n\",\n       \"      <td>8.96</td>\\n\",\n       \"      <td>10.12</td>\\n\",\n       \"      <td>1266.63</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>9.60</td>\\n\",\n       \"      <td>9.22</td>\\n\",\n       \"      <td>9.25</td>\\n\",\n       \"      <td>11.32</td>\\n\",\n       \"      <td>12.33</td>\\n\",\n       \"      <td>13.34</td>\\n\",\n       \"      <td>1666.05</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>16.17</td>\\n\",\n       \"      <td>15.80</td>\\n\",\n       \"      <td>15.90</td>\\n\",\n       \"      <td>17.38</td>\\n\",\n       \"      <td>18.80</td>\\n\",\n       \"      <td>19.93</td>\\n\",\n       \"      <td>1979.41</td>\\n\",\n       \"      <td>32</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>29.26</td>\\n\",\n       \"      <td>28.89</td>\\n\",\n       \"      <td>29.01</td>\\n\",\n       \"      <td>30.63</td>\\n\",\n       \"      <td>32.53</td>\\n\",\n       \"      <td>33.28</td>\\n\",\n       \"      <td>2187.15</td>\\n\",\n       \"      <td>64</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.00         2.79         2.81         2.86         4.37   \\n\",\n       \"1         3.59         3.33         3.35         3.42         6.60   \\n\",\n       \"2         4.32         3.98         4.01         4.64         7.23   \\n\",\n       \"3         6.32         5.94         5.97         7.61         8.96   \\n\",\n       \"4         9.60         9.22         9.25        11.32        
12.33   \\n\",\n       \"5        16.17        15.80        15.90        17.38        18.80   \\n\",\n       \"6        29.26        28.89        29.01        30.63        32.53   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  batch_size  \\n\",\n       \"0         7.08           333.83           1  \\n\",\n       \"1         7.54           557.32           2  \\n\",\n       \"2         8.11           926.92           4  \\n\",\n       \"3        10.12          1266.63           8  \\n\",\n       \"4        13.34          1666.05          16  \\n\",\n       \"5        19.93          1979.41          32  \\n\",\n       \"6        33.28          2187.15          64  \"\n      ]\n     },\n     \"execution_count\": 26,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float16 model summary from\\\", latest_result_file)\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'test_cases', 'test_times', 'use_gpu', 'warmup', 'sequence_length']\\n\",\n    \"columns_to_remove.extend(['intra_op_num_threads', 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', 'contiguous'])\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Additional Info\\n\",\n    \"\\n\",\n    \"Note that running Jupyter Notebook has significant impact on performance result. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate performance numbers.\\n\",\n    \"\\n\",\n    \"We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it measure inference speed of OnnxRuntime.\\n\",\n    \"\\n\",\n    \"[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\\n\",\n    \"\\n\",\n    \"Here is the machine configuration that generated the above results. 
You might get slower or faster result according to your hardware.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\r\\n\",\n      \"  \\\"gpu\\\": {\\r\\n\",\n      \"    \\\"driver_version\\\": \\\"440.64.00\\\",\\r\\n\",\n      \"    \\\"devices\\\": [\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 14110883840,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      },\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 16932601856,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      }\\r\\n\",\n      \"    ]\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"cpu\\\": {\\r\\n\",\n      \"    \\\"brand\\\": \\\"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\\\",\\r\\n\",\n      \"    \\\"cores\\\": 12,\\r\\n\",\n      \"    \\\"logical_cores\\\": 12,\\r\\n\",\n      \"    \\\"hz\\\": \\\"2.5940 GHz\\\",\\r\\n\",\n      \"    \\\"l2_cache\\\": \\\"256 KB\\\",\\r\\n\",\n      \"    \\\"l3_cache\\\": \\\"35840 KB\\\",\\r\\n\",\n      \"    \\\"processor\\\": \\\"x86_64\\\"\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"memory\\\": {\\r\\n\",\n      \"    \\\"total\\\": 236645588992,\\r\\n\",\n      \"    \\\"available\\\": 222567559168\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"python\\\": \\\"3.7.7.final.0 (64 bit)\\\",\\r\\n\",\n      \"  \\\"os\\\": \\\"Linux-4.15.0-1089-azure-x86_64-with-debian-stretch-sid\\\",\\r\\n\",\n      \"  \\\"onnxruntime\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.3.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"pytorch\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.5.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"tensorflow\\\": null\\r\\n\",\n      \"}\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"PyCharm (ccks_ner-master)\",\n   \"language\": \"python\",\n   \"name\": \"pycharm-de4c0941\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
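Before relying on the FP16 model, it can be worth sanity-checking it the same way compare_bert_results was used above for the FP32 model. The sketch below loads the optimized FP32 and FP16 models written by the optimizer in this notebook, feeds both the same fake inputs, and reports how far the outputs drift apart; the 0/1 random inputs, the (1, 128) shape and the 1e-2 tolerances are illustrative choices rather than the tool's exact behaviour.

```python
# Sketch only: compare the optimized FP32 and FP16 models on identical fake inputs,
# similar in spirit to the compare_bert_results tool used earlier in this notebook.
import numpy as np
import onnxruntime

fp32_path = "./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx"
fp16_path = "./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx"

# Recent onnxruntime builds want providers listed explicitly; older ones also accept
# InferenceSession(path) with no providers argument.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
sess_fp32 = onnxruntime.InferenceSession(fp32_path, providers=providers)
sess_fp16 = onnxruntime.InferenceSession(fp16_path, providers=providers)

# The exported BERT model takes int64 id/mask/segment tensors; 0/1 values keep the
# mask and segment inputs valid, and realistic token ids are not needed for a diff check.
rng = np.random.default_rng(seed=3)
feed = {
    inp.name: rng.integers(0, 2, size=(1, 128), dtype=np.int64)
    for inp in sess_fp32.get_inputs()
}

for out_fp32, out_fp16 in zip(sess_fp32.run(None, feed), sess_fp16.run(None, feed)):
    out_fp16 = out_fp16.astype(np.float32)
    print("max abs diff:", np.abs(out_fp32 - out_fp16).max(),
          "| allclose(rtol=1e-2, atol=1e-2):",
          np.allclose(out_fp32, out_fp16, rtol=1e-2, atol=1e-2))
```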
  },
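For completeness, here is a minimal, hand-rolled counterpart to the bert_perf_test batch-size sweep from the Throughput Tuning section: it times the optimized FP16 model directly through the onnxruntime Python API and prints average latency and QPS for each batch size. The model path and sequence length 128 follow the notebook; the warm-up and timed-run counts are arbitrary, and as section 7 notes, numbers measured from inside Jupyter are less reliable than a console run.

```python
# Sketch only: a hand-rolled batch-size sweep approximating bert_perf_test above.
# Run it from a console rather than Jupyter for trustworthy numbers (see section 7).
import time
import numpy as np
import onnxruntime

model_path = "./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx"
session = onnxruntime.InferenceSession(
    model_path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

sequence_length = 128
warmup_runs, timed_runs = 10, 100  # arbitrary counts for this sketch

for batch_size in [1, 2, 4, 8, 16, 32, 64]:
    feed = {
        inp.name: np.random.randint(0, 2, size=(batch_size, sequence_length), dtype=np.int64)
        for inp in session.get_inputs()
    }
    for _ in range(warmup_runs):
        session.run(None, feed)
    start = time.perf_counter()
    for _ in range(timed_runs):
        session.run(None, feed)
    latency = (time.perf_counter() - start) / timed_runs
    print(f"batch_size={batch_size:2d}  latency={latency * 1000:7.2f} ms  "
          f"throughput={batch_size / latency:8.1f} QPS")
```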
  {
    "path": "code/nezha-base-count3/finetuning/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
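Config.py only defines lookup tables from a model name to its backbone, tokenizer and config classes. As a rough illustration of how such tables are typically consumed (the actual training script may wire them together differently), here is a small sketch; the checkpoint directory and the query strings are hypothetical placeholders.

```python
# Hypothetical usage of the lookup tables above: pick a model_type key and build the
# matching config / tokenizer / backbone. The checkpoint directory and the query pair
# are placeholders, not paths or data from this repository.
import torch
from Config import MODELS, TOKENIZERS, CONFIGS

model_type = "BertForClass"                    # any key present in all three dicts
checkpoint_dir = "./pretrain_model/bert-base"  # hypothetical pretrained-model directory

config = CONFIGS[model_type].from_pretrained(checkpoint_dir)
tokenizer = TOKENIZERS[model_type].from_pretrained(checkpoint_dir)
backbone = MODELS[model_type].from_pretrained(checkpoint_dir, config=config)

# Encode a query pair as [CLS] s1 [SEP] s2 [SEP] and run the backbone once.
encoded = tokenizer("placeholder query one", "placeholder query two", return_tensors="pt")
with torch.no_grad():
    outputs = backbone(**encoded)
print(outputs[0].shape)  # (1, seq_len, hidden_size) for a BertModel backbone
```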
  {
    "path": "code/nezha-base-count3/finetuning/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
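To make the configuration concrete, the sketch below instantiates NeZhaConfig with typical NEZHA-base sizes and builds a model from it. The hyperparameter values and vocabulary size are illustrative rather than this repository's actual pre-training settings, and it assumes NeZhaModel follows the usual `Model(config)` constructor, as its use in Config.py suggests.

```python
# Illustrative only: typical NEZHA-base sizes, not necessarily the values used in this repo.
from NEZHA.configuration_nezha import NeZhaConfig
from NEZHA.modeling_nezha import NeZhaModel

config = NeZhaConfig(
    vocab_size=21128,            # placeholder vocabulary size for this sketch
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    use_relative_position=True,  # NEZHA replaces absolute position embeddings with relative ones
    max_relative_position=64,    # clipping distance used by relative_position_encoding() in modeling_nezha.py
)
model = NeZhaModel(config)       # assumes the standard PreTrainedModel-style constructor
print(sum(p.numel() for p in model.parameters()))  # on the order of 1e8 parameters at this size
```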
  {
    "path": "code/nezha-base-count3/finetuning/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.utils.checkpoint\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom transformers.activations import ACT2FN\nfrom transformers.file_utils import (\n    ModelOutput,\n    add_code_sample_docstrings,\n    add_start_docstrings,\n    add_start_docstrings_to_model_forward,\n    replace_return_docstrings,\n)\nfrom transformers.modeling_outputs import (\n    BaseModelOutputWithPastAndCrossAttentions,\n    BaseModelOutputWithPoolingAndCrossAttentions,\n    CausalLMOutputWithCrossAttentions,\n    MaskedLMOutput,\n    MultipleChoiceModelOutput,\n    NextSentencePredictorOutput,\n    QuestionAnsweringModelOutput,\n    SequenceClassifierOutput,\n    TokenClassifierOutput,\n)\nfrom transformers.modeling_utils import (\n    PreTrainedModel,\n    apply_chunking_to_forward,\n    find_pruneable_heads_and_indices,\n    prune_linear_layer,\n)\n\nfrom transformers.models.bert.configuration_bert import BertConfig\n\nimport logging\nlogger = logging.getLogger(__name__)\n\n_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n_CONFIG_FOR_DOC = \"BertConfig\"\n_TOKENIZER_FOR_DOC = \"BertTokenizer\"\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, 
scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=64):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    
my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.position_embedding_type = getattr(config, \"position_embedding_type\", \"absolute\")\n        if self.position_embedding_type == \"relative_key\" or self.position_embedding_type == \"relative_key_query\":\n            self.max_position_embeddings = config.max_position_embeddings\n            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)\n\n        self.is_decoder = config.is_decoder\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        is_cross_attention = encoder_hidden_states is not None\n\n        if is_cross_attention and past_key_value is not None:\n            # reuse k,v, cross_attentions\n            key_layer = past_key_value[0]\n            value_layer = past_key_value[1]\n            attention_mask = encoder_attention_mask\n        elif is_cross_attention:\n            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))\n            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))\n            attention_mask = encoder_attention_mask\n        elif past_key_value is not None:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)\n            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)\n        else:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n\n        if self.is_decoder:\n            # if 
cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.\n            # Further calls to cross_attention layer can then reuse all cross-attention\n            # key/value_states (first \"if\" case)\n            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of\n            # all previous decoder key/value_states. Further calls to uni-directional self-attention\n            # can concat previous decoder key/value_states to current projected key/value_states (third \"elif\" case)\n            # if encoder bi-directional self-attention `past_key_value` is always `None`\n            past_key_value = (key_layer, value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_kv.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in NeZhaModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_kv)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)\n\n        if 
self.is_decoder:\n            outputs = outputs + (past_key_value,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        heads, index = find_pruneable_heads_and_indices(\n            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads\n        )\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        self_outputs = self.self(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            encoder_hidden_states,\n            encoder_attention_mask,\n            past_key_value,\n            output_attentions,\n            relations_kv=relations_kv\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = 
self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        self.add_cross_attention = config.add_cross_attention\n        if self.add_cross_attention:\n            assert self.is_decoder, f\"{self} should be used as a decoder model if cross attention is added\"\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2\n        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None\n        self_attention_outputs = self.attention(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            output_attentions=output_attentions,\n            past_key_value=self_attn_past_key_value,\n            relations_kv=relations_kv\n        )\n        attention_output = self_attention_outputs[0]\n\n        # if decoder, the last output is tuple of self-attn cache\n        if self.is_decoder:\n            outputs = self_attention_outputs[1:-1]\n            present_key_value = self_attention_outputs[-1]\n        else:\n            outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        cross_attn_present_key_value = None\n        if self.is_decoder and encoder_hidden_states is not None:\n            assert hasattr(\n                self, \"crossattention\"\n            ), f\"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`\"\n\n            # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple\n            cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None\n            cross_attention_outputs = self.crossattention(\n                attention_output,\n                attention_mask,\n                head_mask,\n                encoder_hidden_states,\n                encoder_attention_mask,\n                cross_attn_past_key_value,\n                output_attentions,\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights\n\n            # add cross-attn cache to positions 3,4 of present_key_value tuple\n            cross_attn_present_key_value = cross_attention_outputs[-1]\n            present_key_value = present_key_value + cross_attn_present_key_value\n\n        layer_output = apply_chunking_to_forward(\n            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output\n        )\n        outputs = (layer_output,) + outputs\n\n        # if decoder, return the attn key/values as the last output\n        if self.is_decoder:\n            outputs = 
outputs + (present_key_value,)\n\n        return outputs\n\n    def feed_forward_chunk(self, attention_output):\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        return layer_output\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=int(config.hidden_size / config.num_attention_heads),\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=False,\n        output_hidden_states=False,\n        return_dict=False,\n    ):\n        to_seq_length=hidden_states.shape[1]\n        relations_kv = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        all_hidden_states = () if output_hidden_states else None\n        all_self_attentions = () if output_attentions else None\n        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None\n\n        next_decoder_cache = () if use_cache else None\n        for i, layer_module in enumerate(self.layer):\n            if output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_head_mask = head_mask[i] if head_mask is not None else None\n            past_key_value = past_key_values[i] if past_key_values is not None else None\n\n            if getattr(self.config, \"gradient_checkpointing\", False) and self.training:\n\n                if use_cache:\n                    logger.warn(\n                        \"`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. 
Setting \"\n                        \"`use_cache=False`...\"\n                    )\n                    use_cache = False\n\n                def create_custom_forward(module):\n                    def custom_forward(*inputs):\n                        return module(*inputs, past_key_value, output_attentions)\n\n                    return custom_forward\n\n                layer_outputs = torch.utils.checkpoint.checkpoint(\n                    create_custom_forward(layer_module),\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                )\n            else:\n                layer_outputs = layer_module(\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                    past_key_value,\n                    output_attentions,relations_kv=relations_kv\n                )\n\n            hidden_states = layer_outputs[0]\n            if use_cache:\n                next_decoder_cache += (layer_outputs[-1],)\n            if output_attentions:\n                all_self_attentions = all_self_attentions + (layer_outputs[1],)\n                if self.config.add_cross_attention:\n                    all_cross_attentions = all_cross_attentions + (layer_outputs[2],)\n\n        if output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        if not return_dict:\n            return tuple(\n                v\n                for v in [\n                    hidden_states,\n                    next_decoder_cache,\n                    all_hidden_states,\n                    all_self_attentions,\n                    all_cross_attentions,\n                ]\n                if v is not None\n            )\n        return BaseModelOutputWithPastAndCrossAttentions(\n            last_hidden_state=hidden_states,\n            past_key_values=next_decoder_cache,\n            hidden_states=all_hidden_states,\n            attentions=all_self_attentions,\n            cross_attentions=all_cross_attentions,\n        )\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n  
  def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\n@dataclass\nclass BertForPreTrainingOutput(ModelOutput):\n    \"\"\"\n    Output type of :class:`~transformers.BertForPreTraining`.\n\n    Args:\n        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction\n            (classification) loss.\n        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language 
modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation\n            before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,\n            sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: Optional[torch.FloatTensor] = None\n    prediction_logits: torch.FloatTensor = None\n    seq_relationship_logits: torch.FloatTensor = None\n    hidden_states: Optional[Tuple[torch.FloatTensor]] = None\n    attentions: Optional[Tuple[torch.FloatTensor]] = None\n\n\nBERT_START_DOCSTRING = r\"\"\"\n\n    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic\n    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,\n    pruning heads etc.)\n\n    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__\n    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to\n    general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the\n            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model\n            weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`~transformers.BertTokenizer`. See\n            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for\n            details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):\n            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Segment token indices to indicate first and second portions of the inputs. 
Indices are selected in ``[0,\n            1]``:\n\n            - 0 corresponds to a `sentence A` token,\n            - 1 corresponds to a `sentence B` token.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,\n            config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):\n            Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:\n\n            - 1 indicates the head is **not masked**,\n            - 0 indicates the head is **masked**.\n\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated\n            vectors than the model's internal embedding lookup matrix.\n        output_attentions (:obj:`bool`, `optional`):\n            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned\n            tensors for more detail.\n        output_hidden_states (:obj:`bool`, `optional`):\n            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for\n            more detail.\n        return_dict (:obj:`bool`, `optional`):\n            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of\n    cross-attention is added between the self-attention layers, following the architecture described in `Attention is\n    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,\n    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration\n    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`\n    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an\n    input to the forward pass.\n    \"\"\"\n\n    def __init__(self, config, add_pooling_layer=True):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n\n        self.pooler = BertPooler(config) if add_pooling_layer else None\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=BaseModelOutputWithPoolingAndCrossAttentions,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n        \"\"\"\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            batch_size, seq_length = input_shape\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size, seq_length = input_shape\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x 
seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n\n            token_type_ids=token_type_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next\n    sentence prediction (classification)` head.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        next_sentence_label=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):\n            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair\n            (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):\n            Used to hide legacy arguments that have been deprecated.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForPreTraining\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.prediction_logits\n            >>> seq_relationship_logits = outputs.seq_relationship_logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        total_loss = None\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n\n        if not return_dict:\n            output = (prediction_scores, seq_relationship_score) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return BertForPreTrainingOutput(\n            loss=total_loss,\n            prediction_logits=prediction_scores,\n            seq_relationship_logits=seq_relationship_score,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. 
\"\"\", BERT_START_DOCSTRING\n)\nclass BertLMHeadModel(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [ r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if not config.is_decoder:\n            logger.warning(\"If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\")\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in\n            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are\n            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')\n            >>> config = BertConfig.from_pretrained(\"bert-base-cased\")\n            >>> config.is_decoder = True\n            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        if labels is not None:\n            use_cache = False\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        lm_loss = None\n        if labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            labels = labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((lm_loss,) + output) if lm_loss is not None else output\n\n        return CausalLMOutputWithCrossAttentions(\n            loss=lm_loss,\n            logits=prediction_scores,\n            past_key_values=outputs.past_key_values,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            cross_attentions=outputs.cross_attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # cut decoder_input_ids if past is used\n        if past is not None:\n            input_ids = input_ids[:, -1:]\n\n        return {\"input_ids\": input_ids, \"attention_mask\": 
attention_mask, \"past_key_values\": past}\n\n    def _reorder_cache(self, past, beam_idx):\n        reordered_past = ()\n        for layer_past in past:\n            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)\n        return reordered_past\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `NeZhaForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MaskedLMOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        \"\"\"\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        masked_lm_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output\n\n        return MaskedLMOutput(\n            loss=masked_lm_loss,\n            logits=prediction_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        #  add a dummy token\n        assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n        attention_mask = torch.cat([attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1)\n        dummy_token = torch.full(\n            (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n        )\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n        **kwargs\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n            (see ``input_ids`` docstring). 
Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForNextSentencePrediction\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n            >>> prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n            >>> next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n            >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n\n            >>> outputs = model(**encoding, labels=torch.LongTensor([1]))\n            >>> logits = outputs.logits\n            >>> assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        if \"next_sentence_label\" in kwargs:\n            warnings.warn(\n                \"The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.\",\n                FutureWarning,\n            )\n            labels = kwargs.pop(\"next_sentence_label\")\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_scores = self.cls(pooled_output)\n\n        next_sentence_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1))\n\n        if not return_dict:\n            output = (seq_relationship_scores,) + outputs[2:]\n            return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output\n\n        return NextSentencePredictorOutput(\n            loss=next_sentence_loss,\n            logits=seq_relationship_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled\n    output) e.g. 
for GLUE tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=SequenceClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,\n            config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        loss = None\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequenceClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a\n    softmax) e.g. 
for RocStories/SWAG tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, num_choices, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MultipleChoiceModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,\n            num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See\n            :obj:`input_ids` above)\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        inputs_embeds = (\n            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))\n            if inputs_embeds is not None\n            else None\n        )\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n\n        if not return_dict:\n            output = (reshaped_logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MultipleChoiceModelOutput(\n            loss=loss,\n            logits=reshaped_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for\n    Named-Entity-Recognition (NER) tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=TokenClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -\n            1]``.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n       
 self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=QuestionAnsweringModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            \n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        total_loss = None\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n        if not return_dict:\n            output = (start_logits, end_logits) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return QuestionAnsweringModelOutput(\n  
          loss=total_loss,\n            start_logits=start_logits,\n            end_logits=end_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/finetuning/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        if self.isDropout:\n            output = self.dropout(pooler_output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        # gate each layer's [CLS] vector with a learned scalar weight\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][:, 0]\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, 
self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        # gate each layer's mean-pooled output with a learned scalar weight\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = weight\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        #self.bert_model = MODELS[config.model](config=self.bert_config)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = 
self.classifier(concat_out)\n        return logit\n\n\n"
  },
  {
    "path": "code/nezha-base-count3/finetuning/models/gitkeep",
    "content": ""
  },
  {
    "path": "code/nezha-base-count3/finetuning/multi_gpu_QA.py",
    "content": "from tqdm import tqdm, trange\nimport numpy as np\nimport pandas as pd\nimport logging\nimport torch\nimport random\nimport os\nfrom torch import nn, optim\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nfrom transformers.optimization import get_linear_schedule_with_warmup\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.metrics import mean_absolute_error, accuracy_score, f1_score, roc_auc_score\nfrom model import *\nfrom utils import *\nimport time\nimport logging\nlogging.basicConfig(level=logging.DEBUG, filename=\"train.log\",filemode='a')\n\n\nfrom NEZHA.modeling_nezha import *\n\nMODEL_CLASSES = {\n    'BertForClass': BertForClass,\n    'BertLastCls': BertLastCls,\n    'BertLastTwoCls': BertLastTwoCls,\n    'BertLastTwoClsPooler': BertLastTwoClsPooler,\n    'BertLastTwoEmbeddings': BertLastTwoEmbeddings,\n    'BertLastTwoEmbeddingsPooler': BertLastTwoEmbeddingsPooler,\n    'BertLastFourCls': BertLastFourCls,\n    'BertLastFourClsPooler': BertLastFourClsPooler,\n    'BertLastFourEmbeddings': BertLastFourEmbeddings,\n    'BertLastFourEmbeddingsPooler': BertLastFourEmbeddingsPooler,\n    'BertDynCls': BertDynCls,\n    'BertDynEmbeddings': BertDynEmbeddings,\n    'BertRNN': BertRNN,\n    'BertCNN': BertCNN,\n    'BertRCNN': BertRCNN,\n    'XLNet': XLNet,\n    'Electra': Electra,\n    'NEZHA': NEZHA,\n\n}\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"NEZHA\"\n        self.Stratification = False\n        self.model_path = '../pretrain/nezha_model/'\n\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 32\n        self.epoch = 3\n        self.learn_rate = 4e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 32\n        self.k_fold = 10\n        self.seed = 42\n\n        self.device = torch.device('cuda')\n        # self.device = torch.device('cpu')\n\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n\nconfig = Config()\nos.environ['PYTHONHASHSEED']='0'#消除hash算法的随机性\nrandom.seed(config.seed)\nnp.random.seed(config.seed)\ntorch.manual_seed(config.seed)\ntorch.cuda.manual_seed_all(config.seed)\n\n\nfile_path = './log/'\n# 创建一个logger\nlogger = logging.getLogger('mylogger')\nlogger.setLevel(logging.DEBUG)\n\n\ntrain = pd.read_csv('/tcdata/gaiic_track3_round1_train_20210228.tsv',sep='\\t',header=None)\nsemi = pd.read_csv('/tcdata/gaiic_track3_round2_train_20210407.tsv',sep='\\t',header=None)\ntrain = pd.concat([train, semi], sort=False)\ntrain.columns=['q1','q2','label']\n\n\ntrain_query1 = train['q1'].values.astype(str)\ntrain_query2 = train['q2'].values.astype(str)\ntrain_label = train['label'].values.astype(int)\n\n\noof_train = np.zeros((len(train), config.num_class), dtype=np.float32)\n\n\n#kf = StratifiedKFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\nkf = KFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\n\nfor fold, (train_index, valid_index) in enumerate(kf.split(train_query1, train_label)):\n\n    print('\\n\\n------------fold:{}------------\\n'.format(fold))\n\n    '''\n    q1 = train_query1[train_index]\n    q2 = train_query2[train_index]\n    y = train_label[train_index]\n    '''\n    q1 = train_query1\n    q2 = train_query2\n    y = train_label\n\n\n    val_q1 = train_query1[valid_index]\n    val_q2 = train_query2[valid_index]\n    val_y = 
train_label[valid_index]\n\n    train_D = data_generator([q1, q2, y], config, shuffle=True)\n    val_D = data_generator([val_q1, val_q2, val_y], config)\n\n    model = MODEL_CLASSES[config.model](config).to(config.device)\n\n    if torch.cuda.device_count() > 1:\n        print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n        model = torch.nn.DataParallel(model)\n\n\n    if config.pgd:\n        pgd = PGD(model)\n        K = 3\n\n    elif config.fgm:\n        fgm = FGM(model)\n\n    if config.focalloss:\n        loss_fn = FocalLoss(config.num_class)\n    else:\n        loss_fn = nn.CrossEntropyLoss()  # BCEWithLogitsLoss combines a Sigmoid layer and BCELoss in a single step\n\n\n    num_train_steps = int(len(train) / config.batch_size * config.epoch)\n    param_optimizer = list(model.named_parameters())\n\n    no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n\n    if config.Stratification:\n        bert_params = [x for x in param_optimizer if 'bert' in x[0]]\n        normal_params = [p for n, p in param_optimizer if 'bert' not in n]\n        optimizer_parameters = [\n            {'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n            {'params': normal_params, 'lr': config.normal_lr},\n        ]\n    else:\n        optimizer_parameters = [\n            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n        ]\n\n    optimizer = AdamW(optimizer_parameters, lr=config.learn_rate) # lr is the global learning rate\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer,\n        num_warmup_steps=int(len(train) / config.batch_size / 2),\n        num_training_steps=num_train_steps\n    )\n\n    best_auc = 0\n    PATH = './models/bert_{}.pth'.format(fold)\n    save_model_path = './models/'\n    if not os.path.exists(save_model_path):\n        os.makedirs(save_model_path)\n\n    for e in range(config.epoch):\n        print('\\n------------epoch:{}------------'.format(e))\n        model.train()\n        acc = 0\n        train_len = 0\n        loss_num = 0\n        tq = tqdm(train_D,ncols=70,disable=True)\n        last=time.time()\n        for input_ids, input_masks, segment_ids, labels in tq:\n            label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n            y_pred = model(input_ids, input_masks, segment_ids)\n\n            loss = loss_fn(y_pred, label_t)\n            loss = loss.mean()\n            loss.backward()\n\n            if config.pgd:\n                pgd.backup_grad()\n                # adversarial training\n                for t in range(K):\n                    pgd.attack(is_first_attack=(t == 0))  # add adversarial perturbation to the embeddings; back up param.data on the first attack\n                    if t != K - 1:\n                        model.zero_grad()\n                    else:\n                        pgd.restore_grad()\n                    y_pred = model(input_ids, input_masks, segment_ids)\n\n                    loss_adv = loss_fn(y_pred, label_t)\n                    loss_adv = loss_adv.mean()\n                    loss_adv.backward()  # accumulate the adversarial gradients on top of the normal gradients\n                pgd.restore()  # restore the embedding parameters\n\n            elif config.fgm:\n                # adversarial training\n                fgm.attack()  # add adversarial perturbation to the embeddings\n                y_pred = model(input_ids, input_masks, 
segment_ids)\n                loss_adv = loss_fn(y_pred, label_t)\n                loss_adv = loss_adv.mean()\n                loss_adv.backward()  # accumulate the adversarial gradients on top of the normal gradients\n                fgm.restore()  # restore the embedding parameters\n\n\n            # gradient step: update the parameters\n            optimizer.step()\n            scheduler.step()  # Update learning rate schedule\n            model.zero_grad()\n\n            y_pred = np.argmax(y_pred.detach().to(\"cpu\").numpy(), axis=1)\n            acc += sum(y_pred == labels)\n            loss_num += loss.item()\n            train_len += len(labels)\n            tq.set_postfix(fold=fold, epoch=e, loss=loss_num / train_len, acc=acc / train_len)\n        print(f\"Fine-tuning epoch {e} took {time.time()-last}s\")\n        model.eval()\n        with torch.no_grad():\n            y_p = []\n            y_l = []\n            train_logit = None\n            for input_ids, input_masks, segment_ids, labels in tqdm(val_D,disable=True):\n                label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n                y_pred = model(input_ids, input_masks, segment_ids)\n                y_pred = F.softmax(y_pred, dim=1)\n                y_pred = y_pred.detach().to(\"cpu\").numpy()\n                if train_logit is None:\n                    train_logit = y_pred\n                else:\n                    train_logit = np.vstack((train_logit, y_pred))\n\n                y_p += list(y_pred[:,1])\n\n                y_pred = np.argmax(y_pred, axis=1)\n                y_l += list(y_pred)\n\n\n            f1 = f1_score(val_y, y_l, average=\"macro\")\n            auc_score = roc_auc_score(val_y, y_p)\n            print(\"best_auc:{}  auc_score:{}  f1:{}\\n\".format(best_auc, auc_score, f1))\n            if auc_score >= best_auc:\n                best_auc = auc_score\n                oof_train[valid_index] = np.array(train_logit)\n                #torch.save(model.module.state_dict() if hasattr(model, \"module\") else model.state_dict(), PATH)\n                torch.save(model.module if hasattr(model, \"module\") else model, PATH)\n\n    optimizer.zero_grad()\n\n    del model\n    torch.cuda.empty_cache()\n\n    break\n\n"
  },
  {
    "path": "code/nezha-base-count3/finetuning/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\nclass data_generator:\n    def __init__(self, data, config, shuffle=False):\n        self.data = data\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n        self.steps = len(self.data[0]) // self.batch_size\n        if len(self.data[0]) % self.batch_size != 0:\n            self.steps += 1\n\n    def __len__(self):\n        return self.steps\n\n    def __iter__(self):\n        q1, q2, y = self.data\n        idxs = list(range(len(self.data[0])))\n        if self.shuffle:\n            np.random.shuffle(idxs)\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        for index, i in enumerate(idxs):\n\n            text = q1[i]\n            text_pair = q2[i]\n            '''\n            # text = self.tokenizer(text, text_pair, padding='max_length', truncation=True, max_length=self.max_length)\n            text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n            input_ids.append(text['input_ids'])\n            segment_ids.append(text['token_type_ids'])\n            input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n            yield input_ids, input_masks, segment_ids, labels\n            input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n            '''\n            tkRes = self.tokenizer(text, text_pair, max_length=self.max_length, truncation='longest_first',\n                                   return_attention_mask=False)\n            input_id = tkRes['input_ids']\n            segment_id = tkRes['token_type_ids']\n            assert len(segment_id) == len(input_id)\n            input_ids.append(input_id)\n            segment_ids.append(segment_id)\n            labels.append(y[i])\n\n            if len(input_ids) == self.batch_size or i == idxs[-1]:\n                input_ids = paddingList(input_ids, 0, returnTensor=True)  # 动态padding\n                segment_ids = paddingList(segment_ids, 0, returnTensor=True)\n           
     input_masks = (input_ids != 0)\n                yield input_ids, input_masks, segment_ids, labels\n                input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=0.3, alpha=0.1, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\n\nclass FGM():\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.25, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# 支持多分类和二分类\nclass FocalLoss(nn.Module):\n    \"\"\"\n    This is a implementation of Focal Loss with smooth label cross entropy supported which is proposed in\n    'Focal Loss for Dense Object Detection. 
(https://arxiv.org/abs/1708.02002)'\n    Focal_Loss = -1*alpha*(1-pt)^gamma*log(pt)\n    :param num_class: (int) number of classes\n    :param alpha: (list, ndarray, optional) per-class weighting factor for this criterion\n    :param gamma: (float,double) gamma > 0 reduces the relative loss\n    for well-classified examples (p>0.5), putting more\n    focus on hard, misclassified examples\n    :param smooth: (float, optional) label smoothing value applied to the one-hot targets\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n    def __init__(self, num_class, alpha=None, gamma=2,\n                smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Unsupported alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        # N = input.size(0)\n        # alpha = torch.ones(N, self.num_class)\n        # alpha = alpha * (1 - self.alpha)\n        # alpha = alpha.scatter_(1, target.long(), self.alpha)\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        alpha = alpha[idx]\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true, y_pred):\n    # F1 score of the positive (match) class; y_true and y_pred are 0/1 arrays\n    precision = sum(y_pred & y_true) / sum(y_pred)\n    recall = sum(y_pred & y_true) / sum(y_true)\n\n    return 2 * precision * recall / (precision + recall)"
  },
  {
    "path": "code/nezha-base-count3/pretrain/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.utils.checkpoint\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom transformers.activations import ACT2FN\nfrom transformers.file_utils import (\n    ModelOutput,\n    add_code_sample_docstrings,\n    add_start_docstrings,\n    add_start_docstrings_to_model_forward,\n    replace_return_docstrings,\n)\nfrom transformers.modeling_outputs import (\n    BaseModelOutputWithPastAndCrossAttentions,\n    BaseModelOutputWithPoolingAndCrossAttentions,\n    CausalLMOutputWithCrossAttentions,\n    MaskedLMOutput,\n    MultipleChoiceModelOutput,\n    NextSentencePredictorOutput,\n    QuestionAnsweringModelOutput,\n    SequenceClassifierOutput,\n    TokenClassifierOutput,\n)\nfrom transformers.modeling_utils import (\n    PreTrainedModel,\n    apply_chunking_to_forward,\n    find_pruneable_heads_and_indices,\n    prune_linear_layer,\n)\n\nfrom transformers.models.bert.configuration_bert import BertConfig\n\nimport logging\nlogger = logging.getLogger(__name__)\n\n_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n_CONFIG_FOR_DOC = \"BertConfig\"\n_TOKENIZER_FOR_DOC = \"BertTokenizer\"\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, 
scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=64):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    
my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.position_embedding_type = getattr(config, \"position_embedding_type\", \"absolute\")\n        if self.position_embedding_type == \"relative_key\" or self.position_embedding_type == \"relative_key_query\":\n            self.max_position_embeddings = config.max_position_embeddings\n            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)\n\n        self.is_decoder = config.is_decoder\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        is_cross_attention = encoder_hidden_states is not None\n\n        if is_cross_attention and past_key_value is not None:\n            # reuse k,v, cross_attentions\n            key_layer = past_key_value[0]\n            value_layer = past_key_value[1]\n            attention_mask = encoder_attention_mask\n        elif is_cross_attention:\n            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))\n            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))\n            attention_mask = encoder_attention_mask\n        elif past_key_value is not None:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)\n            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)\n        else:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n\n        if self.is_decoder:\n            # if 
cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.\n            # Further calls to cross_attention layer can then reuse all cross-attention\n            # key/value_states (first \"if\" case)\n            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of\n            # all previous decoder key/value_states. Further calls to uni-directional self-attention\n            # can concat previous decoder key/value_states to current projected key/value_states (third \"elif\" case)\n            # if encoder bi-directional self-attention `past_key_value` is always `None`\n            past_key_value = (key_layer, value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_kv.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in NeZhaModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_kv)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)\n\n        if 
self.is_decoder:\n            outputs = outputs + (past_key_value,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        heads, index = find_pruneable_heads_and_indices(\n            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads\n        )\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        self_outputs = self.self(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            encoder_hidden_states,\n            encoder_attention_mask,\n            past_key_value,\n            output_attentions,\n            relations_kv=relations_kv\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = 
self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        self.add_cross_attention = config.add_cross_attention\n        if self.add_cross_attention:\n            assert self.is_decoder, f\"{self} should be used as a decoder model if cross attention is added\"\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2\n        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None\n        self_attention_outputs = self.attention(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            output_attentions=output_attentions,\n            past_key_value=self_attn_past_key_value,\n            relations_kv=relations_kv\n        )\n        attention_output = self_attention_outputs[0]\n\n        # if decoder, the last output is tuple of self-attn cache\n        if self.is_decoder:\n            outputs = self_attention_outputs[1:-1]\n            present_key_value = self_attention_outputs[-1]\n        else:\n            outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        cross_attn_present_key_value = None\n        if self.is_decoder and encoder_hidden_states is not None:\n            assert hasattr(\n                self, \"crossattention\"\n            ), f\"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`\"\n\n            # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple\n            cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None\n            cross_attention_outputs = self.crossattention(\n                attention_output,\n                attention_mask,\n                head_mask,\n                encoder_hidden_states,\n                encoder_attention_mask,\n                cross_attn_past_key_value,\n                output_attentions,\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights\n\n            # add cross-attn cache to positions 3,4 of present_key_value tuple\n            cross_attn_present_key_value = cross_attention_outputs[-1]\n            present_key_value = present_key_value + cross_attn_present_key_value\n\n        layer_output = apply_chunking_to_forward(\n            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output\n        )\n        outputs = (layer_output,) + outputs\n\n        # if decoder, return the attn key/values as the last output\n        if self.is_decoder:\n            outputs = 
outputs + (present_key_value,)\n\n        return outputs\n\n    def feed_forward_chunk(self, attention_output):\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        return layer_output\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=int(config.hidden_size / config.num_attention_heads),\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=False,\n        output_hidden_states=False,\n        return_dict=False,\n    ):\n        to_seq_length=hidden_states.shape[1]\n        relations_kv = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        all_hidden_states = () if output_hidden_states else None\n        all_self_attentions = () if output_attentions else None\n        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None\n\n        next_decoder_cache = () if use_cache else None\n        for i, layer_module in enumerate(self.layer):\n            if output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_head_mask = head_mask[i] if head_mask is not None else None\n            past_key_value = past_key_values[i] if past_key_values is not None else None\n\n            if getattr(self.config, \"gradient_checkpointing\", False) and self.training:\n\n                if use_cache:\n                    logger.warn(\n                        \"`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. 
Setting \"\n                        \"`use_cache=False`...\"\n                    )\n                    use_cache = False\n\n                def create_custom_forward(module):\n                    def custom_forward(*inputs):\n                        return module(*inputs, past_key_value, output_attentions)\n\n                    return custom_forward\n\n                layer_outputs = torch.utils.checkpoint.checkpoint(\n                    create_custom_forward(layer_module),\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                )\n            else:\n                layer_outputs = layer_module(\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                    past_key_value,\n                    output_attentions,relations_kv=relations_kv\n                )\n\n            hidden_states = layer_outputs[0]\n            if use_cache:\n                next_decoder_cache += (layer_outputs[-1],)\n            if output_attentions:\n                all_self_attentions = all_self_attentions + (layer_outputs[1],)\n                if self.config.add_cross_attention:\n                    all_cross_attentions = all_cross_attentions + (layer_outputs[2],)\n\n        if output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        if not return_dict:\n            return tuple(\n                v\n                for v in [\n                    hidden_states,\n                    next_decoder_cache,\n                    all_hidden_states,\n                    all_self_attentions,\n                    all_cross_attentions,\n                ]\n                if v is not None\n            )\n        return BaseModelOutputWithPastAndCrossAttentions(\n            last_hidden_state=hidden_states,\n            past_key_values=next_decoder_cache,\n            hidden_states=all_hidden_states,\n            attentions=all_self_attentions,\n            cross_attentions=all_cross_attentions,\n        )\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n  
  def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\n@dataclass\nclass BertForPreTrainingOutput(ModelOutput):\n    \"\"\"\n    Output type of :class:`~transformers.BertForPreTraining`.\n\n    Args:\n        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction\n            (classification) loss.\n        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language 
modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation\n            before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,\n            sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: Optional[torch.FloatTensor] = None\n    prediction_logits: torch.FloatTensor = None\n    seq_relationship_logits: torch.FloatTensor = None\n    hidden_states: Optional[Tuple[torch.FloatTensor]] = None\n    attentions: Optional[Tuple[torch.FloatTensor]] = None\n\n\nBERT_START_DOCSTRING = r\"\"\"\n\n    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic\n    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,\n    pruning heads etc.)\n\n    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__\n    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to\n    general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the\n            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model\n            weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`~transformers.BertTokenizer`. See\n            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for\n            details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):\n            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Segment token indices to indicate first and second portions of the inputs. 
Indices are selected in ``[0,\n            1]``:\n\n            - 0 corresponds to a `sentence A` token,\n            - 1 corresponds to a `sentence B` token.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,\n            config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):\n            Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:\n\n            - 1 indicates the head is **not masked**,\n            - 0 indicates the head is **masked**.\n\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated\n            vectors than the model's internal embedding lookup matrix.\n        output_attentions (:obj:`bool`, `optional`):\n            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned\n            tensors for more detail.\n        output_hidden_states (:obj:`bool`, `optional`):\n            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for\n            more detail.\n        return_dict (:obj:`bool`, `optional`):\n            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of\n    cross-attention is added between the self-attention layers, following the architecture described in `Attention is\n    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,\n    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration\n    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`\n    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an\n    input to the forward pass.\n    \"\"\"\n\n    def __init__(self, config, add_pooling_layer=True):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n\n        self.pooler = BertPooler(config) if add_pooling_layer else None\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=BaseModelOutputWithPoolingAndCrossAttentions,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n        \"\"\"\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            batch_size, seq_length = input_shape\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size, seq_length = input_shape\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x 
seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n\n            token_type_ids=token_type_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next\n    sentence prediction (classification)` head.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        next_sentence_label=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):\n            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair\n            (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):\n            Used to hide legacy arguments that have been deprecated.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForPreTraining\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.prediction_logits\n            >>> seq_relationship_logits = outputs.seq_relationship_logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        total_loss = None\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n\n        if not return_dict:\n            output = (prediction_scores, seq_relationship_score) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return BertForPreTrainingOutput(\n            loss=total_loss,\n            prediction_logits=prediction_scores,\n            seq_relationship_logits=seq_relationship_score,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. 
\"\"\", BERT_START_DOCSTRING\n)\nclass BertLMHeadModel(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [ r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if not config.is_decoder:\n            logger.warning(\"If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\")\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in\n            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are\n            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')\n            >>> config = BertConfig.from_pretrained(\"bert-base-cased\")\n            >>> config.is_decoder = True\n            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        if labels is not None:\n            use_cache = False\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        lm_loss = None\n        if labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            labels = labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((lm_loss,) + output) if lm_loss is not None else output\n\n        return CausalLMOutputWithCrossAttentions(\n            loss=lm_loss,\n            logits=prediction_scores,\n            past_key_values=outputs.past_key_values,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            cross_attentions=outputs.cross_attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # cut decoder_input_ids if past is used\n        if past is not None:\n            input_ids = input_ids[:, -1:]\n\n        return {\"input_ids\": input_ids, \"attention_mask\": 
attention_mask, \"past_key_values\": past}\n\n    def _reorder_cache(self, past, beam_idx):\n        reordered_past = ()\n        for layer_past in past:\n            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)\n        return reordered_past\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `NeZhaForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MaskedLMOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        \"\"\"\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        masked_lm_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output\n\n        return MaskedLMOutput(\n            loss=masked_lm_loss,\n            logits=prediction_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        #  add a dummy token\n        assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n        attention_mask = torch.cat([attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1)\n        dummy_token = torch.full(\n            (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n        )\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n        **kwargs\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n            (see ``input_ids`` docstring). 
Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForNextSentencePrediction\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n            >>> prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n            >>> next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n            >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n\n            >>> outputs = model(**encoding, labels=torch.LongTensor([1]))\n            >>> logits = outputs.logits\n            >>> assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        if \"next_sentence_label\" in kwargs:\n            warnings.warn(\n                \"The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.\",\n                FutureWarning,\n            )\n            labels = kwargs.pop(\"next_sentence_label\")\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_scores = self.cls(pooled_output)\n\n        next_sentence_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1))\n\n        if not return_dict:\n            output = (seq_relationship_scores,) + outputs[2:]\n            return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output\n\n        return NextSentencePredictorOutput(\n            loss=next_sentence_loss,\n            logits=seq_relationship_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled\n    output) e.g. 
for GLUE tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=SequenceClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,\n            config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        loss = None\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequenceClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a\n    softmax) e.g. 
for RocStories/SWAG tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, num_choices, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MultipleChoiceModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,\n            num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See\n            :obj:`input_ids` above)\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        inputs_embeds = (\n            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))\n            if inputs_embeds is not None\n            else None\n        )\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n\n        if not return_dict:\n            output = (reshaped_logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MultipleChoiceModelOutput(\n            loss=loss,\n            logits=reshaped_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for\n    Named-Entity-Recognition (NER) tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=TokenClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -\n            1]``.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n       
 self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=QuestionAnsweringModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            \n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        total_loss = None\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n        if not return_dict:\n            output = (start_logits, end_logits) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return QuestionAnsweringModelOutput(\n  
          loss=total_loss,\n            start_logits=start_logits,\n            end_logits=end_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/NLP_Utils.py",
    "content": "import random\nimport json\nimport transformers as _\nfrom transformers1 import BertTokenizer\nimport torch\nfrom torch.utils.data import Dataset,DataLoader\nimport numpy as np\nfrom itertools import chain\n\ndef writeToJsonFile(path: str, obj):\n    with open(path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(obj, ensure_ascii=False,indent=0))\ndef readFromJsonFile(path: str):\n    with open(path, \"r\", encoding=\"utf-8\") as f:\n        return json.loads(f.read())\n\ndef loadData(path):\n    allData=[]\n    with open(path,\"r\") as f:\n        for i in f:\n            i=i.strip().split('\\t')\n            if len(i)==0:#防止空行\n                break\n            if len(i)==3:#训练集\n                a,b,label=i\n                a=a.split(' ')\n                b=b.split(' ')\n            else:#测试集，直接转为id形式\n                a,b,label=i[0],i[1],-1\n                a=a.split(' ')\n                b=b.split(' ')\n            allData.append([a,b,label])\n    return allData\n\ndef calNegPos(ls):#计算正负比例\n    posNum,negNum=0,0\n    for i in ls:\n        if i[2]==0:\n            negNum+=1\n        elif i[2]==1:\n            posNum+=1\n    posNum=1 if posNum==0 else posNum\n    return negNum,posNum,round(negNum/posNum,4)\n\nallData=loadData('/tcdata/gaiic_track3_round1_train_20210228.tsv')+loadData('/tcdata/gaiic_track3_round2_train_20210407.tsv')\ntestA_data = loadData('/tcdata/gaiic_track3_round1_testA_20210228.tsv')\ntestB_data = loadData('/tcdata/gaiic_track3_round1_testB_20210317.tsv')\nrandom.shuffle(allData)\n\ntrain_data=allData+testA_data+testB_data#全量\nvalid_data=allData[-20000:]\nprint(\"训练集样本数量：\", len(train_data))\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef truncate(a:list,b:list,maxLen):\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    return a,b\n\nclass MLM_Data(Dataset):\n    #传入句子对列表\n    def __init__(self,textLs:list,maxLen:int,tk:BertTokenizer):\n        super().__init__()\n        self.data=textLs\n        self.maxLen=maxLen\n        self.tk=tk\n        self.spNum=len(tk.all_special_tokens)\n        self.tkNum=tk.vocab_size\n\n    def __len__(self):\n        return len(self.data)\n\n    def random_mask(self,text_ids):\n        input_ids, output_ids = [], []\n        rands = np.random.random(len(text_ids))\n        idx=0\n        while idx<len(rands):\n            if rands[idx]<0.15:#需要mask\n                ngram=np.random.choice([1,2,3], p=[0.7,0.2,0.1])#若要mask，进行x_gram mask的概率\n                if ngram==3 and len(rands)<7:#太大的gram不要应用于过短文本\n                    ngram=2\n                if ngram==2 and len(rands)<4:\n                    ngram=1\n                L=idx+1\n                R=idx+ngram#最终需要mask的右边界（开）\n                while L<R and L<len(rands):\n                    rands[L]=np.random.random()*0.15#强制mask\n                    L+=1\n                idx=R\n                if idx<len(rands):\n                    rands[idx]=1#禁止mask片段的下一个token被mask，防止一大片连续mask\n 
           idx+=1\n\n        for r, i in zip(rands, text_ids):\n            if r < 0.15 * 0.8:\n                input_ids.append(self.tk.mask_token_id)\n                output_ids.append(i)#mask预测自己\n            elif r < 0.15 * 0.9:\n                input_ids.append(i)\n                output_ids.append(i)#自己预测自己\n            elif r < 0.15:\n                input_ids.append(np.random.randint(self.spNum,self.tkNum))\n                output_ids.append(i)#随机的一个词预测自己，随机词不会从特殊符号中选取，有小概率抽到自己\n            else:\n                input_ids.append(i)\n                output_ids.append(-100)#保持原样不预测\n\n        return input_ids, output_ids\n\n    #耗时操作在此进行，可用上多进程\n    def __getitem__(self, item):\n        text1,text2,_=self.data[item]#预处理，mask等操作\n        if random.random()>0.5:\n            text1,text2=text2,text1#交换位置\n        text1,text2=truncate(text1,text2,self.maxLen)\n        text1_ids,text2_ids = self.tk.convert_tokens_to_ids(text1),self.tk.convert_tokens_to_ids(text2)\n        text1_ids, out1_ids = self.random_mask(text1_ids)#添加mask预测\n        text2_ids, out2_ids = self.random_mask(text2_ids)\n        input_ids = [self.tk.cls_token_id] + text1_ids + [self.tk.sep_token_id] + text2_ids + [self.tk.sep_token_id]#拼接\n        token_type_ids=[0]*(len(text1_ids)+2)+[1]*(len(text2_ids)+1)\n        labels = [-100] + out1_ids + [-100] + out2_ids + [-100]\n        assert len(input_ids)==len(token_type_ids)==len(labels)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids,'labels':labels}\n\n    @classmethod\n    def collate(cls,batch):\n        input_ids=[i['input_ids'] for i in batch]\n        token_type_ids=[i['token_type_ids'] for i in batch]\n        labels=[i['labels'] for i in batch]\n        input_ids=paddingList(input_ids,0,returnTensor=True)\n        token_type_ids=paddingList(token_type_ids,0,returnTensor=True)\n        labels=paddingList(labels,-100,returnTensor=True)\n        attention_mask=(input_ids!=0)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids\n                ,'attention_mask':attention_mask,'labels':labels}\n\n\n\n\nunionList=lambda ls:list(chain(*ls))#按元素拼接\nsplitList=lambda x,bs:[x[i:i+bs] for i in range(0,len(x),bs)]#按bs切分\n\n\n#sortBsNum：原序列按多少个bs块为单位排序，可用来增强随机性\n#比如如果每次打乱后都全体一起排序，那每次都是一样的\ndef blockShuffle(data:list,bs:int,sortBsNum,key):\n    random.shuffle(data)#先打乱\n    tail=len(data)%bs#计算碎片长度\n    tail=[] if tail==0 else data[-tail:]\n    data=data[:len(data)-len(tail)]\n    assert len(data)%bs==0#剩下的一定能被bs整除\n    sortBsNum=len(data)//bs if sortBsNum is None else sortBsNum#为None就是整体排序\n    data=splitList(data,sortBsNum*bs)\n    data=[sorted(i,key=key,reverse=True) for i in data]#每个大块进行降排序\n    data=unionList(data)\n    data=splitList(data,bs)#最后，按bs分块\n    random.shuffle(data)#块间打乱\n    data=unionList(data)+tail\n    return data\nfrom torch.utils.data.dataloader import _SingleProcessDataLoaderIter,_MultiProcessingDataLoaderIter\n#每轮迭代重新分块shuffle数据的DataLoader\nclass blockShuffleDataLoader(DataLoader):\n    def __init__(self, dataset: Dataset,sortBsNum,key,**kwargs):\n        assert isinstance(dataset.data,list)#需要有list类型的data属性\n        super().__init__(dataset,**kwargs)#父类的参数传过去\n        self.sortBsNum=sortBsNum\n        self.key=key\n\n    def __iter__(self):\n        #分块shuffle\n        self.dataset.data=blockShuffle(self.dataset.data,self.batch_size,self.sortBsNum,self.key)\n        if self.num_workers == 0:\n            return _SingleProcessDataLoaderIter(self)\n        else:\n            return 
_MultiProcessingDataLoaderIter(self)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/__init__.py",
    "content": ""
  },
  {
    "path": "code/nezha-base-count3/pretrain/nezha_model/gitkeep",
    "content": ""
  },
  {
    "path": "code/nezha-base-count3/pretrain/train_nezha.py",
    "content": "# coding:utf-8\nimport numpy as np\nimport random\nimport os\nrandom.seed(0)\nnp.random.seed(0)#seed应该在main里尽早设置，以防万一\nos.environ['PYTHONHASHSEED'] =str(0)#消除hash算法的随机性\nimport transformers as _\nfrom transformers1 import Trainer, TrainingArguments,BertTokenizer\nfrom NLP_Utils import MLM_Data,train_data,blockShuffleDataLoader\n\nfrom NEZHA.configuration_nezha import NeZhaConfig\nfrom NEZHA.modeling_nezha import NeZhaForMaskedLM\n\nmaxlen=32\nbatch_size=128\nvocab_file_dir = './nezha_model/vocab.txt'\ntokenizer = BertTokenizer.from_pretrained(vocab_file_dir)\n\nconfig = NeZhaConfig(\n    vocab_size=len(tokenizer),\n    hidden_size=768,\n    num_hidden_layers=12,\n    num_attention_heads=12,\n    max_position_embeddings=512,\n)\n\n\n\nmodel = NeZhaForMaskedLM.from_pretrained(\"../../nezha-cn-base/\")\n\nmodel.resize_token_embeddings(len(tokenizer))\nprint(model)\ntrain_MLM_data=MLM_Data(train_data,maxlen,tokenizer)\n#自己定义dataloader，不要用huggingface的\ndl=blockShuffleDataLoader(train_MLM_data,None,key=lambda x:len(x[0])+len(x[1]),shuffle=False\n                          ,batch_size=batch_size,collate_fn=train_MLM_data.collate)\n\ntraining_args = TrainingArguments(\n    output_dir='./nezha_output',\n    overwrite_output_dir=True,\n    num_train_epochs=400,\n    per_device_train_batch_size=batch_size,\n    save_steps=len(dl)*10000,#每10个epoch save一次\n    save_total_limit=3,\n    logging_steps=len(dl),#每个epoch log一次\n    seed=2021,\n    learning_rate=5e-5,\n    weight_decay=0.01,\n    warmup_steps=int(450000*150/batch_size*0.03)\n)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataLoader=dl,\n    prediction_loss_only=True,\n)\n\nif __name__ == '__main__':\n    trainer.train()\n    trainer.save_model('./nezha_model')\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\n__version__ = \"2.11.0\"\n\n# Work around to update TensorFlow's absl.logging threshold which alters the\n# default Python logging output behavior when present.\n# see: https://github.com/abseil/abseil-py/issues/99\n# and: https://github.com/tensorflow/tensorflow/issues/26691#issuecomment-500369493\ntry:\n    import absl.logging\nexcept ImportError:\n    pass\nelse:\n    absl.logging.set_verbosity(\"info\")\n    absl.logging.set_stderrthreshold(\"info\")\n    absl.logging._warn_preinit_stderr = False\n\nimport logging\n\n# Configurations\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig\nfrom .configuration_bart import BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_mmbt import MMBTConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\nfrom .data import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    is_sklearn_available,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n# Files and general utilities\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    PYTORCH_PRETRAINED_BERT_CACHE,\n    PYTORCH_TRANSFORMERS_CACHE,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    TRANSFORMERS_CACHE,\n    WEIGHTS_NAME,\n    add_end_docstrings,\n    add_start_docstrings,\n    cached_path,\n    is_tf_available,\n    is_torch_available,\n)\nfrom .hf_argparser import HfArgumentParser\n\n# Model Cards\nfrom .modelcard import ModelCard\n\n# TF 2.0 <=> PyTorch 
conversion utilities\nfrom .modeling_tf_pytorch_utils import (\n    convert_tf_weight_name_to_pt_weight_name,\n    load_pytorch_checkpoint_in_tf2_model,\n    load_pytorch_model_in_tf2_model,\n    load_pytorch_weights_in_tf2_model,\n    load_tf2_checkpoint_in_pytorch_model,\n    load_tf2_model_in_pytorch_model,\n    load_tf2_weights_in_pytorch_model,\n)\n\n# Pipelines\nfrom .pipelines import (\n    CsvPipelineDataFormat,\n    FeatureExtractionPipeline,\n    FillMaskPipeline,\n    JsonPipelineDataFormat,\n    NerPipeline,\n    PipedPipelineDataFormat,\n    Pipeline,\n    PipelineDataFormat,\n    QuestionAnsweringPipeline,\n    SummarizationPipeline,\n    TextClassificationPipeline,\n    TextGenerationPipeline,\n    TokenClassificationPipeline,\n    TranslationPipeline,\n    pipeline,\n)\n\n# Tokenizers\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer\nfrom .tokenization_bart import BartTokenizer, MBartTokenizer\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer\nfrom .trainer_utils import EvalPrediction\nfrom .training_args import TrainingArguments\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\nif is_sklearn_available():\n    from .data import glue_compute_metrics, xnli_compute_metrics\n\n\n# Modeling\nif is_torch_available():\n    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering, apply_chunking_to_forward\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForPreTraining,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelWithLMHead,\n        AutoModelForTokenClassification,\n        AutoModelForMultipleChoice,\n        MODEL_MAPPING,\n        MODEL_FOR_PRETRAINING_MAPPING,\n        MODEL_WITH_LM_HEAD_MAPPING,\n        MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n        MODEL_FOR_MULTIPLE_CHOICE_MAPPING,\n    )\n\n    from .modeling_bert import (\n        BertPreTrainedModel,\n        BertModel,\n        BertForPreTraining,\n        BertForMaskedLM,\n        BertForNextSentencePrediction,\n        BertForSequenceClassification,\n        BertForMultipleChoice,\n        
BertForTokenClassification,\n        BertForQuestionAnswering,\n        load_tf_weights_in_bert,\n        BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n        BertLayer,\n    )\n    from .modeling_openai import (\n        OpenAIGPTPreTrainedModel,\n        OpenAIGPTModel,\n        OpenAIGPTLMHeadModel,\n        OpenAIGPTDoubleHeadsModel,\n        load_tf_weights_in_openai_gpt,\n        OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_transfo_xl import (\n        TransfoXLPreTrainedModel,\n        TransfoXLModel,\n        TransfoXLLMHeadModel,\n        AdaptiveEmbedding,\n        load_tf_weights_in_transfo_xl,\n        TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_gpt2 import (\n        GPT2PreTrainedModel,\n        GPT2Model,\n        GPT2LMHeadModel,\n        GPT2DoubleHeadsModel,\n        load_tf_weights_in_gpt2,\n        GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_ctrl import CTRLPreTrainedModel, CTRLModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_LIST\n    from .modeling_xlnet import (\n        XLNetPreTrainedModel,\n        XLNetModel,\n        XLNetLMHeadModel,\n        XLNetForSequenceClassification,\n        XLNetForTokenClassification,\n        XLNetForMultipleChoice,\n        XLNetForQuestionAnsweringSimple,\n        XLNetForQuestionAnswering,\n        load_tf_weights_in_xlnet,\n        XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm import (\n        XLMPreTrainedModel,\n        XLMModel,\n        XLMWithLMHeadModel,\n        XLMForSequenceClassification,\n        XLMForTokenClassification,\n        XLMForQuestionAnswering,\n        XLMForQuestionAnsweringSimple,\n        XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_bart import (\n        BartForSequenceClassification,\n        BartModel,\n        BartForConditionalGeneration,\n        BART_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_marian import MarianMTModel\n    from .tokenization_marian import MarianTokenizer\n    from .modeling_roberta import (\n        RobertaForMaskedLM,\n        RobertaModel,\n        RobertaForSequenceClassification,\n        RobertaForMultipleChoice,\n        RobertaForTokenClassification,\n        RobertaForQuestionAnswering,\n        ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_distilbert import (\n        DistilBertPreTrainedModel,\n        DistilBertForMaskedLM,\n        DistilBertModel,\n        DistilBertForSequenceClassification,\n        DistilBertForQuestionAnswering,\n        DistilBertForTokenClassification,\n        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_camembert import (\n        CamembertForMaskedLM,\n        CamembertModel,\n        CamembertForSequenceClassification,\n        CamembertForMultipleChoice,\n        CamembertForTokenClassification,\n        CamembertForQuestionAnswering,\n        CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_encoder_decoder import EncoderDecoderModel\n    from .modeling_t5 import (\n        T5PreTrainedModel,\n        T5Model,\n        T5ForConditionalGeneration,\n        load_tf_weights_in_t5,\n        T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_albert import (\n        AlbertPreTrainedModel,\n        AlbertModel,\n        AlbertForPreTraining,\n        AlbertForMaskedLM,\n        AlbertForSequenceClassification,\n        AlbertForQuestionAnswering,\n        AlbertForTokenClassification,\n        load_tf_weights_in_albert,\n        
ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm_roberta import (\n        XLMRobertaForMaskedLM,\n        XLMRobertaModel,\n        XLMRobertaForMultipleChoice,\n        XLMRobertaForSequenceClassification,\n        XLMRobertaForTokenClassification,\n        XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_mmbt import ModalEmbeddings, MMBTModel, MMBTForClassification\n\n    from .modeling_flaubert import (\n        FlaubertModel,\n        FlaubertWithLMHeadModel,\n        FlaubertForSequenceClassification,\n        FlaubertForQuestionAnswering,\n        FlaubertForQuestionAnsweringSimple,\n        FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_electra import (\n        ElectraForPreTraining,\n        ElectraForMaskedLM,\n        ElectraForTokenClassification,\n        ElectraPreTrainedModel,\n        ElectraForSequenceClassification,\n        ElectraModel,\n        load_tf_weights_in_electra,\n        ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_reformer import (\n        ReformerAttention,\n        ReformerLayer,\n        ReformerModel,\n        ReformerModelWithLMHead,\n        REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_longformer import (\n        LongformerModel,\n        LongformerForMaskedLM,\n        LongformerForSequenceClassification,\n        LongformerForMultipleChoice,\n        LongformerForTokenClassification,\n        LongformerForQuestionAnswering,\n        LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization import (\n        AdamW,\n        get_constant_schedule,\n        get_constant_schedule_with_warmup,\n        get_cosine_schedule_with_warmup,\n        get_cosine_with_hard_restarts_schedule_with_warmup,\n        get_linear_schedule_with_warmup,\n    )\n\n    # Trainer\n    from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction\n    from .data.data_collator import DefaultDataCollator, DataCollator, DataCollatorForLanguageModeling\n    from .data.datasets import GlueDataset, TextDataset, LineByLineTextDataset, GlueDataTrainingArguments\n\n    # Benchmarks\n    from .benchmark import PyTorchBenchmark, PyTorchBenchmarkArguments\n\n# TensorFlow\nif is_tf_available():\n    from .modeling_tf_utils import (\n        TFPreTrainedModel,\n        TFSharedEmbeddings,\n        TFSequenceSummary,\n        shape_list,\n        tf_top_k_top_p_filtering,\n    )\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForPreTraining,\n        TFAutoModelForMultipleChoice,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelWithLMHead,\n        TFAutoModelForTokenClassification,\n        TF_MODEL_MAPPING,\n        TF_MODEL_FOR_PRETRAINING_MAPPING,\n        TF_MODEL_WITH_LM_HEAD_MAPPING,\n        TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n    )\n\n    from .modeling_tf_bert import (\n        TFBertPreTrainedModel,\n        TFBertMainLayer,\n        TFBertEmbeddings,\n        TFBertModel,\n        TFBertForPreTraining,\n        TFBertForMaskedLM,\n        TFBertForNextSentencePrediction,\n        TFBertForSequenceClassification,\n        TFBertForMultipleChoice,\n        TFBertForTokenClassification,\n        TFBertForQuestionAnswering,\n        TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_gpt2 import 
(\n        TFGPT2PreTrainedModel,\n        TFGPT2MainLayer,\n        TFGPT2Model,\n        TFGPT2LMHeadModel,\n        TFGPT2DoubleHeadsModel,\n        TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_openai import (\n        TFOpenAIGPTPreTrainedModel,\n        TFOpenAIGPTMainLayer,\n        TFOpenAIGPTModel,\n        TFOpenAIGPTLMHeadModel,\n        TFOpenAIGPTDoubleHeadsModel,\n        TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_transfo_xl import (\n        TFTransfoXLPreTrainedModel,\n        TFTransfoXLMainLayer,\n        TFTransfoXLModel,\n        TFTransfoXLLMHeadModel,\n        TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n        TFAdaptiveEmbedding,\n    )\n\n    from .modeling_tf_xlnet import (\n        TFXLNetPreTrainedModel,\n        TFXLNetMainLayer,\n        TFXLNetModel,\n        TFXLNetLMHeadModel,\n        TFXLNetForSequenceClassification,\n        TFXLNetForTokenClassification,\n        TFXLNetForQuestionAnsweringSimple,\n        TF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm import (\n        TFXLMPreTrainedModel,\n        TFXLMMainLayer,\n        TFXLMModel,\n        TFXLMWithLMHeadModel,\n        TFXLMForSequenceClassification,\n        TFXLMForQuestionAnsweringSimple,\n        TF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm_roberta import (\n        TFXLMRobertaForMaskedLM,\n        TFXLMRobertaModel,\n        TFXLMRobertaForSequenceClassification,\n        TFXLMRobertaForTokenClassification,\n        TF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_roberta import (\n        TFRobertaPreTrainedModel,\n        TFRobertaMainLayer,\n        TFRobertaModel,\n        TFRobertaForMaskedLM,\n        TFRobertaForSequenceClassification,\n        TFRobertaForTokenClassification,\n        TFRobertaForQuestionAnswering,\n        TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_camembert import (\n        TFCamembertModel,\n        TFCamembertForMaskedLM,\n        TFCamembertForSequenceClassification,\n        TFCamembertForTokenClassification,\n        TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_flaubert import (\n        TFFlaubertModel,\n        TFFlaubertWithLMHeadModel,\n        TFFlaubertForSequenceClassification,\n        TF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_distilbert import (\n        TFDistilBertPreTrainedModel,\n        TFDistilBertMainLayer,\n        TFDistilBertModel,\n        TFDistilBertForMaskedLM,\n        TFDistilBertForSequenceClassification,\n        TFDistilBertForTokenClassification,\n        TFDistilBertForQuestionAnswering,\n        TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_ctrl import (\n        TFCTRLPreTrainedModel,\n        TFCTRLModel,\n        TFCTRLLMHeadModel,\n        TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_albert import (\n        TFAlbertPreTrainedModel,\n        TFAlbertMainLayer,\n        TFAlbertModel,\n        TFAlbertForPreTraining,\n        TFAlbertForMaskedLM,\n        TFAlbertForMultipleChoice,\n        TFAlbertForSequenceClassification,\n        TFAlbertForQuestionAnswering,\n        TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_t5 import (\n        TFT5PreTrainedModel,\n        TFT5Model,\n        TFT5ForConditionalGeneration,\n        TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_electra 
import (\n        TFElectraPreTrainedModel,\n        TFElectraModel,\n        TFElectraForPreTraining,\n        TFElectraForMaskedLM,\n        TFElectraForTokenClassification,\n        TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization_tf import WarmUp, create_optimizer, AdamWeightDecay, GradientAccumulator\n\n    # Trainer\n    from .trainer_tf import TFTrainer\n\n\nif not is_tf_available() and not is_torch_available():\n    logger.warning(\n        \"Neither PyTorch nor TensorFlow >= 2.0 have been found. \"\n        \"Models won't be available and only tokenizers, configuration \"\n        \"and file/data utilities can be used.\"\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/__main__.py",
    "content": "# coding: utf8\ndef main():\n    import sys\n    if (len(sys.argv) < 4 or len(sys.argv) > 6) or sys.argv[1] not in [\"bert\", \"gpt\", \"transfo_xl\", \"gpt2\", \"xlnet\", \"xlm\"]:\n        print(\n        \"This command line utility let you convert original (author released) model checkpoint to pytorch.\\n\"\n        \"It should be used as one of: \\n\"\n        \">> transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \\n\"\n        \">> transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG], \\n\"\n        \">> transformers1 transfo_xl TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG] or \\n\"\n        \">> transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [GPT2_CONFIG] or \\n\"\n        \">> transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME] or \\n\"\n        \">> transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT\")\n    else:\n        if sys.argv[1] == \"bert\":\n            try:\n                from .convert_bert_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) != 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`\")\n            else:\n                PYTORCH_DUMP_OUTPUT = sys.argv.pop()\n                TF_CONFIG = sys.argv.pop()\n                TF_CHECKPOINT = sys.argv.pop()\n                convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"gpt\":\n            from .convert_openai_original_tf_checkpoint_to_pytorch import convert_openai_checkpoint_to_pytorch\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG]`\")\n            else:\n                OPENAI_GPT_CHECKPOINT_FOLDER_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    OPENAI_GPT_CONFIG = sys.argv[4]\n                else:\n                    OPENAI_GPT_CONFIG = \"\"\n                convert_openai_checkpoint_to_pytorch(OPENAI_GPT_CHECKPOINT_FOLDER_PATH,\n                                                    OPENAI_GPT_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"transfo_xl\":\n            try:\n                from .convert_transfo_xl_original_tf_checkpoint_to_pytorch import convert_transfo_xl_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 transfo_xl TF_CHECKPOINT/TF_DATASET_FILE PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                if 'ckpt' in sys.argv[2].lower():\n                    TF_CHECKPOINT = sys.argv[2]\n                    TF_DATASET_FILE = \"\"\n                else:\n                    TF_DATASET_FILE = sys.argv[2]\n                    TF_CHECKPOINT = \"\"\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_transfo_xl_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT, TF_DATASET_FILE)\n        elif sys.argv[1] == \"gpt2\":\n            try:\n                from .convert_gpt2_original_tf_checkpoint_to_pytorch import convert_gpt2_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_gpt2_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"xlnet\":\n            try:\n                from .convert_xlnet_original_tf_checkpoint_to_pytorch import convert_xlnet_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 5 or len(sys.argv) > 6:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                TF_CONFIG = sys.argv[3]\n                PYTORCH_DUMP_OUTPUT = sys.argv[4]\n                if len(sys.argv) == 6:\n                    FINETUNING_TASK = sys.argv[5]\n                else:\n                    FINETUNING_TASK = None\n\n                convert_xlnet_checkpoint_to_pytorch(TF_CHECKPOINT,\n                                                    TF_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT,\n                                                    FINETUNING_TASK)\n        elif sys.argv[1] == \"xlm\":\n            from .convert_xlm_original_pytorch_checkpoint_to_pytorch import convert_xlm_checkpoint_to_pytorch\n\n            if len(sys.argv) != 4:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT`\")\n            else:\n                XLM_CHECKPOINT_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n\n                convert_xlm_checkpoint_to_pytorch(XLM_CHECKPOINT_PATH, PYTORCH_DUMP_OUTPUT)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/activations.py",
    "content": "import logging\nimport math\n\nimport torch\nimport torch.nn.functional as F\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef swish(x):\n    return x * torch.sigmoid(x)\n\n\ndef _gelu_python(x):\n    \"\"\" Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        This is now written in C in torch.nn.functional\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))\n\n\ndef gelu_new(x):\n    \"\"\" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))\n\n\nif torch.__version__ < \"1.4.0\":\n    gelu = _gelu_python\nelse:\n    gelu = F.gelu\n\n\ndef gelu_fast(x):\n    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))\n\n\nACT2FN = {\n    \"relu\": F.relu,\n    \"swish\": swish,\n    \"gelu\": gelu,\n    \"tanh\": torch.tanh,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n}\n\n\ndef get_activation(activation_string):\n    if activation_string in ACT2FN:\n        return ACT2FN[activation_string]\n    else:\n        raise KeyError(\"function {} not found in ACT2FN mapping {}\".format(activation_string, list(ACT2FN.keys())))\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/another_try.py",
    "content": "from transformers import TFBertModel, BertTokenizer, BertConfig\nimport tensorflow as tf\n\nconfig = BertConfig.from_pretrained(\"bert-base-cased\", output_hidden_states=True)\nmodel = TFBertModel.from_pretrained(\"bert-base-cased\", config=config)\n\ntok = BertTokenizer.from_pretrained(\"bert-base-cased\")\ntext = tok.encode(\"Ain't this [MASK] best thing you've ever seen?\")\n\ninputs = tf.constant(text)\noutputs = model.predict(inputs)\n\nprint(outputs)"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/benchmark/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom ..file_utils import is_torch_available\n\n\nif is_torch_available():\n    from .benchmark_args import PyTorchBenchmarkArguments\n    from .benchmark import PyTorchBenchmark\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/benchmark/benchmark.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\n    Benchmarking the library on inference and training in PyTorch.\n\"\"\"\n\n\nimport inspect\nimport logging\nimport timeit\n\nfrom transformers import MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, PretrainedConfig, is_torch_available\n\nfrom .benchmark_utils import Benchmark, Memory, start_memory_tracing, stop_memory_tracing\n\n\nif is_torch_available():\n    import torch\n    from .benchmark_args import PyTorchBenchmarkArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PyTorchBenchmark(Benchmark):\n\n    args: PyTorchBenchmarkArguments\n    configs: PretrainedConfig\n    framework: str = \"PyTorch\"\n\n    @property\n    def framework_version(self):\n        return torch.__version__\n\n    def train(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_WITH_LM_HEAD_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.train()\n\n            input_ids = torch.randint(\n                model.config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n\n            def compute_loss_and_backprob():\n                # TODO: Not all models call labels argument labels => this hack using the function signature should be corrected once all models have a common name for labels\n                function_argument_names = inspect.getfullargspec(model.forward).args\n                if \"labels\" in function_argument_names:\n                    loss = model(input_ids, labels=input_ids)[0]\n                elif \"lm_labels\" in function_argument_names:\n                    loss = model(input_ids, lm_labels=input_ids)[0]\n                elif \"masked_lm_labels\" in function_argument_names:\n                    loss = model(input_ids, masked_lm_labels=input_ids)[0]\n                else:\n                    NotImplementedError(f\"{model_name} does not seem to allow training with labels\")\n\n                loss.backward()\n                model.zero_grad()\n\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    torch.cuda.reset_peak_memory_stats()\n\n                # calculate loss and do backpropagation\n                compute_loss_and_backprob()\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    memory = Memory(torch.cuda.max_memory_reserved())\n\n                
return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: compute_loss_and_backprob(), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n\n    def inference(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.eval()\n\n            input_ids = torch.randint(\n                config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        torch.cuda.reset_peak_memory_stats()\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        torch.cuda.reset_max_memory_cached()\n\n                model(input_ids)\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        memory = Memory(torch.cuda.max_memory_reserved())\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        memory = Memory(torch.cuda.max_memory_cached())\n\n                return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: model(input_ids), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/benchmark/benchmark_args.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom ..file_utils import cached_property, is_torch_available, torch_required\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    import torch\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass PyTorchBenchmarkArguments(BenchmarkArguments):\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Whether to run on available cuda devices\"})\n    torchscript: bool = field(default=False, metadata={\"help\": \"Trace the models using torchscript\"})\n    fp16: bool = field(default=False, metadata={\"help\": \"Use FP16 to accelerate inference.\"})\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        else:\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device_idx(self) -> int:\n        return torch.cuda.current_device()\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/benchmark/benchmark_args_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport dataclasses\nimport json\nfrom dataclasses import dataclass, field\nfrom time import time\nfrom typing import List\n\n\ndef list_field(default=None, metadata=None):\n    return field(default_factory=lambda: default, metadata=metadata)\n\n\n@dataclass\nclass BenchmarkArguments:\n    \"\"\"\n    BenchMarkArguments are arguments we use in our benchmark scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    models: List[str] = list_field(\n        default=[],\n        metadata={\n            \"help\": \"Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models\"\n        },\n    )\n\n    batch_sizes: List[int] = list_field(\n        default=[8], metadata={\"help\": \"List of batch sizes for which memory and time performance will be evaluated\"}\n    )\n\n    sequence_lengths: List[int] = list_field(\n        default=[8, 32, 128, 512],\n        metadata={\"help\": \"List of sequence lengths for which memory and time performance will be evaluated\"},\n    )\n\n    no_inference: bool = field(default=False, metadata={\"help\": \"Don't benchmark inference of model\"})\n    training: bool = field(default=False, metadata={\"help\": \"Benchmark training of model\"})\n    verbose: bool = field(default=False, metadata={\"help\": \"Verbose memory tracing\"})\n    no_speed: bool = field(default=False, metadata={\"help\": \"Don't perform speed measurments\"})\n    no_memory: bool = field(default=False, metadata={\"help\": \"Don't perform memory measurments\"})\n    trace_memory_line_by_line: bool = field(default=False, metadata={\"help\": \"Trace memory line by line\"})\n    save_to_csv: bool = field(default=False, metadata={\"help\": \"Save result to a CSV file\"})\n    log_print: bool = field(default=False, metadata={\"help\": \"Save all print statements in a log file\"})\n    no_env_print: bool = field(default=False, metadata={\"help\": \"Don't print environment information\"})\n    inference_time_csv_file: str = field(\n        default=f\"inference_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv.\"},\n    )\n    inference_memory_csv_file: str = field(\n        default=f\"inference_memory_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving memory results to csv.\"},\n    )\n    train_time_csv_file: str = field(\n        default=f\"train_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv for training.\"},\n    )\n    train_memory_csv_file: str = field(\n        default=f\"train_memory_{round(time())}.csv\",\n        metadata={\"help\": 
\"CSV filename used if saving memory results to csv for training.\"},\n    )\n    env_info_csv_file: str = field(\n        default=f\"env_info_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving environment information.\"},\n    )\n    log_filename: str = field(\n        default=f\"log_{round(time())}.csv\",\n        metadata={\"help\": \"Log filename used if print statements are saved in log.\"},\n    )\n    repeat: int = field(default=3, metadata={\"help\": \"Times an experiment will be run.\"})\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    @property\n    def model_names(self):\n        return self.models\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/benchmark/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport copy\nimport csv\nimport linecache\nimport logging\nimport os\nimport platform\nimport sys\nfrom abc import ABC, abstractmethod\nfrom collections import defaultdict, namedtuple\nfrom datetime import datetime\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom transformers import AutoConfig, PretrainedConfig\nfrom transformers import __version__ as version\n\nfrom ..file_utils import is_tf_available, is_torch_available\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\n\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\nBenchmarkOutput = namedtuple(\n    \"BenchmarkOutput\", [\"time_inference_result\", \"memory_inference_result\", \"time_train_result\", \"memory_train_result\"]\n)\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable str of the number of mega bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return str(bytes_to_mega_bytes(self.bytes))\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` 
namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    current: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. 
Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. \"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. 
\" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n\n            for i in devices:\n                handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - 
`memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        memory_curr_trace = []\n\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n\n        for ((frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem),) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n\n            memory_curr_trace.append(\n                MemoryState(\n                    frame=frame,\n                    cpu=Memory(next_cpu_mem),\n                    gpu=Memory(next_gpu_mem),\n                    cpu_gpu=Memory(next_gpu_mem + next_cpu_mem),\n                )\n            )\n\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n   
         cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        memory_curr_trace = sorted(memory_curr_trace, key=lambda x: x.cpu_gpu.bytes, reverse=True)\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n\n        total_memory = Memory(total_memory)\n\n        return MemorySummary(\n            sequential=memory_diff_trace, cumulative=cumulative_memory, current=memory_curr_trace, total=total_memory,\n        )\n\n    return None\n\n\ndef bytes_to_mega_bytes(memory_amount: int) -> int:\n    \"\"\" Utility to convert a number of bytes (int) into a number of mega bytes (int)\n    \"\"\"\n    return memory_amount >> 20\n\n\nclass Benchmark(ABC):\n    \"\"\"\n    Benchmarks is a simple but feature-complete benchmarking script\n    to compare memory and time performance of models in Transformers.\n    \"\"\"\n\n    args: BenchmarkArguments\n    configs: PretrainedConfig\n    framework: str\n\n    def __init__(self, args: BenchmarkArguments = None, configs: PretrainedConfig = None):\n        self.args = args\n\n        if configs is None:\n            self.config_dict = {\n                model_name: AutoConfig.from_pretrained(model_name) for model_name in self.args.model_names\n            }\n        else:\n            self.config_dict = {model_name: config for model_name, config in zip(self.args.model_names, configs)}\n\n        self._print_fn = None\n        self._framework_version = None\n        self._environment_info = None\n\n    @property\n    def print_fn(self):\n        if self._print_fn is None:\n            if self.args.log_print:\n                logging.basicConfig(\n                    level=logging.DEBUG,\n                    filename=self.args.log_filename,\n                    filemode=\"a+\",\n                    format=\"%(asctime)-15s %(levelname)-8s %(message)s\",\n                )\n\n                def print_and_log(*args):\n                    logging.info(*args)\n                    print(*args)\n\n                self._print_fn = print_and_log\n            else:\n                self._print_fn = print\n        return self._print_fn\n\n    @property\n    def is_gpu(self):\n        return self.args.n_gpu > 0\n\n    @property\n    @abstractmethod\n    def framework_version(self):\n        pass\n\n    @abstractmethod\n    def train(self, model_name, batch_size, sequence_length):\n        pass\n\n    @abstractmethod\n    def inference(self, model_name, batch_size, sequence_length):\n        pass\n\n    def run(self):\n        result_dict = {model_name: {} for model_name in self.args.model_names}\n        inference_result_time = copy.deepcopy(result_dict)\n        inference_result_memory = copy.deepcopy(result_dict)\n        train_result_time = copy.deepcopy(result_dict)\n        train_result_memory = copy.deepcopy(result_dict)\n\n        for c, 
model_name in enumerate(self.args.model_names):\n            self.print_fn(f\"{c + 1} / {len(self.args.model_names)}\")\n\n            model_dict = {\n                \"bs\": self.args.batch_sizes,\n                \"ss\": self.args.sequence_lengths,\n                \"result\": {i: {} for i in self.args.batch_sizes},\n            }\n            inference_result_time[model_name] = copy.deepcopy(model_dict)\n            inference_result_memory[model_name] = copy.deepcopy(model_dict)\n            train_result_time[model_name] = copy.deepcopy(model_dict)\n            train_result_memory[model_name] = copy.deepcopy(model_dict)\n\n            for batch_size in self.args.batch_sizes:\n                for sequence_length in self.args.sequence_lengths:\n                    if not self.args.no_inference:\n                        if not self.args.no_memory:\n                            memory = self.inference(model_name, batch_size, sequence_length, trace_memory=True)\n                            inference_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            inference_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n                    if self.args.training:\n                        if not self.args.no_memory:\n                            memory = self.train(model_name, batch_size, sequence_length, trace_memory=True)\n                            train_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            train_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n        if not self.args.no_inference:\n            if not self.args.no_speed:\n                self.print_fn(\"======= INFERENCE - SPEED - RESULT =======\")\n                self.print_results(inference_result_time)\n                self.save_to_csv(inference_result_time, self.args.inference_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= INFERENCE - MEMORY - RESULT =======\")\n                self.print_results(inference_result_memory)\n                self.save_to_csv(inference_result_memory, self.args.inference_memory_csv_file)\n\n        if self.args.training:\n            if not self.args.no_speed:\n                self.print_fn(\"======= TRAIN - SPEED - RESULT =======\")\n                self.print_results(train_result_time)\n                self.save_to_csv(train_result_time, self.args.train_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= TRAIN - MEMORY - RESULT =======\")\n                self.print_results(train_result_memory)\n                self.save_to_csv(train_result_memory, self.args.train_memory_csv_file)\n\n        if not self.args.no_env_print:\n            self.print_fn(\"\\n======== ENVIRONMENT - INFORMATION ========\")\n            self.print_fn(\n                \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in self.environment_info.items()]) + \"\\n\"\n            )\n\n        if self.args.save_to_csv:\n            with open(self.args.env_info_csv_file, mode=\"w\", newline=\"\") as csv_file:\n                writer = 
csv.writer(csv_file)\n                for key, value in self.environment_info.items():\n                    writer.writerow([key, value])\n\n        return BenchmarkOutput(inference_result_time, inference_result_memory, train_result_time, train_result_memory)\n\n    @property\n    def environment_info(self):\n        if self._environment_info is None:\n            info = {}\n            info[\"transformers_version\"] = version\n            info[\"framework\"] = self.framework\n            info[\"framework_version\"] = self.framework_version\n            info[\"python_version\"] = platform.python_version()\n            info[\"system\"] = platform.system()\n            info[\"cpu\"] = platform.processor()\n            info[\"architecture\"] = platform.architecture()[0]\n            info[\"date\"] = datetime.date(datetime.now())\n            info[\"time\"] = datetime.time(datetime.now())\n\n            try:\n                import psutil\n            except (ImportError):\n                logger.warning(\n                    \"Psutil not installed, we won't log available CPU memory.\"\n                    \"Install psutil (pip install psutil) to log available CPU memory.\"\n                )\n                info[\"cpu_ram_mb\"] = \"N/A\"\n            else:\n                info[\"cpu_ram_mb\"] = bytes_to_mega_bytes(psutil.virtual_memory().total)\n\n            info[\"use_gpu\"] = self.is_gpu\n            if self.is_gpu:\n                info[\"num_gpus\"] = self.args.n_gpu\n                try:\n                    from py3nvml import py3nvml\n\n                    py3nvml.nvmlInit()\n                    handle = py3nvml.nvmlDeviceGetHandleByIndex(self.args.device_idx)\n                except ImportError:\n                    logger.warning(\n                        \"py3nvml not installed, we won't log GPU memory usage. \"\n                        \"Install py3nvml (pip install py3nvml) to log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                except (OSError, py3nvml.NVMLError):\n                    logger.warning(\n                        \"Error while initializing comunication with GPU. 
\" \"We won't log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                    py3nvml.nvmlShutdown()\n                else:\n                    info[\"gpu\"] = py3nvml.nvmlDeviceGetName(handle)\n                    info[\"gpu_ram_mb\"] = bytes_to_mega_bytes(py3nvml.nvmlDeviceGetMemoryInfo(handle).total)\n                    info[\"gpu_power_watts\"] = py3nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000\n                    info[\"gpu_performance_state\"] = py3nvml.nvmlDeviceGetPerformanceState(handle)\n                    py3nvml.nvmlShutdown()\n\n            self._environment_info = info\n        return self._environment_info\n\n    def print_results(self, result_dict):\n        for model_name in self.args.model_names:\n            self.print_fn(\"\\t\" + f\"======= MODEL CHECKPOINT: {model_name} =======\")\n            for batch_size in result_dict[model_name][\"bs\"]:\n                for sequence_length in result_dict[model_name][\"ss\"]:\n                    result = result_dict[model_name][\"result\"][batch_size][sequence_length]\n                    if isinstance(result, float):\n                        self.print_fn(\n                            f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{(round(1000 * result) / 1000)}s\"\n                        )\n                    else:\n                        self.print_fn(f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{result} MB\")\n\n    def print_memory_trace_statistics(self, summary: MemorySummary):\n        self.print_fn(\n            \"\\nLine by line memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.sequential\n            )\n        )\n        self.print_fn(\n            \"\\nLines with top memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[:6]\n            )\n        )\n        self.print_fn(\n            \"\\nLines with lowest memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[-6:]\n            )\n        )\n        self.print_fn(f\"\\nTotal memory increase: {summary.total}\")\n\n    def save_to_csv(self, result_dict, filename):\n        if not self.args.save_to_csv:\n            return\n        self.print_fn(\"Saving results to csv.\")\n        with open(filename, mode=\"w\") as csv_file:\n\n            assert len(self.args.model_names) > 0, \"At least 1 model should be defined, but got {}\".format(\n                self.model_names\n            )\n\n            fieldnames = [\"model\", \"batch_size\", \"sequence_length\"]\n            writer = csv.DictWriter(csv_file, fieldnames=fieldnames + [\"result\"])\n            writer.writeheader()\n\n            for model_name in self.args.model_names:\n                result_dict_model = result_dict[model_name][\"result\"]\n                for bs in result_dict_model:\n                    for ss in result_dict_model[bs]:\n               
         result_model = result_dict_model[bs][ss]\n                        writer.writerow(\n                            {\n                                \"model\": model_name,\n                                \"batch_size\": bs,\n                                \"sequence_length\": ss,\n                                \"result\": (\"{}\" if not isinstance(result_model, float) else \"{:.4f}\").format(\n                                    result_model\n                                ),\n                            }\n                        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport linecache\nimport logging\nimport os\nimport sys\nfrom collections import defaultdict\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom .file_utils import is_tf_available, is_torch_available\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable string of the number of bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return bytes_to_human_readable(self.bytes)\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest 
memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. 
\"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. \" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n            for i in devices:\n                
handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - `memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n        for (frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = 
next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n            cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n        total_memory = Memory(total_memory)\n        return MemorySummary(sequential=memory_diff_trace, cumulative=cumulative_memory, total=total_memory)\n\n    return None\n\n\ndef bytes_to_human_readable(memory_amount: int) -> str:\n    \"\"\" Utility to convert a number of bytes (int) in a human readable string (with units)\n    \"\"\"\n    for unit in [\"B\", \"KB\", \"MB\", \"GB\"]:\n        if memory_amount > -1024.0 and memory_amount < 1024.0:\n            return \"{:.3f}{}\".format(memory_amount, unit)\n        memory_amount /= 1024.0\n    return \"{:.3f}TB\".format(memory_amount)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/__init__.py",
    "content": "from abc import ABC, abstractmethod\nfrom argparse import ArgumentParser\n\n\nclass BaseTransformersCLICommand(ABC):\n    @staticmethod\n    @abstractmethod\n    def register_subcommand(parser: ArgumentParser):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def run(self):\n        raise NotImplementedError()\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/convert.py",
    "content": "from argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef convert_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to convert a model TF 1.0 checkpoint in a PyTorch checkpoint.\n    :return: ServeCommand\n    \"\"\"\n    return ConvertCommand(\n        args.model_type, args.tf_checkpoint, args.pytorch_dump_output, args.config, args.finetuning_task_name\n    )\n\n\nclass ConvertCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\n            \"convert\",\n            help=\"CLI tool to run convert model from original \"\n            \"author checkpoints to Transformers PyTorch checkpoints.\",\n        )\n        train_parser.add_argument(\"--model_type\", type=str, required=True, help=\"Model's type.\")\n        train_parser.add_argument(\n            \"--tf_checkpoint\", type=str, required=True, help=\"TensorFlow checkpoint path or folder.\"\n        )\n        train_parser.add_argument(\n            \"--pytorch_dump_output\", type=str, required=True, help=\"Path to the PyTorch savd model output.\"\n        )\n        train_parser.add_argument(\"--config\", type=str, default=\"\", help=\"Configuration file path or folder.\")\n        train_parser.add_argument(\n            \"--finetuning_task_name\",\n            type=str,\n            default=None,\n            help=\"Optional fine-tuning task name if the TF model was a finetuned model.\",\n        )\n        train_parser.set_defaults(func=convert_command_factory)\n\n    def __init__(\n        self,\n        model_type: str,\n        tf_checkpoint: str,\n        pytorch_dump_output: str,\n        config: str,\n        finetuning_task_name: str,\n        *args\n    ):\n        self._logger = getLogger(\"transformers1-cli/converting\")\n\n        self._logger.info(\"Loading model {}\".format(model_type))\n        self._model_type = model_type\n        self._tf_checkpoint = tf_checkpoint\n        self._pytorch_dump_output = pytorch_dump_output\n        self._config = config\n        self._finetuning_task_name = finetuning_task_name\n\n    def run(self):\n        if self._model_type == \"albert\":\n            try:\n                from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"bert\":\n            try:\n                from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"gpt\":\n            from transformers.convert_openai_original_tf_checkpoint_to_pytorch import (\n                convert_openai_checkpoint_to_pytorch,\n            )\n\n            convert_openai_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"transfo_xl\":\n            try:\n                from transformers.convert_transfo_xl_original_tf_checkpoint_to_pytorch import (\n                    convert_transfo_xl_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            if \"ckpt\" in self._tf_checkpoint.lower():\n                TF_CHECKPOINT = self._tf_checkpoint\n                TF_DATASET_FILE = \"\"\n            else:\n                TF_DATASET_FILE = self._tf_checkpoint\n                TF_CHECKPOINT = \"\"\n            convert_transfo_xl_checkpoint_to_pytorch(\n                TF_CHECKPOINT, self._config, self._pytorch_dump_output, TF_DATASET_FILE\n            )\n        elif self._model_type == \"gpt2\":\n            try:\n                from transformers.convert_gpt2_original_tf_checkpoint_to_pytorch import (\n                    convert_gpt2_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_gpt2_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"xlnet\":\n            try:\n                from transformers.convert_xlnet_original_tf_checkpoint_to_pytorch import (\n                    convert_xlnet_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_xlnet_checkpoint_to_pytorch(\n                self._tf_checkpoint, self._config, self._pytorch_dump_output, self._finetuning_task_name\n            )\n        elif self._model_type == \"xlm\":\n            from transformers.convert_xlm_original_pytorch_checkpoint_to_pytorch import (\n                convert_xlm_checkpoint_to_pytorch,\n            )\n\n            convert_xlm_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)\n        else:\n            raise ValueError(\"--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm]\")\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/download.py",
    "content": "from argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef download_command_factory(args):\n    return DownloadCommand(args.model, args.cache_dir, args.force)\n\n\nclass DownloadCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"download\")\n        download_parser.add_argument(\n            \"--cache-dir\", type=str, default=None, help=\"Path to location to store the models\"\n        )\n        download_parser.add_argument(\n            \"--force\", action=\"store_true\", help=\"Force the model to be download even if already in cache-dir\"\n        )\n        download_parser.add_argument(\"model\", type=str, help=\"Name of the model to download\")\n        download_parser.set_defaults(func=download_command_factory)\n\n    def __init__(self, model: str, cache: str, force: bool):\n        self._model = model\n        self._cache = cache\n        self._force = force\n\n    def run(self):\n        from transformers import AutoModel, AutoTokenizer\n\n        AutoModel.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n        AutoTokenizer.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/env.py",
    "content": "import platform\nfrom argparse import ArgumentParser\n\nfrom transformers import __version__ as version\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef info_command_factory(_):\n    return EnvironmentCommand()\n\n\nclass EnvironmentCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"env\")\n        download_parser.set_defaults(func=info_command_factory)\n\n    def run(self):\n        pt_version = \"not installed\"\n        pt_cuda_available = \"NA\"\n        if is_torch_available():\n            import torch\n\n            pt_version = torch.__version__\n            pt_cuda_available = torch.cuda.is_available()\n\n        tf_version = \"not installed\"\n        tf_cuda_available = \"NA\"\n        if is_tf_available():\n            import tensorflow as tf\n\n            tf_version = tf.__version__\n            try:\n                # deprecated in v2.1\n                tf_cuda_available = tf.test.is_gpu_available()\n            except AttributeError:\n                # returns list of devices, convert to bool\n                tf_cuda_available = bool(tf.config.list_physical_devices(\"GPU\"))\n\n        info = {\n            \"`transformers1` version\": version,\n            \"Platform\": platform.platform(),\n            \"Python version\": platform.python_version(),\n            \"PyTorch version (GPU?)\": \"{} ({})\".format(pt_version, pt_cuda_available),\n            \"Tensorflow version (GPU?)\": \"{} ({})\".format(tf_version, tf_cuda_available),\n            \"Using GPU in script?\": \"<fill in>\",\n            \"Using distributed or parallel set-up in script?\": \"<fill in>\",\n        }\n\n        print(\"\\nCopy-and-paste the text below in your GitHub issue and FILL OUT the two last points.\\n\")\n        print(self.format_dict(info))\n\n        return info\n\n    @staticmethod\n    def format_dict(d):\n        return \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in d.items()]) + \"\\n\"\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/run.py",
    "content": "import logging\nfrom argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, Pipeline, PipelineDataFormat, pipeline\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\ndef try_infer_format_from_ext(path: str):\n    if not path:\n        return \"pipe\"\n\n    for ext in PipelineDataFormat.SUPPORTED_FORMATS:\n        if path.endswith(ext):\n            return ext\n\n    raise Exception(\n        \"Unable to determine file format from file extension {}. \"\n        \"Please provide the format through --format {}\".format(path, PipelineDataFormat.SUPPORTED_FORMATS)\n    )\n\n\ndef run_command_factory(args):\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    format = try_infer_format_from_ext(args.input) if args.format == \"infer\" else args.format\n    reader = PipelineDataFormat.from_str(\n        format=format,\n        output_path=args.output,\n        input_path=args.input,\n        column=args.column if args.column else nlp.default_input_names,\n        overwrite=args.overwrite,\n    )\n    return RunCommand(nlp, reader)\n\n\nclass RunCommand(BaseTransformersCLICommand):\n    def __init__(self, nlp: Pipeline, reader: PipelineDataFormat):\n        self._nlp = nlp\n        self._reader = reader\n\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        run_parser = parser.add_parser(\"run\", help=\"Run a pipeline through the CLI\")\n        run_parser.add_argument(\"--task\", choices=SUPPORTED_TASKS.keys(), help=\"Task to run\")\n        run_parser.add_argument(\"--input\", type=str, help=\"Path to the file to use for inference\")\n        run_parser.add_argument(\"--output\", type=str, help=\"Path to the file that will be used post to write results.\")\n        run_parser.add_argument(\"--model\", type=str, help=\"Name or path to the model to instantiate.\")\n        run_parser.add_argument(\"--config\", type=str, help=\"Name or path to the model's config to instantiate.\")\n        run_parser.add_argument(\n            \"--tokenizer\", type=str, help=\"Name of the tokenizer to use. (default: same as the model name)\"\n        )\n        run_parser.add_argument(\n            \"--column\",\n            type=str,\n            help=\"Name of the column to use as input. 
(For multi columns input as QA use column1,columns2)\",\n        )\n        run_parser.add_argument(\n            \"--format\",\n            type=str,\n            default=\"infer\",\n            choices=PipelineDataFormat.SUPPORTED_FORMATS,\n            help=\"Input format to read from\",\n        )\n        run_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        run_parser.add_argument(\"--overwrite\", action=\"store_true\", help=\"Allow overwriting the output file.\")\n        run_parser.set_defaults(func=run_command_factory)\n\n    def run(self):\n        nlp, outputs = self._nlp, []\n\n        for entry in self._reader:\n            output = nlp(**entry) if self._reader.is_multi_columns else nlp(entry)\n            if isinstance(output, dict):\n                outputs.append(output)\n            else:\n                outputs += output\n\n        # Saving data\n        if self._nlp.binary_output:\n            binary_path = self._reader.save_binary(outputs)\n            logger.warning(\"Current pipeline requires output to be in binary format, saving at {}\".format(binary_path))\n        else:\n            self._reader.save(outputs)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/serving.py",
    "content": "import logging\nfrom argparse import ArgumentParser, Namespace\nfrom typing import Any, List, Optional\n\nfrom transformers import Pipeline\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, pipeline\n\n\ntry:\n    from uvicorn import run\n    from fastapi import FastAPI, HTTPException, Body\n    from fastapi.routing import APIRoute\n    from pydantic import BaseModel\n    from starlette.responses import JSONResponse\n\n    _serve_dependencies_installed = True\nexcept (ImportError, AttributeError):\n    BaseModel = object\n\n    def Body(*x, **y):\n        pass\n\n    _serve_dependencies_installed = False\n\n\nlogger = logging.getLogger(\"transformers1-cli/serving\")\n\n\ndef serve_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    return ServeCommand(nlp, args.host, args.port, args.workers)\n\n\nclass ServeModelInfoResult(BaseModel):\n    \"\"\"\n    Expose model information\n    \"\"\"\n\n    infos: dict\n\n\nclass ServeTokenizeResult(BaseModel):\n    \"\"\"\n    Tokenize result model\n    \"\"\"\n\n    tokens: List[str]\n    tokens_ids: Optional[List[int]]\n\n\nclass ServeDeTokenizeResult(BaseModel):\n    \"\"\"\n    DeTokenize result model\n    \"\"\"\n\n    text: str\n\n\nclass ServeForwardResult(BaseModel):\n    \"\"\"\n    Forward result model\n    \"\"\"\n\n    output: Any\n\n\nclass ServeCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        serve_parser = parser.add_parser(\n            \"serve\", help=\"CLI tool to run inference requests through REST and GraphQL endpoints.\"\n        )\n        serve_parser.add_argument(\n            \"--task\", type=str, choices=SUPPORTED_TASKS.keys(), help=\"The task to run the pipeline on\"\n        )\n        serve_parser.add_argument(\"--host\", type=str, default=\"localhost\", help=\"Interface the server will listen on.\")\n        serve_parser.add_argument(\"--port\", type=int, default=8888, help=\"Port the serving will listen to.\")\n        serve_parser.add_argument(\"--workers\", type=int, default=1, help=\"Number of http workers\")\n        serve_parser.add_argument(\"--model\", type=str, help=\"Model's name or path to stored model.\")\n        serve_parser.add_argument(\"--config\", type=str, help=\"Model's config name or path to stored model.\")\n        serve_parser.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer name to use.\")\n        serve_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        serve_parser.set_defaults(func=serve_command_factory)\n\n    def __init__(self, pipeline: Pipeline, host: str, port: int, workers: int):\n\n        self._pipeline = pipeline\n\n        self.host = host\n        self.port = port\n        self.workers = workers\n\n        if not 
_serve_dependencies_installed:\n            raise RuntimeError(\n                \"Using serve command requires FastAPI and unicorn. \"\n                'Please install transformers1 with [serving]: pip install \"transformers1[serving]\".'\n                \"Or install FastAPI and unicorn separately.\"\n            )\n        else:\n            logger.info(\"Serving model over {}:{}\".format(host, port))\n            self._app = FastAPI(\n                routes=[\n                    APIRoute(\n                        \"/\",\n                        self.model_info,\n                        response_model=ServeModelInfoResult,\n                        response_class=JSONResponse,\n                        methods=[\"GET\"],\n                    ),\n                    APIRoute(\n                        \"/tokenize\",\n                        self.tokenize,\n                        response_model=ServeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/detokenize\",\n                        self.detokenize,\n                        response_model=ServeDeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/forward\",\n                        self.forward,\n                        response_model=ServeForwardResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                ],\n                timeout=600,\n            )\n\n    def run(self):\n        run(self._app, host=self.host, port=self.port, workers=self.workers)\n\n    def model_info(self):\n        return ServeModelInfoResult(infos=vars(self._pipeline.model.config))\n\n    def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):\n        \"\"\"\n        Tokenize the provided input and eventually returns corresponding tokens id:\n        - **text_input**: String to tokenize\n        - **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer mapping.\n        \"\"\"\n        try:\n            tokens_txt = self._pipeline.tokenizer.tokenize(text_input)\n\n            if return_ids:\n                tokens_ids = self._pipeline.tokenizer.convert_tokens_to_ids(tokens_txt)\n                return ServeTokenizeResult(tokens=tokens_txt, tokens_ids=tokens_ids)\n            else:\n                return ServeTokenizeResult(tokens=tokens_txt)\n\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    def detokenize(\n        self,\n        tokens_ids: List[int] = Body(None, embed=True),\n        skip_special_tokens: bool = Body(False, embed=True),\n        cleanup_tokenization_spaces: bool = Body(True, embed=True),\n    ):\n        \"\"\"\n        Detokenize the provided tokens ids to readable text:\n        - **tokens_ids**: List of tokens ids\n        - **skip_special_tokens**: Flag indicating to not try to decode special tokens\n        - **cleanup_tokenization_spaces**: Flag indicating to remove all leading/trailing spaces and intermediate ones.\n        \"\"\"\n        try:\n            decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)\n            return 
ServeDeTokenizeResult(model=\"\", text=decoded_str)\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    async def forward(self, inputs=Body(None, embed=True)):\n        \"\"\"\n        **inputs**:\n        **attention_mask**:\n        **tokens_type_ids**:\n        \"\"\"\n\n        # Check we don't have empty string\n        if len(inputs) == 0:\n            return ServeForwardResult(output=[], attention=[])\n\n        try:\n            # Forward through the model\n            output = self._pipeline(inputs)\n            return ServeForwardResult(output=output)\n        except Exception as e:\n            raise HTTPException(500, {\"error\": str(e)})\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/train.py",
    "content": "import os\nfrom argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers import SingleSentenceClassificationProcessor as Processor\nfrom transformers import TextClassificationPipeline, is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\nif not is_tf_available() and not is_torch_available():\n    raise RuntimeError(\"At least one of PyTorch or TensorFlow 2.0+ should be installed to use CLI training\")\n\n# TF training parameters\nUSE_XLA = False\nUSE_AMP = False\n\n\ndef train_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    return TrainCommand(args)\n\n\nclass TrainCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\"train\", help=\"CLI tool to train a model on a task.\")\n\n        train_parser.add_argument(\n            \"--train_data\",\n            type=str,\n            required=True,\n            help=\"path to train (and optionally evaluation) dataset as a csv with \"\n            \"tab separated labels and sentences.\",\n        )\n        train_parser.add_argument(\n            \"--column_label\", type=int, default=0, help=\"Column of the dataset csv file with example labels.\"\n        )\n        train_parser.add_argument(\n            \"--column_text\", type=int, default=1, help=\"Column of the dataset csv file with example texts.\"\n        )\n        train_parser.add_argument(\n            \"--column_id\", type=int, default=2, help=\"Column of the dataset csv file with example ids.\"\n        )\n        train_parser.add_argument(\n            \"--skip_first_row\", action=\"store_true\", help=\"Skip the first row of the csv file (headers).\"\n        )\n\n        train_parser.add_argument(\"--validation_data\", type=str, default=\"\", help=\"path to validation dataset.\")\n        train_parser.add_argument(\n            \"--validation_split\",\n            type=float,\n            default=0.1,\n            help=\"if validation dataset is not provided, fraction of train dataset \" \"to use as validation dataset.\",\n        )\n\n        train_parser.add_argument(\"--output\", type=str, default=\"./\", help=\"path to saved the trained model.\")\n\n        train_parser.add_argument(\n            \"--task\", type=str, default=\"text_classification\", help=\"Task to train the model on.\"\n        )\n        train_parser.add_argument(\n            \"--model\", type=str, default=\"bert-base-uncased\", help=\"Model's name or path to stored model.\"\n        )\n        train_parser.add_argument(\"--train_batch_size\", type=int, default=32, help=\"Batch size for training.\")\n        train_parser.add_argument(\"--valid_batch_size\", type=int, default=64, help=\"Batch size for validation.\")\n        train_parser.add_argument(\"--learning_rate\", type=float, default=3e-5, help=\"Learning rate.\")\n        train_parser.add_argument(\"--adam_epsilon\", type=float, default=1e-08, help=\"Epsilon for Adam optimizer.\")\n        train_parser.set_defaults(func=train_command_factory)\n\n    def __init__(self, args: Namespace):\n        self.logger = 
getLogger(\"transformers1-cli/training\")\n\n        self.framework = \"tf\" if is_tf_available() else \"torch\"\n\n        os.makedirs(args.output, exist_ok=True)\n        assert os.path.isdir(args.output)\n        self.output = args.output\n\n        self.column_label = args.column_label\n        self.column_text = args.column_text\n        self.column_id = args.column_id\n\n        self.logger.info(\"Loading {} pipeline for {}\".format(args.task, args.model))\n        if args.task == \"text_classification\":\n            self.pipeline = TextClassificationPipeline.from_pretrained(args.model)\n        elif args.task == \"token_classification\":\n            raise NotImplementedError\n        elif args.task == \"question_answering\":\n            raise NotImplementedError\n\n        self.logger.info(\"Loading dataset from {}\".format(args.train_data))\n        self.train_dataset = Processor.create_from_csv(\n            args.train_data,\n            column_label=args.column_label,\n            column_text=args.column_text,\n            column_id=args.column_id,\n            skip_first_row=args.skip_first_row,\n        )\n        self.valid_dataset = None\n        if args.validation_data:\n            self.logger.info(\"Loading validation dataset from {}\".format(args.validation_data))\n            self.valid_dataset = Processor.create_from_csv(\n                args.validation_data,\n                column_label=args.column_label,\n                column_text=args.column_text,\n                column_id=args.column_id,\n                skip_first_row=args.skip_first_row,\n            )\n\n        self.validation_split = args.validation_split\n        self.train_batch_size = args.train_batch_size\n        self.valid_batch_size = args.valid_batch_size\n        self.learning_rate = args.learning_rate\n        self.adam_epsilon = args.adam_epsilon\n\n    def run(self):\n        if self.framework == \"tf\":\n            return self.run_tf()\n        return self.run_torch()\n\n    def run_torch(self):\n        raise NotImplementedError\n\n    def run_tf(self):\n        self.pipeline.fit(\n            self.train_dataset,\n            validation_data=self.valid_dataset,\n            validation_split=self.validation_split,\n            learning_rate=self.learning_rate,\n            adam_epsilon=self.adam_epsilon,\n            train_batch_size=self.train_batch_size,\n            valid_batch_size=self.valid_batch_size,\n        )\n\n        # Save trained pipeline\n        self.pipeline.save_pretrained(self.output)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/transformers_cli.py",
    "content": "#!/usr/bin/env python\nfrom argparse import ArgumentParser\n\nfrom transformers.commands.convert import ConvertCommand\nfrom transformers.commands.download import DownloadCommand\nfrom transformers.commands.env import EnvironmentCommand\nfrom transformers.commands.run import RunCommand\nfrom transformers.commands.serving import ServeCommand\nfrom transformers.commands.user import UserCommands\n\n\ndef main():\n    parser = ArgumentParser(\"Transformers CLI tool\", usage=\"transformers1-cli <command> [<args>]\")\n    commands_parser = parser.add_subparsers(help=\"transformers1-cli command helpers\")\n\n    # Register commands\n    ConvertCommand.register_subcommand(commands_parser)\n    DownloadCommand.register_subcommand(commands_parser)\n    EnvironmentCommand.register_subcommand(commands_parser)\n    RunCommand.register_subcommand(commands_parser)\n    ServeCommand.register_subcommand(commands_parser)\n    UserCommands.register_subcommand(commands_parser)\n\n    # Let's go\n    args = parser.parse_args()\n\n    if not hasattr(args, \"func\"):\n        parser.print_help()\n        exit(1)\n\n    # Run\n    service = args.func(args)\n    service.run()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/commands/user.py",
    "content": "import os\nimport sys\nfrom argparse import ArgumentParser\nfrom getpass import getpass\nfrom typing import List, Union\n\nfrom requests.exceptions import HTTPError\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.hf_api import HfApi, HfFolder\n\n\nUPLOAD_MAX_FILES = 15\n\n\nclass UserCommands(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        login_parser = parser.add_parser(\"login\", help=\"Log in using the same credentials as on huggingface.co\")\n        login_parser.set_defaults(func=lambda args: LoginCommand(args))\n        whoami_parser = parser.add_parser(\"whoami\", help=\"Find out which huggingface.co account you are logged in as.\")\n        whoami_parser.set_defaults(func=lambda args: WhoamiCommand(args))\n        logout_parser = parser.add_parser(\"logout\", help=\"Log out\")\n        logout_parser.set_defaults(func=lambda args: LogoutCommand(args))\n        # s3\n        s3_parser = parser.add_parser(\"s3\", help=\"{ls, rm} Commands to interact with the files you upload on S3.\")\n        s3_subparsers = s3_parser.add_subparsers(help=\"s3 related commands\")\n        ls_parser = s3_subparsers.add_parser(\"ls\")\n        ls_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        ls_parser.set_defaults(func=lambda args: ListObjsCommand(args))\n        rm_parser = s3_subparsers.add_parser(\"rm\")\n        rm_parser.add_argument(\"filename\", type=str, help=\"individual object filename to delete from S3.\")\n        rm_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        rm_parser.set_defaults(func=lambda args: DeleteObjCommand(args))\n        # upload\n        upload_parser = parser.add_parser(\"upload\", help=\"Upload a model to S3.\")\n        upload_parser.add_argument(\n            \"path\", type=str, help=\"Local path of the model folder or individual file to upload.\"\n        )\n        upload_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        upload_parser.add_argument(\n            \"--filename\", type=str, default=None, help=\"Optional: override individual object filename on S3.\"\n        )\n        upload_parser.set_defaults(func=lambda args: UploadCommand(args))\n\n\nclass ANSI:\n    \"\"\"\n    Helper for en.wikipedia.org/wiki/ANSI_escape_code\n    \"\"\"\n\n    _bold = \"\\u001b[1m\"\n    _red = \"\\u001b[31m\"\n    _reset = \"\\u001b[0m\"\n\n    @classmethod\n    def bold(cls, s):\n        return \"{}{}{}\".format(cls._bold, s, cls._reset)\n\n    @classmethod\n    def red(cls, s):\n        return \"{}{}{}\".format(cls._bold + cls._red, s, cls._reset)\n\n\nclass BaseUserCommand:\n    def __init__(self, args):\n        self.args = args\n        self._api = HfApi()\n\n\nclass LoginCommand(BaseUserCommand):\n    def run(self):\n        print(\n            \"\"\"\n        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|\n        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|\n        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|\n        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|\n        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|  
      _|    _|    _|_|_|  _|_|_|_|\n\n        \"\"\"\n        )\n        username = input(\"Username: \")\n        password = getpass()\n        try:\n            token = self._api.login(username, password)\n        except HTTPError as e:\n            # probably invalid credentials, display error message.\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        HfFolder.save_token(token)\n        print(\"Login successful\")\n        print(\"Your token:\", token, \"\\n\")\n        print(\"Your token has been saved to\", HfFolder.path_token)\n\n\nclass WhoamiCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        try:\n            user, orgs = self._api.whoami(token)\n            print(user)\n            if orgs:\n                print(ANSI.bold(\"orgs: \"), \",\".join(orgs))\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n\n\nclass LogoutCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        HfFolder.delete_token()\n        self._api.logout(token)\n        print(\"Successfully logged out.\")\n\n\nclass ListObjsCommand(BaseUserCommand):\n    def tabulate(self, rows: List[List[Union[str, int]]], headers: List[str]) -> str:\n        \"\"\"\n        Inspired by:\n        stackoverflow.com/a/8356620/593036\n        stackoverflow.com/questions/9535954/printing-lists-as-tabular-data\n        \"\"\"\n        col_widths = [max(len(str(x)) for x in col) for col in zip(*rows, headers)]\n        row_format = (\"{{:{}}} \" * len(headers)).format(*col_widths)\n        lines = []\n        lines.append(row_format.format(*headers))\n        lines.append(row_format.format(*[\"-\" * w for w in col_widths]))\n        for row in rows:\n            lines.append(row_format.format(*row))\n        return \"\\n\".join(lines)\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            objs = self._api.list_objs(token, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        if len(objs) == 0:\n            print(\"No shared file yet\")\n            exit()\n        rows = [[obj.filename, obj.LastModified, obj.ETag, obj.Size] for obj in objs]\n        print(self.tabulate(rows, headers=[\"Filename\", \"LastModified\", \"ETag\", \"Size\"]))\n\n\nclass DeleteObjCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            self._api.delete_obj(token, filename=self.args.filename, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        print(\"Done\")\n\n\nclass UploadCommand(BaseUserCommand):\n    def walk_dir(self, rel_path):\n        \"\"\"\n        Recursively list all files in a folder.\n        \"\"\"\n        entries: List[os.DirEntry] = list(os.scandir(rel_path))\n        files = [(os.path.join(os.getcwd(), f.path), f.path) for f in entries if f.is_file()]  # (filepath, filename)\n        for f in 
entries:\n            if f.is_dir():\n                files += self.walk_dir(f.path)\n        return files\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        local_path = os.path.abspath(self.args.path)\n        if os.path.isdir(local_path):\n            if self.args.filename is not None:\n                raise ValueError(\"Cannot specify a filename override when uploading a folder.\")\n            rel_path = os.path.basename(local_path)\n            files = self.walk_dir(rel_path)\n        elif os.path.isfile(local_path):\n            filename = self.args.filename if self.args.filename is not None else os.path.basename(local_path)\n            files = [(local_path, filename)]\n        else:\n            raise ValueError(\"Not a valid file or directory: {}\".format(local_path))\n\n        if sys.platform == \"win32\":\n            files = [(filepath, filename.replace(os.sep, \"/\")) for filepath, filename in files]\n\n        if len(files) > UPLOAD_MAX_FILES:\n            print(\n                \"About to upload {} files to S3. This is probably wrong. Please filter files before uploading.\".format(\n                    ANSI.bold(len(files))\n                )\n            )\n            exit(1)\n\n        user, _ = self._api.whoami(token)\n        namespace = self.args.organization if self.args.organization is not None else user\n\n        for filepath, filename in files:\n            print(\n                \"About to upload file {} to S3 under filename {} and namespace {}\".format(\n                    ANSI.bold(filepath), ANSI.bold(filename), ANSI.bold(namespace)\n                )\n            )\n\n        choice = input(\"Proceed? [Y/n] \").lower()\n        if not (choice == \"\" or choice == \"y\" or choice == \"yes\"):\n            print(\"Abort\")\n            exit()\n        print(ANSI.bold(\"Uploading... This might take a while if files are large\"))\n        for filepath, filename in files:\n            try:\n                access_url = self._api.presign_and_upload(\n                    token=token, filename=filename, filepath=filepath, organization=self.args.organization\n                )\n            except HTTPError as e:\n                print(e)\n                print(ANSI.red(e.response.text))\n                exit(1)\n            print(\"Your file now lives at:\")\n            print(access_url)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ALBERT model configuration \"\"\"\n\nfrom .configuration_utils import PretrainedConfig\n\n\nALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-config.json\",\n    \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-config.json\",\n    \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-config.json\",\n    \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-config.json\",\n    \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-config.json\",\n    \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-config.json\",\n    \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-config.json\",\n    \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-config.json\",\n}\n\n\nclass AlbertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers1 import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"albert\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Config class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(\n    (key, value)\n    for pretrained_map in [\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        BART_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ]\n    for key, value, in pretrained_map.items()\n)\n\n\nCONFIG_MAPPING = OrderedDict(\n    [\n        (\"t5\", T5Config,),\n        (\"distilbert\", 
DistilBertConfig,),\n        (\"albert\", AlbertConfig,),\n        (\"camembert\", CamembertConfig,),\n        (\"xlm-roberta\", XLMRobertaConfig,),\n        (\"marian\", MarianConfig,),\n        (\"bart\", BartConfig,),\n        (\"reformer\", ReformerConfig,),\n        (\"longformer\", LongformerConfig,),\n        (\"roberta\", RobertaConfig,),\n        (\"flaubert\", FlaubertConfig,),\n        (\"bert\", BertConfig,),\n        (\"openai-gpt\", OpenAIGPTConfig,),\n        (\"gpt2\", GPT2Config,),\n        (\"transfo-xl\", TransfoXLConfig,),\n        (\"xlnet\", XLNetConfig,),\n        (\"xlm\", XLMConfig,),\n        (\"ctrl\", CTRLConfig,),\n        (\"electra\", ElectraConfig,),\n        (\"encoder-decoder\", EncoderDecoderConfig,),\n    ]\n)\n\n\nclass AutoConfig:\n    r\"\"\"\n        :class:`~transformers1.AutoConfig` is a generic configuration class\n        that will be instantiated as one of the configuration classes of the library\n        when created with the :func:`~transformers1.AutoConfig.from_pretrained` class method.\n\n        The :func:`~transformers1.AutoConfig.from_pretrained` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string.\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoConfig is designed to be instantiated \"\n            \"using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def for_model(cls, model_type: str, *args, **kwargs):\n        if model_type in CONFIG_MAPPING:\n            config_class = CONFIG_MAPPING[model_type]\n            return config_class(*args, **kwargs)\n        raise ValueError(\n            \"Unrecognized model identifier: {}. 
Should contain one of {}\".format(\n                model_type, \", \".join(CONFIG_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiates one of the configuration classes of the library\n        from a pre-trained model configuration.\n\n        The configuration class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Config` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertConfig` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertConfig` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertConfig` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaConfig` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerConfig` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaConfig` (RoBERTa model)\n            - `reformer`: :class:`~transformers1.ReformerConfig` (Reformer model)\n            - `bert`: :class:`~transformers1.BertConfig` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTConfig` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Config` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLConfig` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetConfig` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMConfig` (XLM model)\n            - `ctrl` : :class:`~transformers1.CTRLConfig` (CTRL model)\n            - `flaubert` : :class:`~transformers1.FlaubertConfig` (Flaubert model)\n            - `electra` : :class:`~transformers1.ElectraConfig` (ELECTRA model)\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                Is either: \\\n                    - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.\n                    - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                    - a path to a `directory` containing a configuration file saved using the :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                    - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.\n\n            cache_dir (:obj:`string`, optional, defaults to `None`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download (:obj:`boolean`, optional, defaults to `False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download (:obj:`boolean`, optional, defaults to `False`):\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n\n            proxies (:obj:`Dict[str, str]`, optional, defaults to `None`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`.\n                The proxies are used on each request. See `the requests documentation <https://requests.readthedocs.io/en/master/user/advanced/#proxies>`__ for usage.\n\n            return_unused_kwargs (:obj:`boolean`, optional, defaults to `False`):\n                - If False, then this function returns just the final configuration object.\n                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.\n\n            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): key/value pairs with which to update the configuration object after loading.\n                - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n\n        Examples::\n\n            config = AutoConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')\n            config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)\n\n        if \"model_type\" in config_dict:\n            config_class = CONFIG_MAPPING[config_dict[\"model_type\"]]\n            return config_class.from_dict(config_dict, **kwargs)\n        else:\n            # Fallback: use pattern matching on the string.\n            for pattern, config_class in CONFIG_MAPPING.items():\n                if pattern in pretrained_model_name_or_path:\n                    return config_class.from_dict(config_dict, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized model in {}. \"\n            \"Should have a `model_type` key in its config.json, or contain one of the following strings \"\n            \"in its name: {}\".format(pretrained_model_name_or_path, \", \".join(CONFIG_MAPPING.keys()))\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BART configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"facebook/bart-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json\",\n    \"facebook/bart-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json\",\n    \"facebook/bart-large-cnn\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json\",\n    \"facebook/bart-large-xsum\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/config.json\",\n    \"facebook/mbart-large-en-ro\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/config.json\",\n}\n\n\nclass BartConfig(PretrainedConfig):\n    r\"\"\"\n        Configuration class for Bart. Parameters are renamed from the fairseq implementation\n    \"\"\"\n    model_type = \"bart\"\n\n    def __init__(\n        self,\n        activation_dropout=0.0,\n        activation_function=\"gelu\",\n        vocab_size=50265,\n        d_model=1024,\n        encoder_ffn_dim=4096,\n        encoder_layers=12,\n        encoder_attention_heads=16,\n        decoder_ffn_dim=4096,\n        decoder_layers=12,\n        decoder_attention_heads=16,\n        encoder_layerdrop=0.0,\n        decoder_layerdrop=0.0,\n        attention_dropout=0.0,\n        dropout=0.1,\n        max_position_embeddings=1024,\n        init_std=0.02,\n        classifier_dropout=0.0,\n        num_labels=3,\n        is_encoder_decoder=True,\n        pad_token_id=1,\n        bos_token_id=0,\n        eos_token_id=2,\n        normalize_before=False,\n        add_final_layer_norm=False,\n        scale_embedding=False,\n        normalize_embedding=True,\n        static_position_embeddings=False,\n        add_bias_logits=False,\n        **common_kwargs\n    ):\n        r\"\"\"\n            :class:`~transformers1.BartConfig` is the configuration class for `BartModel`.\n            Examples:\n                config = BartConfig.from_pretrained('bart-large')\n                model = BartModel(config)\n        \"\"\"\n        if \"hidden_size\" in common_kwargs:\n            raise ValueError(\"hidden size is called d_model\")\n        super().__init__(\n            num_labels=num_labels,\n            pad_token_id=pad_token_id,\n            bos_token_id=bos_token_id,\n            eos_token_id=eos_token_id,\n            is_encoder_decoder=is_encoder_decoder,\n            **common_kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim\n        self.encoder_ffn_dim = encoder_ffn_dim\n        self.encoder_layers = self.num_hidden_layers = encoder_layers\n        self.encoder_attention_heads = 
encoder_attention_heads\n        self.encoder_layerdrop = encoder_layerdrop\n        self.decoder_layerdrop = decoder_layerdrop\n        self.decoder_ffn_dim = decoder_ffn_dim\n        self.decoder_layers = decoder_layers\n        self.decoder_attention_heads = decoder_attention_heads\n        self.max_position_embeddings = max_position_embeddings\n        self.init_std = init_std  # Normal(0, this parameter)\n        self.activation_function = activation_function\n\n        # Params introduced for Mbart\n        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True\n        self.normalize_embedding = normalize_embedding  # True for mbart, False otherwise\n        self.normalize_before = normalize_before  # combo of fairseq's encoder_ and decoder_normalize_before\n        self.add_final_layer_norm = add_final_layer_norm\n\n        # Params introduced for Marian\n        self.add_bias_logits = add_bias_logits\n        self.static_position_embeddings = static_position_embeddings\n\n        # 3 Types of Dropout\n        self.attention_dropout = attention_dropout\n        self.activation_dropout = activation_dropout\n        self.dropout = dropout\n\n        # Classifier stuff\n        self.classif_dropout = classifier_dropout\n\n    @property\n    def num_attention_heads(self) -> int:\n        return self.encoder_attention_heads\n\n    @property\n    def hidden_size(self) -> int:\n        return self.d_model\n\n    def is_valid_mbart(self) -> bool:\n        \"\"\"Is the configuration aligned with the MBART paper.\"\"\"\n        if self.normalize_before and self.add_final_layer_norm and self.scale_embedding:\n            return True\n        if self.normalize_before or self.add_final_layer_norm or self.scale_embedding:\n            logger.info(\"This configuration is a mixture of MBART and BART settings\")\n        return False\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json\",\n    \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json\",\n    \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json\",\n    \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json\",\n    \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json\",\n    \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json\",\n    \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json\",\n    \"bert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json\",\n    \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json\",\n    \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json\",\n    \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json\",\n    \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json\",\n    \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/config.json\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json\",\n    \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/config.json\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 
\"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/config.json\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json\",\n    \"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n}\n\n\nclass BertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.BertModel`.\n        It is used to instantiate an BERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the BERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            hidden_size (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 3072):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.BertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n       
     layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import BertModel, BertConfig\n\n            # Initializing a BERT bert-base-uncased style configuration\n            configuration = BertConfig()\n\n            # Initializing a model from the bert-base-uncased style configuration\n            model = BertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"bert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        hidden_size=768,\n        num_hidden_layers=12,\n        num_attention_heads=12,\n        intermediate_size=3072,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" CamemBERT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-config.json\",\n    \"umberto-commoncrawl-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-commoncrawl-cased-v1/config.json\",\n    \"umberto-wikipedia-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-wikipedia-uncased-v1/config.json\",\n}\n\n\nclass CamembertConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"camembert\"\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Salesforce CTRL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\"ctrl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-config.json\"}\n\n\nclass CTRLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.CTRLModel`.\n        It is used to instantiate an CTRL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 246534):\n                Vocabulary size of the CTRL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 256):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 1280):\n                Dimensionality of the embeddings and hidden states.\n            dff (:obj:`int`, optional, defaults to 8192):\n                Dimensionality of the inner dimension of the FFN.\n            n_layer (:obj:`int`, optional, defaults to 48):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n\n        Example::\n\n            from transformers1 import CTRLModel, CTRLConfig\n\n            # Initializing a CTRL configuration\n            configuration = CTRLConfig()\n\n            # Initializing a model from the configuration\n            model = CTRLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"ctrl\"\n\n    def __init__(\n        self,\n        vocab_size=246534,\n        n_positions=256,\n        n_ctx=256,\n        n_embd=1280,\n        dff=8192,\n        n_layer=48,\n        n_head=16,\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.dff = dff\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = 
summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" DistilBERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nDISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json\",\n    \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json\",\n    \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json\",\n    \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-config.json\",\n    \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json\",\n    \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-config.json\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-finetuned-sst-2-english-config.json\",\n}\n\n\nclass DistilBertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.DistilBertModel`.\n        It is used to instantiate a DistilBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the DistilBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings.\n            n_layers (:obj:`int`, optional, defaults to 6):\n                Number of hidden layers in the Transformer encoder.\n            n_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dim (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_dim (:obj:`int`, optional, defaults to 3072):\n                The size of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            activation (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            qa_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilities used in the question answering model\n                :class:`~transformers1.DistilBertForQuestionAnswering`.\n            seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):\n                The dropout probabilities used in the sequence classification model\n                :class:`~transformers1.DistilBertForSequenceClassification`.\n\n        Example::\n\n            from transformers1 import DistilBertModel, DistilBertConfig\n\n            # Initializing a DistilBERT configuration\n            configuration = DistilBertConfig()\n\n            # Initializing a model from the configuration\n            model = DistilBertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"distilbert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        max_position_embeddings=512,\n        sinusoidal_pos_embds=False,\n        n_layers=6,\n        n_heads=12,\n        dim=768,\n        hidden_dim=4 * 768,\n        dropout=0.1,\n        attention_dropout=0.1,\n        activation=\"gelu\",\n        initializer_range=0.02,\n        qa_dropout=0.1,\n        seq_classif_dropout=0.2,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(**kwargs, pad_token_id=pad_token_id)\n        self.vocab_size = vocab_size\n        self.max_position_embeddings = max_position_embeddings\n        self.sinusoidal_pos_embds = sinusoidal_pos_embds\n        
self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dim = dim\n        self.hidden_dim = hidden_dim\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.activation = activation\n        self.initializer_range = initializer_range\n        self.qa_dropout = qa_dropout\n        self.seq_classif_dropout = seq_classif_dropout\n\n    @property\n    def hidden_size(self):\n        return self.dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_electra.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ELECTRA model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/config.json\",\n    \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/config.json\",\n    \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/config.json\",\n    \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/config.json\",\n    \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/config.json\",\n    \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/config.json\",\n}\n\n\nclass ElectraConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ElectraModel`.\n        It is used to instantiate an ELECTRA model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the ELECTRA model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ElectraModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 4):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.ElectraModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import ElectraModel, ElectraConfig\n\n            # Initializing a ELECTRA electra-base-uncased style configuration\n            configuration = ElectraConfig()\n\n            # Initializing a model from the electra-base-uncased style configuration\n            model = ElectraModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"electra\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        embedding_size=128,\n        hidden_size=256,\n        num_hidden_layers=12,\n        num_attention_heads=4,\n        intermediate_size=1024,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = 
num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport copy\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderConfig(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.\n\n        It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig`\n        and can be used to control the model outputs.\n        See the documentation for :class:`~transformers1.PretrainedConfig` for more information.\n\n        Args:\n            kwargs (`optional`):\n                Remaining dictionary of keyword arguments. Notably:\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the encoder config.\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the decoder config.\n\n        Example::\n\n            from transformers1 import BertConfig, EncoderDecoderConfig, EncoderDecoderModel\n\n            # Initializing a BERT bert-base-uncased style configuration\n            config_encoder = BertConfig()\n            config_decoder = BertConfig()\n\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)\n\n            # Initializing a Bert2Bert model from the bert-base-uncased style configurations\n            model = EncoderDecoderModel(config=config)\n\n            # Accessing the model configuration\n            config_encoder = model.config.encoder\n            config_decoder  = model.config.decoder\n    \"\"\"\n    model_type = \"encoder_decoder\"\n\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        assert (\n            \"encoder\" in kwargs and \"decoder\" in kwargs\n        ), \"Config has to be initialized with encoder and decoder config\"\n        encoder_config = kwargs.pop(\"encoder\")\n        encoder_model_type = encoder_config.pop(\"model_type\")\n        decoder_config = kwargs.pop(\"decoder\")\n        decoder_model_type = decoder_config.pop(\"model_type\")\n\n        from transformers import AutoConfig\n\n        self.encoder = AutoConfig.for_model(encoder_model_type, **encoder_config)\n        self.decoder = AutoConfig.for_model(decoder_model_type, **decoder_config)\n        self.is_encoder_decoder = True\n\n    @classmethod\n    def from_encoder_decoder_configs(\n        cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig\n    ) -> PretrainedConfig:\n        r\"\"\"\n        
Instantiate a :class:`~transformers1.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.\n\n        Returns:\n            :class:`EncoderDecoderConfig`: An instance of a configuration object\n        \"\"\"\n        return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict())\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary. Override the default `to_dict()` from `PretrainedConfig`.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        output[\"encoder\"] = self.encoder.to_dict()\n        output[\"decoder\"] = self.decoder.to_dict()\n        output[\"model_type\"] = self.__class__.model_type\n        return output\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Flaubert configuration, based on XLM. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm import XLMConfig\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/config.json\",\n    \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/config.json\",\n    \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/config.json\",\n    \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/config.json\",\n}\n\n\nclass FlaubertConfig(XLMConfig):\n    \"\"\"\n        Configuration class to store the configuration of a `FlaubertModel`.\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Whether to apply the layer normalization before or after the feed forward layer following the\n                attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)\n            layerdrop (:obj:`float`, `optional`, defaults to 0.0):\n                Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand\n                with Structured Dropout. ICLR 2020)\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the Flaubert model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.FlaubertModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`int`, optional, defaults to 50257):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n    \"\"\"\n\n    model_type = \"flaubert\"\n\n    def __init__(self, layerdrop=0.0, pre_norm=False, pad_token_id=2, bos_token_id=0, **kwargs):\n        \"\"\"Constructs FlaubertConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.layerdrop = layerdrop\n        self.pre_norm = pre_norm\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT-2 configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json\",\n    \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json\",\n    \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json\",\n    \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-config.json\",\n    \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json\",\n}\n\n\nclass GPT2Config(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.GPT2Model`.\n        It is used to instantiate an GPT-2 model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 50257):\n                Vocabulary size of the GPT-2 model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.GPT2Model`.\n            n_positions (:obj:`int`, optional, defaults to 1024):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            activation_function (:obj:`str`, optional, defaults to 'gelu'):\n                Activation function selected in the list [\"relu\", \"swish\", \"gelu\", \"tanh\", \"gelu_new\"].\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 16):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import GPT2Model, GPT2Config\n\n            # Initializing a GPT2 configuration\n            configuration = GPT2Config()\n\n            # Initializing a model from the configuration\n            model = GPT2Model(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"gpt2\"\n\n    def __init__(\n        self,\n        vocab_size=50257,\n        n_positions=1024,\n        n_ctx=1024,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        activation_function=\"gelu_new\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        bos_token_id=50256,\n        eos_token_id=50256,\n        **kwargs\n    ):\n        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.activation_function = activation_function\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n        self.bos_token_id = bos_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Longformer configuration \"\"\"\n\nimport logging\nfrom typing import List, Union\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"allenai/longformer-base-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/config.json\",\n    \"allenai/longformer-large-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/config.json\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-finetuned-triviaqa/config.json\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096-extra.pos.embd.only/config.json\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-extra.pos.embd.only/config.json\",\n}\n\n\nclass LongformerConfig(RobertaConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.LongformerModel`.\n        It is used to instantiate an Longformer model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length 4,096.\n\n        The :class:`~transformers1.LongformerConfig` class directly inherits :class:`~transformers1.RobertaConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Args:\n            attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):\n                Size of an attention window around each token. If :obj:`int`, use the same size for all layers.\n                To specify a different window size for each layer, use a :obj:`List[int]` where\n                ``len(attention_window) == num_hidden_layers``.\n\n        Example::\n\n            from transformers1 import LongformerConfig, LongformerModel\n\n            # Initializing a Longformer configuration\n            configuration = LongformerConfig()\n\n            # Initializing a model from the configuration\n            model = LongformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"longformer\"\n\n    def __init__(self, attention_window: Union[List[int], int] = 512, sep_token_id: int = 2, **kwargs):\n        super().__init__(**kwargs)\n        self.attention_window = attention_window\n        self.sep_token_id = sep_token_id\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 The OPUS-NMT Team, Marian team, and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Marian model configuration \"\"\"\n\nfrom .configuration_bart import BartConfig\n\n\nPRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"Helsinki-NLP/opus-mt-en-de\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/config.json\",\n}\n\n\nclass MarianConfig(BartConfig):\n    model_type = \"marian\"\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" MMBT configuration \"\"\"\n\n\nimport logging\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass MMBTConfig(object):\n    \"\"\"Configuration class to store the configuration of a `MMBT Model`.\n\n    Args:\n        config (:obj:`~transformers1.PreTrainedConfig`):\n            Config of the underlying Transformer models. Its values are\n            copied over to use a single config.\n        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):\n            Size of final Linear layer for classification.\n        modal_hidden_size (:obj:`int`, optional, defautls to 2048):\n            Embedding dimension of the non-text modality encoder.\n    \"\"\"\n\n    def __init__(self, config, num_labels=None, modal_hidden_size=2048):\n        self.__dict__ = config.__dict__\n        self.modal_hidden_size = modal_hidden_size\n        if num_labels:\n            self.num_labels = num_labels\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json\"\n}\n\n\nclass OpenAIGPTConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.OpenAIGPTModel`.\n        It is used to instantiate an GPT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 40478):\n                Vocabulary size of the GPT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            afn (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether special tokens should be predicted when the model is has a language modeling head.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import OpenAIGPTConfig, OpenAIGPTModel\n\n            # Initializing a GPT configuration\n            configuration = OpenAIGPTConfig()\n\n            # Initializing a model from the configuration\n            model = OpenAIGPTModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"openai-gpt\"\n\n    def __init__(\n        self,\n        vocab_size=40478,\n        n_positions=512,\n        n_ctx=512,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        afn=\"gelu\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        predict_special_tokens=True,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.afn = afn\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.predict_special_tokens = predict_special_tokens\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Reformer model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json\",\n    \"google/reformer-enwik8\": \"https://cdn.huggingface.co/google/reformer-enwik8/config.json\",\n}\n\n\nclass ReformerConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ReformerModel`.\n        It is used to instantiate an Reformer model according to the specified arguments, defining the model\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            attention_head_size (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the projected key, query and value vectors\n            attn_layers (:obj:`list(str)`, optional, defaults to [\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"]):\n                List of attention layer types in ascending order. It can be chosen between a\n                LSHSelfAttention layer (\"lsh\") and a LocalSelfAttention layer (\"local\").\n                For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .\n                For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .\n            axial_pos_embds (:obj:`bool`, optional, defaults to True):\n                If `True` use axial position embeddings. 
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__\n            axial_norm_std (:obj:`float`, optional, defaluts to 1.0):\n                The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.\n            axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):\n                The position dims of the axial position encodings.\n                During training the product of the position dims has to equal the sequence length.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):\n                The embedding dims of the axial position encodings.\n                The sum of the embedding dims has to equal the hidden size.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            chunk_size_lm_head (:obj:`int`, optional, defaults to 0):\n                The chunk size of the final language model feed forward head layer.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            chunk_size_feed_forward (:obj:`int`, optional, defaults to 0):\n                The chunk size of all feed forward layers in the residual attention blocks.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            eos_token_id (:obj:`int`, optional, defaults to 2):\n                The token id for the <EOS> token.\n            feed_forward_size (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the \"feed_forward\" (i.e., feed-forward) layer in the residual attention block.\n            hash_seed (:obj:`int`, optional, defaults to `None`):\n                Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposed. 
For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in local sensitive hashing scheme.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"relu\"):\n                The non-linear activation function (function or string) in the feed forward layer in the residual attention block.\n                If string, \"gelu\", \"relu\", \"swish\", \"gelu_new\" and \"gelu_fast\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the output hidden states of the residual attention blocks.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            is_decoder (:obj:`bool`, optional, defaults to False):\n                If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.\n                When using the Reformer for causal language modeling, `is_decoder` is set to `True`.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            local_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            local_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LocalSelfAttention layer to itself.\n            local_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.\n            local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LocalSelfAttention.\n            lsh_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LSHSelfAttention. 
Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LSHSelfAttention.\n            max_position_embeddings (:obj:`int`, optional, defaults to 4096):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):\n                Number of buckets, the key query vectors can be \"hashed into\" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.\n                The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.\n                The number of buckets (or the product the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.\n            num_hashes (:obj:`int`, optional, defaults to 1):\n                Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.\n                The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.\n            pad_token_id (:obj:`int`, optional, defaults to 0):\n                The token id for the <PAD> token.\n            vocab_size (:obj:`int`, optional, defaults to 320):\n                Vocabulary size of the Reformer model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ReformerModel`.\n\n        Example::\n\n            from transformers1 import ReformerModel, ReformerConfig\n\n            # Initializing a Reformer configuration\n            configuration = ReformerConfig()\n\n            # Initializing a Reformer model\n            model = ReformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"reformer\"\n\n    def __init__(\n        self,\n        attention_head_size=64,\n        attn_layers=[\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"],\n        axial_norm_std=1.0,\n        axial_pos_embds=True,\n        axial_pos_shape=[64, 64],\n        axial_pos_embds_dim=[64, 192],\n        chunk_size_lm_head=0,\n        chunk_size_feed_forward=0,\n        eos_token_id=2,\n        feed_forward_size=512,\n        hash_seed=None,\n        hidden_act=\"relu\",\n        hidden_dropout_prob=0.05,\n        hidden_size=256,\n        initializer_range=0.02,\n        is_decoder=False,\n        layer_norm_eps=1e-12,\n        local_num_chunks_before=1,\n        local_num_chunks_after=0,\n        local_attention_probs_dropout_prob=0.05,\n        local_attn_chunk_length=64,\n        lsh_attn_chunk_length=64,\n        lsh_attention_probs_dropout_prob=0.0,\n        lsh_num_chunks_before=1,\n        lsh_num_chunks_after=0,\n        max_position_embeddings=4096,\n        num_attention_heads=2,\n        num_buckets=None,\n        num_hashes=1,\n        pad_token_id=0,\n        vocab_size=320,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_decoder=is_decoder, **kwargs)\n\n        self.hash_seed = hash_seed\n        self.vocab_size = vocab_size\n        self.attention_head_size = attention_head_size\n        self.hidden_size = hidden_size\n        self.num_attention_heads = num_attention_heads\n        self.num_hashes = num_hashes\n        self.num_hidden_layers = len(attn_layers)\n        self.num_buckets = tuple(num_buckets) if isinstance(num_buckets, list) else num_buckets\n        self.lsh_attn_chunk_length = lsh_attn_chunk_length\n        self.local_attn_chunk_length = local_attn_chunk_length\n        self.lsh_num_chunks_after = lsh_num_chunks_after\n        self.lsh_num_chunks_before = lsh_num_chunks_before\n        self.local_num_chunks_after = local_num_chunks_after\n        self.local_num_chunks_before = local_num_chunks_before\n        self.hidden_act = hidden_act\n        self.feed_forward_size = feed_forward_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.lsh_attention_probs_dropout_prob = lsh_attention_probs_dropout_prob\n        self.local_attention_probs_dropout_prob = local_attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.axial_pos_embds = axial_pos_embds\n        self.axial_pos_shape = tuple(axial_pos_shape)\n        self.axial_pos_embds_dim = tuple(axial_pos_embds_dim)\n        self.axial_norm_std = axial_norm_std\n        self.chunk_size_lm_head = chunk_size_lm_head\n        self.chunk_size_feed_forward = chunk_size_feed_forward\n        self.attn_layers = attn_layers\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_bert import BertConfig\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json\",\n    \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json\",\n    \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json\",\n    \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json\",\n    \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json\",\n    \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json\",\n}\n\n\nclass RobertaConfig(BertConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.RobertaModel`.\n        It is used to instantiate an RoBERTa model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        The :class:`~transformers1.RobertaConfig` class directly inherits :class:`~transformers1.BertConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Example::\n\n            from transformers1 import RobertaConfig, RobertaModel\n\n            # Initializing a RoBERTa configuration\n            configuration = RobertaConfig()\n\n            # Initializing a model from the configuration\n            model = RobertaModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"roberta\"\n\n    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):\n        \"\"\"Constructs RobertaConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_t5.py",
    "content": "# coding=utf-8\n# Copyright 2010, The T5 Authors and HuggingFace Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" T5 model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nT5_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-config.json\",\n    \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-config.json\",\n    \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-config.json\",\n    \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-config.json\",\n    \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-config.json\",\n}\n\n\nclass T5Config(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.T5Config` is the configuration class to store the configuration of a\n        `T5Model`.\n\n\n        Arguments:\n            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.\n            d_model: Size of the encoder layers and the pooler layer. `d_model` can also accesed via the property `hidden_size`.\n            num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.\n            num_heads: Number of attention heads for each attention layer in\n                the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.\n            intermediate_size: The size of the \"intermediate\" (i.e., feed-forward)\n                layer in the Transformer encoder.\n            hidden_act: The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob: The dropout probabilitiy for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob: The dropout ratio for the attention\n                probabilities.\n            n_positions: The maximum sequence length that this model might\n                ever be used with. Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048). 
`n_positions` can also be accessed via the property `max_position_embeddings'.\n            type_vocab_size: The vocabulary size of the `token_type_ids` passed into\n                `T5Model`.\n            initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).\n            layer_norm_eps: The epsilon used by LayerNorm.\n    \"\"\"\n    model_type = \"t5\"\n\n    def __init__(\n        self,\n        vocab_size=32128,\n        n_positions=512,\n        d_model=512,\n        d_kv=64,\n        d_ff=2048,\n        num_layers=6,\n        num_heads=8,\n        relative_attention_num_buckets=32,\n        dropout_rate=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_factor=1.0,\n        is_encoder_decoder=True,\n        pad_token_id=0,\n        eos_token_id=1,\n        **kwargs\n    ):\n        super().__init__(\n            pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_encoder_decoder=is_encoder_decoder, **kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.n_positions = n_positions\n        self.d_model = d_model\n        self.d_kv = d_kv\n        self.d_ff = d_ff\n        self.num_layers = num_layers\n        self.num_heads = num_heads\n        self.relative_attention_num_buckets = relative_attention_num_buckets\n        self.dropout_rate = dropout_rate\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_factor = initializer_factor\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.num_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.num_layers\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Transformer XL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json\",\n}\n\n\nclass TransfoXLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.TransfoXLModel`.\n        It is used to instantiate a Transformer XL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 267735):\n                Vocabulary size of the Transformer XL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.TransfoXLModel`.\n            cutoffs (:obj:`List[int]`, optional, defaults to :obj:`[20000, 40000, 200000]`):\n                Cutoffs for the adaptive softmax\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the model's hidden states.\n            d_embed (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the embeddings\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_head (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the model's heads.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Inner dimension in FF\n            div_val (:obj:`int`, optional, defaults to 4):\n                Divident value for adapative input and softmax\n            pre_lnorm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Apply LayerNorm to the input instead of the output\n            n_layer (:obj:`int`, optional, defaults to 18):\n                Number of hidden layers in the Transformer encoder.\n            tgt_len (:obj:`int`, optional, defaults to 128):\n                Number of tokens to predict\n            ext_len (:obj:`int`, optional, defaults to 0):\n                Length of the extended context\n            mem_len (:obj:`int`, optional, defaults to 1600):\n                Length of the retained previous heads\n            clamp_len (:obj:`int`, optional, defaults to 1000):\n                use the same pos embeddings after clamp_len\n            same_length (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Use the same attn length for all tokens\n            proj_share_all_but_first (:obj:`boolean`, optional, defaults to :obj:`True`):\n                True to share all but first projs, False not to share.\n            attn_type (:obj:`int`, optional, defaults to 0):\n                Attention type. 
0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.\n            sample_softmax (:obj:`int`, optional, defaults to -1):\n                number of samples in sampled softmax\n            adaptive (:obj:`boolean`, optional, defaults to :obj:`True`):\n                use adaptive softmax\n            tie_weight (:obj:`boolean`, optional, defaults to :obj:`True`):\n                tie the word embedding and softmax weights\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            dropatt (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            init (:obj:`string`, optional, defaults to `normal`):\n                Parameter initializer to use\n            init_range (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by U(-init_range, init_range).\n            proj_init_std (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by N(0, init_std)\n            init_std (:obj:`float`, optional, defaults to 0.02):\n                Parameters initialized by N(0, init_std)\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n\n        Example::\n\n            from transformers1 import TransfoXLConfig, TransfoXLModel\n\n            # Initializing a Transformer XL configuration\n            configuration = TransfoXLConfig()\n\n            # Initializing a model from the configuration\n            model = TransfoXLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"transfo-xl\"\n\n    def __init__(\n        self,\n        vocab_size=267735,\n        cutoffs=[20000, 40000, 200000],\n        d_model=1024,\n        d_embed=1024,\n        n_head=16,\n        d_head=64,\n        d_inner=4096,\n        div_val=4,\n        pre_lnorm=False,\n        n_layer=18,\n        tgt_len=128,\n        ext_len=0,\n        mem_len=1600,\n        clamp_len=1000,\n        same_length=True,\n        proj_share_all_but_first=True,\n        attn_type=0,\n        sample_softmax=-1,\n        adaptive=True,\n        tie_weight=True,\n        dropout=0.1,\n        dropatt=0.0,\n        untie_r=True,\n        init=\"normal\",\n        init_range=0.01,\n        proj_init_std=0.01,\n        init_std=0.02,\n        layer_norm_epsilon=1e-5,\n        eos_token_id=0,\n        **kwargs\n    ):\n        super().__init__(eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.cutoffs = []\n        self.cutoffs.extend(cutoffs)\n        self.tie_weight = tie_weight\n        if proj_share_all_but_first:\n            self.tie_projs = [False] + [True] * len(self.cutoffs)\n        else:\n            self.tie_projs = [False] + [False] * len(self.cutoffs)\n        self.d_model = d_model\n        self.d_embed = d_embed\n        self.d_head = d_head\n        self.d_inner = d_inner\n        self.div_val = div_val\n        self.pre_lnorm = pre_lnorm\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.tgt_len = tgt_len\n        self.ext_len = ext_len\n        self.mem_len = mem_len\n        self.same_length 
= same_length\n        self.attn_type = attn_type\n        self.clamp_len = clamp_len\n        self.sample_softmax = sample_softmax\n        self.adaptive = adaptive\n        self.dropout = dropout\n        self.dropatt = dropatt\n        self.untie_r = untie_r\n        self.init = init\n        self.init_range = init_range\n        self.proj_init_std = proj_init_std\n        self.init_std = init_std\n        self.layer_norm_epsilon = layer_norm_epsilon\n\n    @property\n    def max_position_embeddings(self):\n        return self.tgt_len + self.ext_len + self.mem_len\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\nfrom typing import Dict, Tuple\n\nfrom .file_utils import CONFIG_NAME, cached_path, hf_bucket_url, is_remote_url\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PretrainedConfig(object):\n    r\"\"\" Base class for all configuration classes.\n        Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.\n\n        Note:\n            A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does **not** load the model weights.\n            It only affects the model's configuration.\n\n        Class attributes (overridden by derived classes):\n            - ``model_type``: a string that identifies the model type, that we serialize into the JSON file, and that we use to recreate the correct object in :class:`~transformers1.AutoConfig`.\n\n        Args:\n            finetuning_task (:obj:`string` or :obj:`None`, `optional`, defaults to :obj:`None`):\n                Name of the task used to fine-tune the model. 
This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.\n            num_labels (:obj:`int`, `optional`, defaults to `2`):\n                Number of classes to use when the model is a classification model (sequences/tokens)\n            output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Should the model returns attentions weights.\n            output_hidden_states (:obj:`string`, `optional`, defaults to :obj:`False`):\n                Should the model returns all hidden-states.\n            torchscript (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Is the model used with Torchscript (for PyTorch models).\n    \"\"\"\n    model_type: str = \"\"\n\n    def __init__(self, **kwargs):\n        # Attributes with defaults\n        self.output_attentions = kwargs.pop(\"output_attentions\", False)\n        self.output_hidden_states = kwargs.pop(\"output_hidden_states\", False)\n        self.use_cache = kwargs.pop(\"use_cache\", True)  # Not used by all models\n        self.torchscript = kwargs.pop(\"torchscript\", False)  # Only used by PyTorch models\n        self.use_bfloat16 = kwargs.pop(\"use_bfloat16\", False)\n        self.pruned_heads = kwargs.pop(\"pruned_heads\", {})\n\n        # Is decoder is used in encoder-decoder models to differentiate encoder from decoder\n        self.is_encoder_decoder = kwargs.pop(\"is_encoder_decoder\", False)\n        self.is_decoder = kwargs.pop(\"is_decoder\", False)\n\n        # Parameters for sequence generation\n        self.max_length = kwargs.pop(\"max_length\", 20)\n        self.min_length = kwargs.pop(\"min_length\", 0)\n        self.do_sample = kwargs.pop(\"do_sample\", False)\n        self.early_stopping = kwargs.pop(\"early_stopping\", False)\n        self.num_beams = kwargs.pop(\"num_beams\", 1)\n        self.temperature = kwargs.pop(\"temperature\", 1.0)\n        self.top_k = kwargs.pop(\"top_k\", 50)\n        self.top_p = kwargs.pop(\"top_p\", 1.0)\n        self.repetition_penalty = kwargs.pop(\"repetition_penalty\", 1.0)\n        self.length_penalty = kwargs.pop(\"length_penalty\", 1.0)\n        self.no_repeat_ngram_size = kwargs.pop(\"no_repeat_ngram_size\", 0)\n        self.bad_words_ids = kwargs.pop(\"bad_words_ids\", None)\n        self.num_return_sequences = kwargs.pop(\"num_return_sequences\", 1)\n\n        # Fine-tuning task arguments\n        self.architectures = kwargs.pop(\"architectures\", None)\n        self.finetuning_task = kwargs.pop(\"finetuning_task\", None)\n        self.id2label = kwargs.pop(\"id2label\", None)\n        self.label2id = kwargs.pop(\"label2id\", None)\n        if self.id2label is not None:\n            kwargs.pop(\"num_labels\", None)\n            self.id2label = dict((int(key), value) for key, value in self.id2label.items())\n            # Keys are always strings in JSON so convert ids to int here.\n        else:\n            self.num_labels = kwargs.pop(\"num_labels\", 2)\n\n        # Tokenizer arguments TODO: eventually tokenizer and models should share the same config\n        self.prefix = kwargs.pop(\"prefix\", None)\n        self.bos_token_id = kwargs.pop(\"bos_token_id\", None)\n        self.pad_token_id = kwargs.pop(\"pad_token_id\", None)\n        self.eos_token_id = kwargs.pop(\"eos_token_id\", None)\n        self.decoder_start_token_id = kwargs.pop(\"decoder_start_token_id\", None)\n\n        # task specific arguments\n        self.task_specific_params = kwargs.pop(\"task_specific_params\", None)\n\n        # 
TPU arguments\n        self.xla_device = kwargs.pop(\"xla_device\", None)\n\n        # Additional attributes without default values\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    @property\n    def num_labels(self):\n        return len(self.id2label)\n\n    @num_labels.setter\n    def num_labels(self, num_labels):\n        self.id2label = {i: \"LABEL_{}\".format(i) for i in range(num_labels)}\n        self.label2id = dict(zip(self.id2label.values(), self.id2label.keys()))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save a configuration object to the directory `save_directory`, so that it\n        can be re-loaded using the :func:`~transformers1.PretrainedConfig.from_pretrained` class method.\n\n        Args:\n            save_directory (:obj:`string`):\n                Directory where the configuration JSON file will be saved.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_config_file = os.path.join(save_directory, CONFIG_NAME)\n\n        self.to_json_file(output_config_file, use_diff=True)\n        logger.info(\"Configuration saved in {}\".format(output_config_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs) -> \"PretrainedConfig\":\n        r\"\"\"\n\n        Instantiate a :class:`~transformers1.PretrainedConfig` (or a derived class) from a pre-trained model configuration.\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                either:\n                  - a string with the `shortcut name` of a pre-trained model configuration to load from cache or\n                    download, e.g.: ``bert-base-uncased``.\n                  - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to\n                    our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                  - a path to a `directory` containing a configuration file saved using the\n                    :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                  - a path or url to a saved configuration JSON `file`, e.g.:\n                    ``./my_model_directory/configuration.json``.\n            cache_dir (:obj:`string`, `optional`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            kwargs (:obj:`Dict[str, any]`, `optional`):\n                The values in kwargs of any keys which are configuration attributes will be used to override the loaded\n                values. 
Behavior concerning key/value pairs whose keys are *not* configuration attributes is\n                controlled by the `return_unused_kwargs` keyword parameter.\n            force_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n            proxies (:obj:`Dict`, `optional`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.:\n                :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.`\n                The proxies are used on each request.\n            return_unused_kwargs: (`optional`) bool:\n                If False, then this function returns just the final configuration object.\n                If True, then this functions returns a :obj:`Tuple(config, unused_kwargs)` where `unused_kwargs` is a\n                dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part\n                of kwargs which has not been used to update `config` and is otherwise ignored.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        Examples::\n\n            # We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a\n            # derived class: BertConfig\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. 
config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')\n            config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)\n        return cls.from_dict(config_dict, **kwargs)\n\n    @classmethod\n    def get_config_dict(cls, pretrained_model_name_or_path: str, **kwargs) -> Tuple[Dict, Dict]:\n        \"\"\"\n        From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used\n        for instantiating a Config using `from_dict`.\n\n        Parameters:\n            pretrained_model_name_or_path (:obj:`string`):\n                The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.\n\n        Returns:\n            :obj:`Tuple[Dict, Dict]`: The dictionary that will be used to instantiate the configuration object.\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        if os.path.isdir(pretrained_model_name_or_path):\n            config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            config_file = pretrained_model_name_or_path\n        else:\n            config_file = hf_bucket_url(pretrained_model_name_or_path, filename=CONFIG_NAME, use_cdn=False)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_config_file = cached_path(\n                config_file,\n                cache_dir=cache_dir,\n                force_download=force_download,\n                proxies=proxies,\n                resume_download=resume_download,\n                local_files_only=local_files_only,\n            )\n            # Load config dict\n            if resolved_config_file is None:\n                raise EnvironmentError\n            config_dict = cls._dict_from_json_file(resolved_config_file)\n\n        except EnvironmentError:\n            msg = (\n                f\"Can't load config for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\\n\\n\"\n            )\n            raise EnvironmentError(msg)\n\n        except json.JSONDecodeError:\n            msg = (\n                \"Couldn't reach server at '{}' to download configuration file or \"\n                \"configuration file is not a valid JSON file. 
\"\n                \"Please check network or file content here: {}.\".format(config_file, resolved_config_file)\n            )\n            raise EnvironmentError(msg)\n\n        if resolved_config_file == config_file:\n            logger.info(\"loading configuration file {}\".format(config_file))\n        else:\n            logger.info(\"loading configuration file {} from cache at {}\".format(config_file, resolved_config_file))\n\n        return config_dict, kwargs\n\n    @classmethod\n    def from_dict(cls, config_dict: Dict, **kwargs) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from a Python dictionary of parameters.\n\n        Args:\n            config_dict (:obj:`Dict[str, any]`):\n                Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved\n                from a pre-trained checkpoint by leveraging the :func:`~transformers1.PretrainedConfig.get_config_dict`\n                method.\n            kwargs (:obj:`Dict[str, any]`):\n                Additional parameters from which to initialize the configuration object.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n        \"\"\"\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        config = cls(**config_dict)\n\n        if hasattr(config, \"pruned_heads\"):\n            config.pruned_heads = dict((int(key), value) for key, value in config.pruned_heads.items())\n\n        # Update config with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(config, key):\n                setattr(config, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model config %s\", str(config))\n        if return_unused_kwargs:\n            return config, kwargs\n        else:\n            return config\n\n    @classmethod\n    def from_json_file(cls, json_file: str) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from the path to a json file of parameters.\n\n        Args:\n            json_file (:obj:`string`):\n                Path to the JSON file containing the parameters.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        \"\"\"\n        config_dict = cls._dict_from_json_file(json_file)\n        return cls(**config_dict)\n\n    @classmethod\n    def _dict_from_json_file(cls, json_file: str):\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        return json.loads(text)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return \"{} {}\".format(self.__class__.__name__, self.to_json_string())\n\n    def to_diff_dict(self):\n        \"\"\"\n        Removes all attributes from config which correspond to the default\n        config attributes for better readability and serializes to a Python\n        dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        config_dict = self.to_dict()\n\n        # get the default config dict\n        default_config_dict = PretrainedConfig().to_dict()\n\n        serializable_config_dict = {}\n\n        # only serialize values that differ from the default config\n        for key, value in 
config_dict.items():\n            if key not in default_config_dict or value != default_config_dict[key]:\n                serializable_config_dict[key] = value\n\n        return serializable_config_dict\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        if hasattr(self.__class__, \"model_type\"):\n            output[\"model_type\"] = self.__class__.model_type\n        return output\n\n    def to_json_string(self, use_diff=True):\n        \"\"\"\n        Serializes this instance to a JSON string.\n\n        Args:\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON string.\n\n        Returns:\n            :obj:`string`: String containing all the attributes that make up this configuration instance in JSON format.\n        \"\"\"\n        if use_diff is True:\n            config_dict = self.to_diff_dict()\n        else:\n            config_dict = self.to_dict()\n        return json.dumps(config_dict, indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path, use_diff=True):\n        \"\"\"\n        Save this instance to a json file.\n\n        Args:\n            json_file_path (:obj:`string`):\n                Path to the JSON file in which this configuration instance's parameters will be saved.\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON file.\n        \"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(self.to_json_string(use_diff=use_diff))\n\n    def update(self, config_dict: Dict):\n        \"\"\"\n        Updates attributes of this class\n        with attributes from `config_dict`.\n\n        Args:\n            :obj:`Dict[str, any]`: Dictionary of attributes that shall be updated for this class.\n        \"\"\"\n        for key, value in config_dict.items():\n            setattr(self, key, value)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-config.json\",\n    \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-config.json\",\n    \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-config.json\",\n    \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-config.json\",\n    \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-config.json\",\n    \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-config.json\",\n    \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-config.json\",\n    \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-config.json\",\n    \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-config.json\",\n    \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-config.json\",\n}\n\n\nclass XLMConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the XLM model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLMModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`int`, optional, defaults to 50257):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n\n        Example::\n\n            from transformers1 import XLMConfig, XLMModel\n\n            # Initializing a XLM configuration\n            configuration = XLMConfig()\n\n            # Initializing a model from the configuration\n            model = XLMModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlm\"\n\n    def __init__(\n        self,\n        vocab_size=30145,\n        emb_dim=2048,\n        n_layers=12,\n        n_heads=16,\n        dropout=0.1,\n        attention_dropout=0.1,\n        gelu_activation=True,\n        sinusoidal_embeddings=False,\n        causal=False,\n        asm=False,\n        n_langs=1,\n        use_lang_emb=True,\n        max_position_embeddings=512,\n        embed_init_std=2048 ** -0.5,\n        layer_norm_eps=1e-12,\n        init_std=0.02,\n        bos_index=0,\n        eos_index=1,\n        pad_index=2,\n        unk_index=3,\n        mask_index=5,\n        is_encoder=True,\n        summary_type=\"first\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        mask_token_id=0,\n        lang_id=0,\n        pad_token_id=2,\n        bos_token_id=0,\n        **kwargs\n    ):\n        \"\"\"Constructs XLMConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.emb_dim = emb_dim\n        self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.gelu_activation = gelu_activation\n        self.sinusoidal_embeddings = sinusoidal_embeddings\n        self.causal = causal\n        self.asm = asm\n        self.n_langs = n_langs\n        self.use_lang_emb = use_lang_emb\n        self.layer_norm_eps = layer_norm_eps\n        self.bos_index = bos_index\n        self.eos_index = eos_index\n        self.pad_index = pad_index\n        self.unk_index = unk_index\n        self.mask_index = mask_index\n        self.is_encoder = is_encoder\n        self.max_position_embeddings = max_position_embeddings\n        self.embed_init_std = embed_init_std\n        self.init_std = init_std\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_proj_to_labels = summary_proj_to_labels\n        self.summary_first_dropout = summary_first_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n        
self.mask_token_id = mask_token_id\n        self.lang_id = lang_id\n\n        if \"n_words\" in kwargs:\n            self.n_words = kwargs[\"n_words\"]\n\n    @property\n    def n_words(self):  # For backward compatibility\n        return self.vocab_size\n\n    @n_words.setter\n    def n_words(self, value):  # For backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.emb_dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM-RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-config.json\",\n    \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-config.json\",\n}\n\n\nclass XLMRobertaConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"xlm-roberta\"\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/configuration_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLNet configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json\",\n    \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-config.json\",\n}\n\n\nclass XLNetConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLNetModel`.\n        It is used to instantiate an XLNet model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 32000):\n                Vocabulary size of the XLNet model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLNetModel`.\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 24):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            ff_activation (:obj:`string`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\" and \"swish\" are supported.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            attn_type (:obj:`string`, optional, defaults to \"bi\"):\n                The attention type used by the model. 
Set 'bi' for XLNet, 'uni' for Transformer-XL.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            mem_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens to cache. The key/value pairs that have already been pre-computed\n                in a previous forward pass won't be re-computed. See the\n                `quickstart <https://huggingface.co/transformers/quickstart.html#using-the-past>`__\n                for more information.\n            reuse_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens in the current batch to be cached and reused in the future.\n            bi_data (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use bidirectional input pipeline. Usually set to `True` during\n                pretraining and `False` during finetuning.\n            clamp_len (:obj:`int`, optional, defaults to -1):\n                Clamp all relative distances larger than clamp_len.\n                Setting this attribute to -1 means no clamping.\n            same_length (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use the same attention length for each token.\n            summary_type (:obj:`string`, optional, defaults to \"last\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Is one of the following options:\n                    - 'last' => take the last token hidden state (like XLNet)\n                    - 'first' => take the first token hidden state (like Bert)\n                    - 'mean' => take the mean of all tokens hidden states\n                    - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                    - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_last_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a dropout after the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n\n        Example::\n\n            from transformers1 import XLNetConfig, XLNetModel\n\n            # Initializing a XLNet configuration\n            configuration = XLNetConfig()\n\n            # Initializing a model from the configuration\n            model = XLNetModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlnet\"\n\n    def __init__(\n        self,\n        vocab_size=32000,\n        d_model=1024,\n        n_layer=24,\n        n_head=16,\n        d_inner=4096,\n        ff_activation=\"gelu\",\n        untie_r=True,\n        attn_type=\"bi\",\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        dropout=0.1,\n        mem_len=None,\n        reuse_len=None,\n        bi_data=False,\n        clamp_len=-1,\n        same_length=False,\n        summary_type=\"last\",\n        summary_use_proj=True,\n        summary_activation=\"tanh\",\n        summary_last_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        pad_token_id=5,\n        bos_token_id=1,\n        eos_token_id=2,\n        **kwargs\n    ):\n        \"\"\"Constructs XLNetConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.d_model = d_model\n        self.n_layer = n_layer\n        self.n_head = n_head\n        assert d_model % n_head == 0\n        self.d_head = d_model // n_head\n        self.ff_activation = ff_activation\n        self.d_inner = d_inner\n        self.untie_r = untie_r\n        self.attn_type = attn_type\n\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n\n        self.dropout = dropout\n        self.mem_len = mem_len\n        self.reuse_len = reuse_len\n        self.bi_data = bi_data\n        self.clamp_len = clamp_len\n        self.same_length = same_length\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_last_dropout = summary_last_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n\n        self.bos_token_id = bos_token_id\n        self.pad_token_id = pad_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return -1\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # 
Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_albert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ALBERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = AlbertConfig.from_json_file(albert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = AlbertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_albert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--albert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained ALBERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.albert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_bart_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BART checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nfrom pathlib import Path\n\nimport fairseq\nimport torch\nfrom packaging import version\n\nfrom transformers import (\n    BartConfig,\n    BartForConditionalGeneration,\n    BartForSequenceClassification,\n    BartModel,\n    BartTokenizer,\n)\nfrom transformers.modeling_bart import _make_linear_from_emb\n\n\nFAIRSEQ_MODELS = [\"bart.large\", \"bart.large.mnli\", \"bart.large.cnn\", \"bart_xsum/model.pt\"]\nextra_arch = {\"bart.large\": BartModel, \"bart.large.mnli\": BartForSequenceClassification}\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \" Hello world! cécé herlolip\"\n\nmnli_rename_keys = [\n    (\"model.classification_heads.mnli.dense.weight\", \"classification_head.dense.weight\"),\n    (\"model.classification_heads.mnli.dense.bias\", \"classification_head.dense.bias\"),\n    (\"model.classification_heads.mnli.out_proj.weight\", \"classification_head.out_proj.weight\"),\n    (\"model.classification_heads.mnli.out_proj.bias\", \"classification_head.out_proj.bias\"),\n]\n\n\ndef remove_ignore_keys_(state_dict):\n    ignore_keys = [\n        \"encoder.version\",\n        \"decoder.version\",\n        \"model.encoder.version\",\n        \"model.decoder.version\",\n        \"_float_tensor\",\n    ]\n    for k in ignore_keys:\n        state_dict.pop(k, None)\n\n\ndef rename_key(dct, old, new):\n    val = dct.pop(old)\n    dct[new] = val\n\n\ndef load_xsum_checkpoint(checkpoint_path):\n    \"\"\"Checkpoint path should end in model.pt\"\"\"\n    sd = torch.load(checkpoint_path, map_location=\"cpu\")\n    hub_interface = torch.hub.load(\"pytorch/fairseq\", \"bart.large.cnn\").eval()\n    hub_interface.model.load_state_dict(sd[\"model\"])\n    return hub_interface\n\n\ndef convert_checkpoint_from_disk(checkpoint_path, **config_kwargs):\n    state_dict = torch.load(checkpoint_path, map_location=\"cpu\")[\"model\"]\n    remove_ignore_keys_(state_dict)\n    vocab_size = state_dict[\"encoder.embed_tokens.weight\"].shape[0]\n    state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n    mbart_config = BartConfig(vocab_size=vocab_size, **config_kwargs)\n    model = BartForConditionalGeneration(mbart_config)\n    model.model.load_state_dict(state_dict)\n    if hasattr(model, \"lm_head\"):\n        model.lm_head = _make_linear_from_emb(model.model.shared)\n    return model\n\n\n@torch.no_grad()\ndef convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):\n    \"\"\"\n    Copy/paste/tweak model's weights to our BERT structure.\n    \"\"\"\n    if not os.path.exists(checkpoint_path):\n        bart = torch.hub.load(\"pytorch/fairseq\", checkpoint_path).eval()\n 
   else:\n        bart = load_xsum_checkpoint(checkpoint_path)\n\n    bart.model.upgrade_state_dict(bart.model.state_dict())\n    if hf_checkpoint_name is None:\n        hf_checkpoint_name = checkpoint_path.replace(\".\", \"-\")\n    config = BartConfig.from_pretrained(hf_checkpoint_name)\n    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)\n    tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors=\"pt\").unsqueeze(0)\n    assert torch.eq(tokens, tokens2).all()\n\n    if checkpoint_path == \"bart.large.mnli\":\n        state_dict = bart.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"model.shared.weight\"] = state_dict[\"model.decoder.embed_tokens.weight\"]\n        for src, dest in mnli_rename_keys:\n            rename_key(state_dict, src, dest)\n        model = BartForSequenceClassification(config).eval()\n        model.load_state_dict(state_dict)\n        fairseq_output = bart.predict(\"mnli\", tokens, return_logits=True)\n        new_model_outputs = model(tokens)[0]  # logits\n    else:  # no classification heads to worry about\n        state_dict = bart.model.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n        fairseq_output = bart.extract_features(tokens)\n        if hf_checkpoint_name == \"facebook/bart-large\":\n            model = BartModel(config).eval()\n            model.load_state_dict(state_dict)\n            new_model_outputs = model(tokens).model[0]\n        else:\n            model = BartForConditionalGeneration(config).eval()  # an existing summarization ckpt\n            model.model.load_state_dict(state_dict)\n            if hasattr(model, \"lm_head\"):\n                model.lm_head = _make_linear_from_emb(model.model.shared)\n            new_model_outputs = model.model(tokens)[0]\n\n    # Check results\n    assert fairseq_output.shape == new_model_outputs.shape\n    assert (fairseq_output == new_model_outputs).all().item()\n    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"fairseq_path\", type=str, help=\"bart.large, bart.large.cnn or a path to a model.pt on local filesystem.\"\n    )\n    parser.add_argument(\"pytorch_dump_folder_path\", default=None, type=str, help=\"Path to the output PyTorch model.\")\n    parser.add_argument(\n        \"--hf_config\", default=None, type=str, help=\"Which huggingface architecture to use: bart-large-xsum\"\n    )\n    args = parser.parse_args()\n    convert_bart_checkpoint(args.fairseq_path, args.pytorch_dump_folder_path, hf_checkpoint_name=args.hf_config)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_bert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = BertConfig.from_json_file(bert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = BertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_bert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--bert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.bert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_bert_pytorch_checkpoint_to_original_tf.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\"\"\"Convert Huggingface Pytorch checkpoint to Tensorflow checkpoint.\"\"\"\n\nimport argparse\nimport os\n\nimport numpy as np\nimport tensorflow as tf\nimport torch\n\nfrom transformers import BertModel\n\n\ndef convert_pytorch_checkpoint_to_tf(model: BertModel, ckpt_dir: str, model_name: str):\n\n    \"\"\"\n    :param model:BertModel Pytorch model instance to be converted\n    :param ckpt_dir: Tensorflow model directory\n    :param model_name: model name\n    :return:\n\n    Currently supported HF models:\n        Y BertModel\n        N BertForMaskedLM\n        N BertForPreTraining\n        N BertForMultipleChoice\n        N BertForNextSentencePrediction\n        N BertForSequenceClassification\n        N BertForQuestionAnswering\n    \"\"\"\n\n    tensors_to_transpose = (\"dense.weight\", \"attention.self.query\", \"attention.self.key\", \"attention.self.value\")\n\n    var_map = (\n        (\"layer.\", \"layer_\"),\n        (\"word_embeddings.weight\", \"word_embeddings\"),\n        (\"position_embeddings.weight\", \"position_embeddings\"),\n        (\"token_type_embeddings.weight\", \"token_type_embeddings\"),\n        (\".\", \"/\"),\n        (\"LayerNorm/weight\", \"LayerNorm/gamma\"),\n        (\"LayerNorm/bias\", \"LayerNorm/beta\"),\n        (\"weight\", \"kernel\"),\n    )\n\n    if not os.path.isdir(ckpt_dir):\n        os.makedirs(ckpt_dir)\n\n    state_dict = model.state_dict()\n\n    def to_tf_var_name(name: str):\n        for patt, repl in iter(var_map):\n            name = name.replace(patt, repl)\n        return \"bert/{}\".format(name)\n\n    def create_tf_var(tensor: np.ndarray, name: str, session: tf.Session):\n        tf_dtype = tf.dtypes.as_dtype(tensor.dtype)\n        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())\n        session.run(tf.variables_initializer([tf_var]))\n        session.run(tf_var)\n        return tf_var\n\n    tf.reset_default_graph()\n    with tf.Session() as session:\n        for var_name in state_dict:\n            tf_name = to_tf_var_name(var_name)\n            torch_tensor = state_dict[var_name].numpy()\n            if any([x in var_name for x in tensors_to_transpose]):\n                torch_tensor = torch_tensor.T\n            tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)\n            tf.keras.backend.set_value(tf_var, torch_tensor)\n            tf_weight = session.run(tf_var)\n            print(\"Successfully created {}: {}\".format(tf_name, np.allclose(tf_weight, torch_tensor)))\n\n        saver = tf.train.Saver(tf.trainable_variables())\n        saver.save(session, os.path.join(ckpt_dir, model_name.replace(\"-\", \"_\") + \".ckpt\"))\n\n\ndef main(raw_args=None):\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--model_name\", type=str, required=True, help=\"model name e.g. 
bert-base-uncased\")\n    parser.add_argument(\n        \"--cache_dir\", type=str, default=None, required=False, help=\"Directory containing pytorch model\"\n    )\n    parser.add_argument(\"--pytorch_model_path\", type=str, required=True, help=\"/path/to/<pytorch-model-name>.bin\")\n    parser.add_argument(\"--tf_cache_dir\", type=str, required=True, help=\"Directory in which to save tensorflow model\")\n    args = parser.parse_args(raw_args)\n\n    model = BertModel.from_pretrained(\n        pretrained_model_name_or_path=args.model_name,\n        state_dict=torch.load(args.pytorch_model_path),\n        cache_dir=args.cache_dir,\n    )\n\n    convert_pytorch_checkpoint_to_tf(model=model, ckpt_dir=args.tf_cache_dir, model_name=args.model_name)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_dialogpt_original_pytorch_checkpoint_to_pytorch.py",
    "content": "import argparse\nimport os\n\nimport torch\n\nfrom transformers.file_utils import WEIGHTS_NAME\n\n\nDIALOGPT_MODELS = [\"small\", \"medium\", \"large\"]\n\nOLD_KEY = \"lm_head.decoder.weight\"\nNEW_KEY = \"lm_head.weight\"\n\n\ndef convert_dialogpt_checkpoint(checkpoint_path: str, pytorch_dump_folder_path: str):\n    d = torch.load(checkpoint_path)\n    d[NEW_KEY] = d.pop(OLD_KEY)\n    os.makedirs(pytorch_dump_folder_path, exist_ok=True)\n    torch.save(d, os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--dialogpt_path\", default=\".\", type=str)\n    args = parser.parse_args()\n    for MODEL in DIALOGPT_MODELS:\n        checkpoint_path = os.path.join(args.dialogpt_path, f\"{MODEL}_ft.pkl\")\n        pytorch_dump_folder_path = f\"./DialoGPT-{MODEL}\"\n        convert_dialogpt_checkpoint(\n            checkpoint_path, pytorch_dump_folder_path,\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_electra_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ELECTRA checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining, load_tf_weights_in_electra\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path, discriminator_or_generator):\n    # Initialise PyTorch model\n    config = ElectraConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n\n    if discriminator_or_generator == \"discriminator\":\n        model = ElectraForPreTraining(config)\n    elif discriminator_or_generator == \"generator\":\n        model = ElectraForMaskedLM(config)\n    else:\n        raise ValueError(\"The discriminator_or_generator argument should be either 'discriminator' or 'generator'\")\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_electra(\n        model, config, tf_checkpoint_path, discriminator_or_generator=discriminator_or_generator\n    )\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--discriminator_or_generator\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Whether to export the generator or the discriminator. Should be a string, either 'discriminator' or \"\n        \"'generator'.\",\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path, args.discriminator_or_generator\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_gpt2_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, GPT2Config, GPT2Model, load_tf_weights_in_gpt2\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if gpt2_config_file == \"\":\n        config = GPT2Config()\n    else:\n        config = GPT2Config.from_json_file(gpt2_config_file)\n    model = GPT2Model(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--gpt2_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--gpt2_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_gpt2_checkpoint_to_pytorch(args.gpt2_checkpoint_path, args.gpt2_config_file, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_graph_to_onnx.py",
    "content": "from argparse import ArgumentParser\nfrom os import listdir, makedirs\nfrom os.path import abspath, dirname, exists\nfrom typing import Dict, List, Optional, Tuple\n\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.pipelines import Pipeline, pipeline\nfrom transformers.tokenization_utils import BatchEncoding\n\n\nclass OnnxConverterArgumentParser(ArgumentParser):\n    \"\"\"\n    Wraps all the script arguments supported to export transformers1 models to ONNX IR\n    \"\"\"\n\n    def __init__(self):\n        super(OnnxConverterArgumentParser, self).__init__(\"ONNX Converter\")\n\n        self.add_argument(\"--model\", type=str, required=True, help=\"Model's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--framework\", type=str, choices=[\"pt\", \"tf\"], help=\"Framework for loading the model\")\n        self.add_argument(\"--opset\", type=int, default=11, help=\"ONNX opset to use\")\n        self.add_argument(\"--check-loading\", action=\"store_true\", help=\"Check ONNX is able to load the model\")\n        self.add_argument(\"--use-external-format\", action=\"store_true\", help=\"Allow exporting model >= than 2Gb\")\n        self.add_argument(\"output\")\n\n\ndef ensure_valid_input(model, tokens, input_names):\n    \"\"\"\n    Ensure input are presented in the correct order, without any None\n    Args:\n        model: The model used to forward the input data\n        tokens: BatchEncoding holding the input data\n        input_names: The name of the inputs\n\n    Returns: Tuple\n\n    \"\"\"\n    model_args_name = model.forward.__code__.co_varnames\n\n    ordered_input_names = []\n    model_args = []\n    for arg_name in model_args_name[1:]:  # start at index 1 to skip \"self\" argument\n        if arg_name in input_names:\n            ordered_input_names.append(arg_name)\n            model_args.append(tokens[arg_name])\n        else:\n            break\n\n    return ordered_input_names, tuple(model_args)\n\n\ndef infer_shapes(nlp: Pipeline, framework: str) -> Tuple[List[str], List[str], Dict, BatchEncoding]:\n    def build_shape_dict(tensor, is_input: bool, seq_len: int):\n        if isinstance(tensor, (tuple, list)):\n            return [build_shape_dict(t, is_input, seq_len) for t in tensor]\n\n        else:\n            # Let's assume batch is the first axis with only 1 element (~~ might not be always true ...)\n            axes = {[axis for axis, numel in enumerate(tensor.shape) if numel == 1][0]: \"batch\"}\n            if is_input:\n                if len(tensor.shape) == 2:\n                    axes[1] = \"sequence\"\n                else:\n                    raise ValueError(\"Unable to infer tensor axes ({})\".format(len(tensor.shape)))\n            else:\n                seq_axes = [dim for dim, shape in enumerate(tensor.shape) if shape == seq_len]\n                axes.update({dim: \"sequence\" for dim in seq_axes})\n\n        return axes\n\n    tokens = nlp.tokenizer.encode_plus(\"This is a sample output\", return_tensors=framework)\n    seq_len = tokens.input_ids.shape[-1]\n    outputs = nlp.model(**tokens) if framework == \"pt\" else nlp.model(tokens)\n\n    if not isinstance(outputs, (list, tuple)):\n        outputs = (outputs,)\n\n    # Generate input names & axes\n    input_vars = list(tokens.keys())\n    input_dynamic_axes = {k: build_shape_dict(v, True, seq_len) for k, v in tokens.items()}\n\n    
# flatten potentially grouped outputs (past for gpt2, attentions)\n    outputs_flat = []\n    for output in outputs:\n        if isinstance(output, (tuple, list)):\n            outputs_flat.extend(output)\n        else:\n            outputs_flat.append(output)\n\n    # Generate output names & axes\n    output_names = [\"output_{}\".format(i) for i in range(len(outputs_flat))]\n    output_dynamic_axes = {k: build_shape_dict(v, False, seq_len) for k, v in zip(output_names, outputs_flat)}\n\n    # Create the aggregated axes representation\n    dynamic_axes = dict(input_dynamic_axes, **output_dynamic_axes)\n    return input_vars, output_names, dynamic_axes, tokens\n\n\ndef load_graph_from_args(framework: str, model: str, tokenizer: Optional[str] = None) -> Pipeline:\n    # If no tokenizer provided\n    if tokenizer is None:\n        tokenizer = model\n\n    print(\"Loading pipeline (model: {}, tokenizer: {})\".format(model, tokenizer))\n\n    # Allocate tokenizer and model\n    return pipeline(\"feature-extraction\", model=model, tokenizer=tokenizer, framework=framework)\n\n\ndef convert_pytorch(nlp: Pipeline, opset: int, output: str, use_external_format: bool):\n    if not is_torch_available():\n        raise Exception(\"Cannot convert because PyTorch is not installed. Please install torch first.\")\n\n    import torch\n    from torch.onnx import export\n\n    print(\"PyTorch: {}\".format(torch.__version__))\n\n    with torch.no_grad():\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"pt\")\n        ordered_input_names, model_args = ensure_valid_input(nlp.model, tokens, input_names)\n\n        export(\n            nlp.model,\n            model_args,\n            f=output,\n            input_names=ordered_input_names,\n            output_names=output_names,\n            dynamic_axes=dynamic_axes,\n            do_constant_folding=True,\n            use_external_data_format=use_external_format,\n            enable_onnx_checker=True,\n            opset_version=opset,\n        )\n\n\ndef convert_tensorflow(nlp: Pipeline, opset: int, output: str):\n    if not is_tf_available():\n        raise Exception(\n            \"Cannot convert {} because TF is not installed. Please install torch first.\".format(args.model)\n        )\n\n    print(\"/!\\\\ Please note TensorFlow doesn't support exporting model > 2Gb /!\\\\\")\n\n    try:\n        import tensorflow as tf\n        from keras2onnx import convert_keras, save_model, __version__ as k2ov\n\n        print(\"TensorFlow: {}, keras2onnx: {}\".format(tf.version.VERSION, k2ov))\n\n        # Build\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"tf\")\n\n        # Forward\n        nlp.model.predict(tokens.data)\n        onnx_model = convert_keras(nlp.model, nlp.model.name, target_opset=opset)\n        save_model(onnx_model, output)\n\n    except ImportError as e:\n        raise Exception(\n            \"Cannot import {} required to convert TF model to ONNX. 
Please install {} first.\".format(e.name, e.name)\n        )\n\n\ndef convert(\n    framework: str,\n    model: str,\n    output: str,\n    opset: int,\n    tokenizer: Optional[str] = None,\n    use_external_format: bool = False,\n):\n    print(\"ONNX opset version set to: {}\".format(opset))\n\n    # Load the pipeline\n    nlp = load_graph_from_args(framework, model, tokenizer)\n\n    parent = dirname(output)\n    if not exists(parent):\n        print(\"Creating folder {}\".format(parent))\n        makedirs(parent)\n    elif len(listdir(parent)) > 0:\n        raise Exception(\"Folder {} is not empty, aborting conversion\".format(parent))\n\n    # Export the graph\n    if framework == \"pt\":\n        convert_pytorch(nlp, opset, output, use_external_format)\n    else:\n        convert_tensorflow(nlp, opset, output)\n\n\ndef verify(path: str):\n    from onnxruntime import InferenceSession, SessionOptions\n    from onnxruntime.capi.onnxruntime_pybind11_state import RuntimeException\n\n    print(\"Checking ONNX model loading from: {}\".format(path))\n    try:\n        onnx_options = SessionOptions()\n        _ = InferenceSession(path, onnx_options, providers=[\"CPUExecutionProvider\"])\n        print(\"Model correctly loaded\")\n    except RuntimeException as re:\n        print(\"Error while loading the model: {}\".format(re))\n\n\nif __name__ == \"__main__\":\n    parser = OnnxConverterArgumentParser()\n    args = parser.parse_args()\n\n    # Make sure output is absolute path\n    args.output = abspath(args.output)\n\n    try:\n        # Convert\n        convert(args.framework, args.model, args.output, args.opset, args.tokenizer, args.use_external_format)\n\n        # And verify\n        if args.check_loading:\n            verify(args.output)\n    except Exception as e:\n        print(\"Error while converting the model: {}\".format(e))\n        exit(1)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_longformer_original_pytorch_lightning_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\n\nimport pytorch_lightning as pl\nimport torch\n\nfrom transformers.modeling_longformer import LongformerForQuestionAnswering, LongformerModel\n\n\nclass LightningModel(pl.LightningModule):\n    def __init__(self, model):\n        super().__init__()\n        self.model = model\n        self.num_labels = 2\n        self.qa_outputs = torch.nn.Linear(self.model.config.hidden_size, self.num_labels)\n\n    # implement only because lighning requires to do so\n    def forward(self):\n        pass\n\n\ndef convert_longformer_qa_checkpoint_to_pytorch(\n    longformer_model: str, longformer_question_answering_ckpt_path: str, pytorch_dump_folder_path: str\n):\n\n    # load longformer model from model identifier\n    longformer = LongformerModel.from_pretrained(longformer_model)\n    lightning_model = LightningModel(longformer)\n\n    ckpt = torch.load(longformer_question_answering_ckpt_path, map_location=torch.device(\"cpu\"))\n    lightning_model.load_state_dict(ckpt[\"state_dict\"])\n\n    # init longformer question answering model\n    longformer_for_qa = LongformerForQuestionAnswering.from_pretrained(longformer_model)\n\n    # transfer weights\n    longformer_for_qa.longformer.load_state_dict(lightning_model.model.state_dict())\n    longformer_for_qa.qa_outputs.load_state_dict(lightning_model.qa_outputs.state_dict())\n    longformer_for_qa.eval()\n\n    # save model\n    longformer_for_qa.save_pretrained(pytorch_dump_folder_path)\n\n    print(\"Conversion succesful. Model saved under {}\".format(pytorch_dump_folder_path))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--longformer_model\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"model identifier of longformer. Should be either `longformer-base-4096` or `longformer-large-4096`.\",\n    )\n    parser.add_argument(\n        \"--longformer_question_answering_ckpt_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path the official PyTorch Lighning Checkpoint.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_longformer_qa_checkpoint_to_pytorch(\n        args.longformer_model, args.longformer_question_answering_ckpt_path, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_marian_to_pytorch.py",
    "content": "import argparse\nimport json\nimport os\nimport shutil\nimport warnings\nfrom pathlib import Path\nfrom typing import Dict, List, Union\nfrom zipfile import ZipFile\n\nimport numpy as np\nimport torch\nfrom tqdm import tqdm\n\nfrom transformers import MarianConfig, MarianMTModel, MarianTokenizer\nfrom transformers.hf_api import HfApi\n\n\ndef remove_prefix(text: str, prefix: str):\n    if text.startswith(prefix):\n        return text[len(prefix) :]\n    return text  # or whatever\n\n\ndef convert_encoder_layer(opus_dict, layer_prefix: str, converter: dict):\n    sd = {}\n    for k in opus_dict:\n        if not k.startswith(layer_prefix):\n            continue\n        stripped = remove_prefix(k, layer_prefix)\n        v = opus_dict[k].T  # besides embeddings, everything must be transposed.\n        sd[converter[stripped]] = torch.tensor(v).squeeze()\n    return sd\n\n\ndef load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is_decoder=False):\n    for i, layer in enumerate(layer_lst):\n        layer_tag = f\"decoder_l{i + 1}_\" if is_decoder else f\"encoder_l{i + 1}_\"\n        sd = convert_encoder_layer(opus_state, layer_tag, converter)\n        layer.load_state_dict(sd, strict=True)\n\n\ndef find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:\n    \"\"\"Find models that can accept src_lang as input and return tgt_lang as output.\"\"\"\n    prefix = \"Helsinki-NLP/opus-mt-\"\n    api = HfApi()\n    model_list = api.model_list()\n    model_ids = [x.modelId for x in model_list if x.modelId.startswith(\"Helsinki-NLP\")]\n    src_and_targ = [\n        remove_prefix(m, prefix).lower().split(\"-\") for m in model_ids if \"+\" not in m\n    ]  # + cant be loaded.\n    matching = [f\"{prefix}{a}-{b}\" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]\n    return matching\n\n\ndef add_emb_entries(wemb, final_bias, n_special_tokens=1):\n    vsize, d_model = wemb.shape\n    embs_to_add = np.zeros((n_special_tokens, d_model))\n    new_embs = np.concatenate([wemb, embs_to_add])\n    bias_to_add = np.zeros((n_special_tokens, 1))\n    new_bias = np.concatenate((final_bias, bias_to_add), axis=1)\n    return new_embs, new_bias\n\n\ndef _cast_yaml_str(v):\n    bool_dct = {\"true\": True, \"false\": False}\n    if not isinstance(v, str):\n        return v\n    elif v in bool_dct:\n        return bool_dct[v]\n    try:\n        return int(v)\n    except (TypeError, ValueError):\n        return v\n\n\ndef cast_marian_config(raw_cfg: Dict[str, str]) -> Dict:\n    return {k: _cast_yaml_str(v) for k, v in raw_cfg.items()}\n\n\nCONFIG_KEY = \"special:model.yml\"\n\n\ndef load_config_from_state_dict(opus_dict):\n    import yaml\n\n    cfg_str = \"\".join([chr(x) for x in opus_dict[CONFIG_KEY]])\n    yaml_cfg = yaml.load(cfg_str[:-1], Loader=yaml.BaseLoader)\n    return cast_marian_config(yaml_cfg)\n\n\ndef find_model_file(dest_dir):  # this one better\n    model_files = list(Path(dest_dir).glob(\"*.npz\"))\n    assert len(model_files) == 1, model_files\n    model_file = model_files[0]\n    return model_file\n\n\n# Group Names Logic: change long opus model names to something shorter, like opus-mt-en-ROMANCE\nROM_GROUP = \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\"\nGROUPS = [\n    (\"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\", \"ZH\"),\n    
(ROM_GROUP, \"ROMANCE\"),\n    (\"de+nl+fy+af+da+fo+is+no+nb+nn+sv\", \"NORTH_EU\"),\n    (\"da+fo+is+no+nb+nn+sv\", \"SCANDINAVIA\"),\n    (\"se+sma+smj+smn+sms\", \"SAMI\"),\n    (\"nb_NO+nb+nn_NO+nn+nog+no_nb+no\", \"NORWAY\"),\n    (\"ga+cy+br+gd+kw+gv\", \"CELTIC\"),  # https://en.wikipedia.org/wiki/Insular_Celtic_languages\n]\nGROUP_TO_OPUS_NAME = {\n    \"opus-mt-ZH-de\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de\",\n    \"opus-mt-ZH-fi\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-fi\",\n    \"opus-mt-ZH-sv\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-sv\",\n    \"opus-mt-SCANDINAVIA-SCANDINAVIA\": \"da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-NORTH_EU-NORTH_EU\": \"de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-de-ZH\": \"de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-en_el_es_fi-en_el_es_fi\": \"en+el+es+fi-en+el+es+fi\",\n    \"opus-mt-en-ROMANCE\": \"en-fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\",\n    \"opus-mt-en-CELTIC\": \"en-ga+cy+br+gd+kw+gv\",\n    \"opus-mt-es-NORWAY\": \"es-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-fi_nb_no_nn_ru_sv_en-SAMI\": \"fi+nb+no+nn+ru+sv+en-se+sma+smj+smn+sms\",\n    \"opus-mt-fi-ZH\": \"fi-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-fi-NORWAY\": \"fi-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-ROMANCE-en\": \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la-en\",\n    \"opus-mt-CELTIC-en\": \"ga+cy+br+gd+kw+gv-en\",\n    \"opus-mt-sv-ZH\": \"sv-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-sv-NORWAY\": \"sv-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n}\nOPUS_GITHUB_URL = \"https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/\"\nORG_NAME = \"Helsinki-NLP/\"\n\n\ndef convert_opus_name_to_hf_name(x):\n    for substr, grp_name in GROUPS:\n        x = x.replace(substr, grp_name)\n    return x.replace(\"+\", \"_\")\n\n\ndef convert_hf_name_to_opus_name(hf_model_name):\n    \"\"\"Relies on the assumption that there are no language codes like pt_br in models that are not in GROUP_TO_OPUS_NAME.\"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    if hf_model_name in GROUP_TO_OPUS_NAME:\n        opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]\n    else:\n        opus_w_prefix = hf_model_name.replace(\"_\", \"+\")\n    return remove_prefix(opus_w_prefix, \"opus-mt-\")\n\n\ndef write_model_card(\n    hf_model_name: str,\n    repo_path=\"OPUS-MT-train/models/\",\n    dry_run=False,\n    model_card_dir=Path(\"marian_converted/model_cards/Helsinki-NLP/\"),\n) -> str:\n    \"\"\"Copy the most recent model's readme section from opus, and add metadata.\n    upload command: s3cmd sync --recursive model_card_dir s3://models.huggingface.co/bert/Helsinki-NLP/\n    \"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    opus_name: str = convert_hf_name_to_opus_name(hf_model_name)\n    opus_src, opus_tgt = [x.split(\"+\") for x in opus_name.split(\"-\")]\n    readme_url = 
OPUS_GITHUB_URL + f\"{opus_name}/README.md\"\n    s, t = \",\".join(opus_src), \",\".join(opus_tgt)\n    extra_markdown = f\"### {hf_model_name}\\n\\n* source languages: {s}\\n* target languages: {t}\\n*  OPUS readme: [{opus_name}]({readme_url})\\n\"\n    # combine with opus markdown\n    opus_readme_path = Path(f\"{repo_path}{opus_name}/README.md\")\n    assert opus_readme_path.exists(), opus_readme_path\n    content = opus_readme_path.open().read()\n    content = content.split(\"\\n# \")[-1]  # Get the lowest level 1 header in the README -- the most recent model.\n    content = \"*\".join(content.split(\"*\")[1:])\n    content = extra_markdown + \"\\n* \" + content.replace(\"download\", \"download original weights\")\n    if dry_run:\n        return content\n    # Save string to model_cards/hf_model_name/readme.md\n    model_card_dir.mkdir(exist_ok=True)\n    sub_dir = model_card_dir / hf_model_name\n    sub_dir.mkdir(exist_ok=True)\n    dest = sub_dir / \"README.md\"\n    dest.open(\"w\").write(content)\n    return content\n\n\ndef get_clean_model_id_mapping(multiling_model_ids):\n    return {x: convert_opus_name_to_hf_name(x) for x in multiling_model_ids}\n\n\ndef make_registry(repo_path=\"Opus-MT-train/models\"):\n    if not (Path(repo_path) / \"fr-en\" / \"README.md\").exists():\n        raise ValueError(\n            f\"repo_path:{repo_path} does not exist: \"\n            \"You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling.\"\n        )\n    results = {}\n    for p in Path(repo_path).ls():\n        n_dash = p.name.count(\"-\")\n        if n_dash == 0:\n            continue\n        else:\n            lns = list(open(p / \"README.md\").readlines())\n            results[p.name] = _parse_readme(lns)\n    return [(k, v[\"pre-processing\"], v[\"download\"], v[\"download\"][:-4] + \".test.txt\") for k, v in results.items()]\n\n\ndef convert_all_sentencepiece_models(model_list=None, repo_path=None):\n    \"\"\"Requires 300GB\"\"\"\n    save_dir = Path(\"marian_ckpt\")\n    dest_dir = Path(\"marian_converted\")\n    dest_dir.mkdir(exist_ok=True)\n    if model_list is None:\n        model_list: list = make_registry(repo_path=repo_path)\n    for k, prepro, download, test_set_url in tqdm(model_list):\n        if \"SentencePiece\" not in prepro:  # dont convert BPE models.\n            continue\n        if not os.path.exists(save_dir / k / \"pytorch_model.bin\"):\n            download_and_unzip(download, save_dir / k)\n        pair_name = convert_opus_name_to_hf_name(k)\n        convert(save_dir / k, dest_dir / f\"opus-mt-{pair_name}\")\n\n\ndef lmap(f, x) -> List:\n    return list(map(f, x))\n\n\ndef fetch_test_set(test_set_url):\n    import wget\n\n    fname = wget.download(test_set_url, \"opus_test.txt\")\n    lns = Path(fname).open().readlines()\n    src = lmap(str.strip, lns[::4])\n    gold = lmap(str.strip, lns[1::4])\n    mar_model = lmap(str.strip, lns[2::4])\n    assert len(gold) == len(mar_model) == len(src)\n    os.remove(fname)\n    return src, mar_model, gold\n\n\ndef convert_whole_dir(path=Path(\"marian_ckpt/\")):\n    for subdir in tqdm(list(path.ls())):\n        dest_dir = f\"marian_converted/{subdir.name}\"\n        if (dest_dir / \"pytorch_model.bin\").exists():\n            continue\n        convert(source_dir, dest_dir)\n\n\ndef _parse_readme(lns):\n    \"\"\"Get link and metadata from opus model card equivalent.\"\"\"\n    subres = {}\n    for ln in [x.strip() for x in lns]:\n        if not ln.startswith(\"*\"):\n            continue\n    
    ln = ln[1:].strip()\n\n        for k in [\"download\", \"dataset\", \"models\", \"model\", \"pre-processing\"]:\n            if ln.startswith(k):\n                break\n        else:\n            continue\n        if k in [\"dataset\", \"model\", \"pre-processing\"]:\n            splat = ln.split(\":\")\n            _, v = splat\n            subres[k] = v\n        elif k == \"download\":\n            v = ln.split(\"(\")[-1][:-1]\n            subres[k] = v\n    return subres\n\n\ndef save_tokenizer_config(dest_dir: Path):\n    dname = dest_dir.name.split(\"-\")\n    dct = dict(target_lang=dname[-1], source_lang=\"-\".join(dname[:-1]))\n    save_json(dct, dest_dir / \"tokenizer_config.json\")\n\n\ndef add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):\n    start = max(vocab.values()) + 1\n    added = 0\n    for tok in special_tokens:\n        if tok in vocab:\n            continue\n        vocab[tok] = start + added\n        added += 1\n    return added\n\n\ndef find_vocab_file(model_dir):\n    return list(model_dir.glob(\"*vocab.yml\"))[0]\n\n\ndef add_special_tokens_to_vocab(model_dir: Path) -> None:\n    vocab = load_yaml(find_vocab_file(model_dir))\n    vocab = {k: int(v) for k, v in vocab.items()}\n    num_added = add_to_vocab_(vocab, [\"<pad>\"])\n    print(f\"added {num_added} tokens to vocab\")\n    save_json(vocab, model_dir / \"vocab.json\")\n    save_tokenizer_config(model_dir)\n\n\ndef save_tokenizer(self, save_directory):\n    dest = Path(save_directory)\n    src_path = Path(self.init_kwargs[\"source_spm\"])\n\n    for dest_name in {\"source.spm\", \"target.spm\", \"tokenizer_config.json\"}:\n        shutil.copyfile(src_path.parent / dest_name, dest / dest_name)\n    save_json(self.encoder, dest / \"vocab.json\")\n\n\ndef check_equal(marian_cfg, k1, k2):\n    v1, v2 = marian_cfg[k1], marian_cfg[k2]\n    assert v1 == v2, f\"hparams {k1},{k2} differ: {v1} != {v2}\"\n\n\ndef check_marian_cfg_assumptions(marian_cfg):\n    assumed_settings = {\n        \"tied-embeddings-all\": True,\n        \"layer-normalization\": False,\n        \"right-left\": False,\n        \"transformer-ffn-depth\": 2,\n        \"transformer-aan-depth\": 2,\n        \"transformer-no-projection\": False,\n        \"transformer-postprocess-emb\": \"d\",\n        \"transformer-postprocess\": \"dan\",  # Dropout, add, normalize\n        \"transformer-preprocess\": \"\",\n        \"type\": \"transformer\",\n        \"ulr-dim-emb\": 0,\n        \"dec-cell-base-depth\": 2,\n        \"dec-cell-high-depth\": 1,\n        \"transformer-aan-nogate\": False,\n    }\n    for k, v in assumed_settings.items():\n        actual = marian_cfg[k]\n        assert actual == v, f\"Unexpected config value for {k} expected {v} got {actual}\"\n    check_equal(marian_cfg, \"transformer-ffn-activation\", \"transformer-aan-activation\")\n    check_equal(marian_cfg, \"transformer-ffn-depth\", \"transformer-aan-depth\")\n    check_equal(marian_cfg, \"transformer-dim-ffn\", \"transformer-dim-aan\")\n\n\nBIAS_KEY = \"decoder_ff_logit_out_b\"\nBART_CONVERTER = {  # for each encoder and decoder layer\n    \"self_Wq\": \"self_attn.q_proj.weight\",\n    \"self_Wk\": \"self_attn.k_proj.weight\",\n    \"self_Wv\": \"self_attn.v_proj.weight\",\n    \"self_Wo\": \"self_attn.out_proj.weight\",\n    \"self_bq\": \"self_attn.q_proj.bias\",\n    \"self_bk\": \"self_attn.k_proj.bias\",\n    \"self_bv\": \"self_attn.v_proj.bias\",\n    \"self_bo\": \"self_attn.out_proj.bias\",\n    \"self_Wo_ln_scale\": \"self_attn_layer_norm.weight\",\n  
  \"self_Wo_ln_bias\": \"self_attn_layer_norm.bias\",\n    \"ffn_W1\": \"fc1.weight\",\n    \"ffn_b1\": \"fc1.bias\",\n    \"ffn_W2\": \"fc2.weight\",\n    \"ffn_b2\": \"fc2.bias\",\n    \"ffn_ffn_ln_scale\": \"final_layer_norm.weight\",\n    \"ffn_ffn_ln_bias\": \"final_layer_norm.bias\",\n    # Decoder Cross Attention\n    \"context_Wk\": \"encoder_attn.k_proj.weight\",\n    \"context_Wo\": \"encoder_attn.out_proj.weight\",\n    \"context_Wq\": \"encoder_attn.q_proj.weight\",\n    \"context_Wv\": \"encoder_attn.v_proj.weight\",\n    \"context_bk\": \"encoder_attn.k_proj.bias\",\n    \"context_bo\": \"encoder_attn.out_proj.bias\",\n    \"context_bq\": \"encoder_attn.q_proj.bias\",\n    \"context_bv\": \"encoder_attn.v_proj.bias\",\n    \"context_Wo_ln_scale\": \"encoder_attn_layer_norm.weight\",\n    \"context_Wo_ln_bias\": \"encoder_attn_layer_norm.bias\",\n}\n\n\nclass OpusState:\n    def __init__(self, source_dir):\n        npz_path = find_model_file(source_dir)\n        self.state_dict = np.load(npz_path)\n        cfg = load_config_from_state_dict(self.state_dict)\n        assert cfg[\"dim-vocabs\"][0] == cfg[\"dim-vocabs\"][1]\n        assert \"Wpos\" not in self.state_dict\n        self.state_dict = dict(self.state_dict)\n        self.wemb, self.final_bias = add_emb_entries(self.state_dict[\"Wemb\"], self.state_dict[BIAS_KEY], 1)\n        self.pad_token_id = self.wemb.shape[0] - 1\n        cfg[\"vocab_size\"] = self.pad_token_id + 1\n        # self.state_dict['Wemb'].sha\n        self.state_keys = list(self.state_dict.keys())\n        if \"Wtype\" in self.state_dict:\n            raise ValueError(\"found Wtype key\")\n        self._check_layer_entries()\n        self.source_dir = source_dir\n        self.cfg = cfg\n        hidden_size, intermediate_shape = self.state_dict[\"encoder_l1_ffn_W1\"].shape\n        assert hidden_size == cfg[\"dim-emb\"] == 512\n\n        # Process decoder.yml\n        decoder_yml = cast_marian_config(load_yaml(source_dir / \"decoder.yml\"))\n        check_marian_cfg_assumptions(cfg)\n        self.hf_config = MarianConfig(\n            vocab_size=cfg[\"vocab_size\"],\n            decoder_layers=cfg[\"dec-depth\"],\n            encoder_layers=cfg[\"enc-depth\"],\n            decoder_attention_heads=cfg[\"transformer-heads\"],\n            encoder_attention_heads=cfg[\"transformer-heads\"],\n            decoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            encoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            d_model=cfg[\"dim-emb\"],\n            activation_function=cfg[\"transformer-aan-activation\"],\n            pad_token_id=self.pad_token_id,\n            eos_token_id=0,\n            bos_token_id=0,\n            max_position_embeddings=cfg[\"dim-emb\"],\n            scale_embedding=True,\n            normalize_embedding=\"n\" in cfg[\"transformer-preprocess\"],\n            static_position_embeddings=not cfg[\"transformer-train-position-embeddings\"],\n            dropout=0.1,  # see opus-mt-train repo/transformer-dropout param.\n            # default: add_final_layer_norm=False,\n            num_beams=decoder_yml[\"beam-size\"],\n            decoder_start_token_id=self.pad_token_id,\n            bad_words_ids=[[self.pad_token_id]],\n            max_length=512,\n        )\n\n    def _check_layer_entries(self):\n        self.encoder_l1 = self.sub_keys(\"encoder_l1\")\n        self.decoder_l1 = self.sub_keys(\"decoder_l1\")\n        self.decoder_l2 = self.sub_keys(\"decoder_l2\")\n        if len(self.encoder_l1) != 16:\n            
warnings.warn(f\"Expected 16 keys for each encoder layer, got {len(self.encoder_l1)}\")\n        if len(self.decoder_l1) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n        if len(self.decoder_l2) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n\n    @property\n    def extra_keys(self):\n        extra = []\n        for k in self.state_keys:\n            if (\n                k.startswith(\"encoder_l\")\n                or k.startswith(\"decoder_l\")\n                or k in [CONFIG_KEY, \"Wemb\", \"Wpos\", \"decoder_ff_logit_out_b\"]\n            ):\n                continue\n            else:\n                extra.append(k)\n        return extra\n\n    def sub_keys(self, layer_prefix):\n        return [remove_prefix(k, layer_prefix) for k in self.state_dict if k.startswith(layer_prefix)]\n\n    def load_marian_model(self) -> MarianMTModel:\n        state_dict, cfg = self.state_dict, self.hf_config\n\n        assert cfg.static_position_embeddings\n        model = MarianMTModel(cfg)\n\n        assert \"hidden_size\" not in cfg.to_dict()\n        load_layers_(\n            model.model.encoder.layers, state_dict, BART_CONVERTER,\n        )\n        load_layers_(model.model.decoder.layers, state_dict, BART_CONVERTER, is_decoder=True)\n\n        # handle tensors not associated with layers\n        wemb_tensor = torch.nn.Parameter(torch.FloatTensor(self.wemb))\n        bias_tensor = torch.nn.Parameter(torch.FloatTensor(self.final_bias))\n        model.model.shared.weight = wemb_tensor\n        model.model.encoder.embed_tokens = model.model.decoder.embed_tokens = model.model.shared\n\n        model.final_logits_bias = bias_tensor\n\n        if \"Wpos\" in state_dict:\n            print(\"Unexpected: got Wpos\")\n            wpos_tensor = torch.tensor(state_dict[\"Wpos\"])\n            model.model.encoder.embed_positions.weight = wpos_tensor\n            model.model.decoder.embed_positions.weight = wpos_tensor\n\n        if cfg.normalize_embedding:\n            assert \"encoder_emb_ln_scale_pre\" in state_dict\n            raise NotImplementedError(\"Need to convert layernorm_embedding\")\n\n        assert not self.extra_keys, f\"Failed to convert {self.extra_keys}\"\n        assert model.model.shared.padding_idx == self.pad_token_id\n        return model\n\n\ndef download_and_unzip(url, dest_dir):\n    try:\n        import wget\n    except ImportError:\n        raise ImportError(\"you must pip install wget\")\n\n    filename = wget.download(url)\n    unzip(filename, dest_dir)\n    os.remove(filename)\n\n\ndef convert(source_dir: Path, dest_dir):\n    dest_dir = Path(dest_dir)\n    dest_dir.mkdir(exist_ok=True)\n\n    add_special_tokens_to_vocab(source_dir)\n    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))\n    save_tokenizer(tokenizer, dest_dir)\n\n    opus_state = OpusState(source_dir)\n    assert opus_state.cfg[\"vocab_size\"] == len(tokenizer.encoder)\n    # save_json(opus_state.cfg, dest_dir / \"marian_original_config.json\")\n    # ^^ Save human readable marian config for debugging\n\n    model = opus_state.load_marian_model()\n    model.save_pretrained(dest_dir)\n    model.from_pretrained(dest_dir)  # sanity check\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\"--src\", type=str, help=\"path to marian model dir\", default=\"en-de\")\n    parser.add_argument(\"--dest\", 
type=str, default=None, help=\"Path to the output PyTorch model.\")\n    args = parser.parse_args()\n\n    source_dir = Path(args.src)\n    assert source_dir.exists()\n    dest_dir = f\"converted-{source_dir.name}\" if args.dest is None else args.dest\n    convert(source_dir, dest_dir)\n\n\ndef load_yaml(path):\n    import yaml\n\n    with open(path) as f:\n        return yaml.load(f, Loader=yaml.BaseLoader)\n\n\ndef save_json(content: Union[Dict, List], path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(content, f)\n\n\ndef unzip(zip_path: str, dest_dir: str) -> None:\n    with ZipFile(zip_path, \"r\") as zipObj:\n        zipObj.extractall(dest_dir)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_openai_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, OpenAIGPTConfig, OpenAIGPTModel, load_tf_weights_in_openai_gpt\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if openai_config_file == \"\":\n        config = OpenAIGPTConfig()\n    else:\n        config = OpenAIGPTConfig.from_json_file(openai_config_file)\n    model = OpenAIGPTModel(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--openai_checkpoint_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the TensorFlow checkpoint path.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--openai_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_openai_checkpoint_to_pytorch(\n        args.openai_checkpoint_folder_path, args.openai_config_file, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_pytorch_checkpoint_to_tf2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Convert pytorch checkpoints to TensorFlow \"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nfrom transformers import (\n    ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    WEIGHTS_NAME,\n    XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    AlbertConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TFAlbertForPreTraining,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFCamembertForMaskedLM,\n    TFCTRLLMHeadModel,\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFElectraForPreTraining,\n    TFFlaubertWithLMHeadModel,\n    TFGPT2LMHeadModel,\n    TFOpenAIGPTLMHeadModel,\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFT5ForConditionalGeneration,\n    TFTransfoXLLMHeadModel,\n    TFXLMRobertaForMaskedLM,\n    TFXLMWithLMHeadModel,\n    TFXLNetLMHeadModel,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n    cached_path,\n    hf_bucket_url,\n    is_torch_available,\n    load_pytorch_checkpoint_in_tf2_model,\n)\n\n\nif is_torch_available():\n    import torch\n    import numpy as np\n    from transformers import (\n        BertForPreTraining,\n        BertForQuestionAnswering,\n        BertForSequenceClassification,\n        GPT2LMHeadModel,\n        XLNetLMHeadModel,\n        XLMWithLMHeadModel,\n        XLMRobertaForMaskedLM,\n        TransfoXLLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        RobertaForMaskedLM,\n        RobertaForSequenceClassification,\n        CamembertForMaskedLM,\n        FlaubertWithLMHeadModel,\n        DistilBertForMaskedLM,\n        DistilBertForQuestionAnswering,\n        CTRLLMHeadModel,\n        AlbertForPreTraining,\n        T5ForConditionalGeneration,\n        ElectraForPreTraining,\n    )\n\n\nlogging.basicConfig(level=logging.INFO)\n\nMODEL_CLASSES = {\n    \"bert\": (BertConfig, TFBertForPreTraining, BertForPreTraining, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    
\"bert-large-cased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"bert-base-cased-finetuned-mrpc\": (\n        BertConfig,\n        TFBertForSequenceClassification,\n        BertForSequenceClassification,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"gpt2\": (GPT2Config, TFGPT2LMHeadModel, GPT2LMHeadModel, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlnet\": (XLNetConfig, TFXLNetLMHeadModel, XLNetLMHeadModel, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm\": (XLMConfig, TFXLMWithLMHeadModel, XLMWithLMHeadModel, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm-roberta\": (\n        XLMRobertaConfig,\n        TFXLMRobertaForMaskedLM,\n        XLMRobertaForMaskedLM,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"transfo-xl\": (\n        TransfoXLConfig,\n        TFTransfoXLLMHeadModel,\n        TransfoXLLMHeadModel,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"openai-gpt\": (\n        OpenAIGPTConfig,\n        TFOpenAIGPTLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"roberta\": (RobertaConfig, TFRobertaForMaskedLM, RobertaForMaskedLM, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"roberta-large-mnli\": (\n        RobertaConfig,\n        TFRobertaForSequenceClassification,\n        RobertaForSequenceClassification,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"camembert\": (\n        CamembertConfig,\n        TFCamembertForMaskedLM,\n        CamembertForMaskedLM,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"flaubert\": (\n        FlaubertConfig,\n        TFFlaubertWithLMHeadModel,\n        FlaubertWithLMHeadModel,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert\": (\n        DistilBertConfig,\n        TFDistilBertForMaskedLM,\n        DistilBertForMaskedLM,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert-base-distilled-squad\": (\n        DistilBertConfig,\n        TFDistilBertForQuestionAnswering,\n        DistilBertForQuestionAnswering,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"ctrl\": (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"albert\": (AlbertConfig, TFAlbertForPreTraining, AlbertForPreTraining, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"t5\": (T5Config, TFT5ForConditionalGeneration, T5ForConditionalGeneration, T5_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"electra\": (ElectraConfig, TFElectraForPreTraining, ElectraForPreTraining, ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n}\n\n\ndef convert_pt_checkpoint_to_tf(\n    model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True\n):\n    if model_type not in MODEL_CLASSES:\n        raise ValueError(\"Unrecognized model type, should be one of {}.\".format(list(MODEL_CLASSES.keys())))\n\n    config_class, model_class, pt_model_class, aws_config_map = MODEL_CLASSES[model_type]\n\n    # Initialise TF model\n    if config_file in aws_config_map:\n        config_file = cached_path(aws_config_map[config_file], force_download=not use_cached_models)\n    config = config_class.from_json_file(config_file)\n    config.output_hidden_states = True\n    config.output_attentions = True\n    print(\"Building TensorFlow model from configuration: {}\".format(str(config)))\n    
tf_model = model_class(config)\n\n    # Load weights from tf checkpoint\n    if pytorch_checkpoint_path in aws_config_map.keys():\n        pytorch_checkpoint_url = hf_bucket_url(pytorch_checkpoint_path, filename=WEIGHTS_NAME)\n        pytorch_checkpoint_path = cached_path(pytorch_checkpoint_url, force_download=not use_cached_models)\n    # Load PyTorch checkpoint in tf2 model:\n    tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)\n\n    if compare_with_pt_model:\n        tfo = tf_model(tf_model.dummy_inputs, training=False)  # build the network\n\n        state_dict = torch.load(pytorch_checkpoint_path, map_location=\"cpu\")\n        pt_model = pt_model_class.from_pretrained(\n            pretrained_model_name_or_path=None, config=config, state_dict=state_dict\n        )\n\n        with torch.no_grad():\n            pto = pt_model(**pt_model.dummy_inputs)\n\n        np_pt = pto[0].numpy()\n        np_tf = tfo[0].numpy()\n        diff = np.amax(np.abs(np_pt - np_tf))\n        print(\"Max absolute difference between models outputs {}\".format(diff))\n        assert diff <= 2e-2, \"Error, model absolute difference is >2e-2: {}\".format(diff)\n\n    # Save pytorch-model\n    print(\"Save TensorFlow model to {}\".format(tf_dump_path))\n    tf_model.save_weights(tf_dump_path, save_format=\"h5\")\n\n\ndef convert_all_pt_checkpoints_to_tf(\n    args_model_type,\n    tf_dump_path,\n    model_shortcut_names_or_path=None,\n    config_shortcut_names_or_path=None,\n    compare_with_pt_model=False,\n    use_cached_models=False,\n    remove_cached_files=False,\n    only_convert_finetuned_models=False,\n):\n    assert os.path.isdir(args.tf_dump_path), \"--tf_dump_path should be a directory\"\n\n    if args_model_type is None:\n        model_types = list(MODEL_CLASSES.keys())\n    else:\n        model_types = [args_model_type]\n\n    for j, model_type in enumerate(model_types, start=1):\n        print(\"=\" * 100)\n        print(\" Converting model type {}/{}: {}\".format(j, len(model_types), model_type))\n        print(\"=\" * 100)\n        if model_type not in MODEL_CLASSES:\n            raise ValueError(\n                \"Unrecognized model type {}, should be one of {}.\".format(model_type, list(MODEL_CLASSES.keys()))\n            )\n\n        config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]\n\n        if model_shortcut_names_or_path is None:\n            model_shortcut_names_or_path = list(aws_model_maps.keys())\n        if config_shortcut_names_or_path is None:\n            config_shortcut_names_or_path = model_shortcut_names_or_path\n\n        for i, (model_shortcut_name, config_shortcut_name) in enumerate(\n            zip(model_shortcut_names_or_path, config_shortcut_names_or_path), start=1\n        ):\n            print(\"-\" * 100)\n            if \"-squad\" in model_shortcut_name or \"-mrpc\" in model_shortcut_name or \"-mnli\" in model_shortcut_name:\n                if not only_convert_finetuned_models:\n                    print(\"    Skipping finetuned checkpoint {}\".format(model_shortcut_name))\n                    continue\n                model_type = model_shortcut_name\n            elif only_convert_finetuned_models:\n                print(\"    Skipping not finetuned checkpoint {}\".format(model_shortcut_name))\n                continue\n            print(\n                \"    Converting checkpoint {}/{}: {} - model_type {}\".format(\n                    i, len(aws_config_map), 
model_shortcut_name, model_type\n                )\n            )\n            print(\"-\" * 100)\n\n            if config_shortcut_name in aws_config_map:\n                config_file = cached_path(aws_config_map[config_shortcut_name], force_download=not use_cached_models)\n            else:\n                config_file = cached_path(config_shortcut_name, force_download=not use_cached_models)\n\n            if model_shortcut_name in aws_model_maps:\n                model_file = cached_path(aws_model_maps[model_shortcut_name], force_download=not use_cached_models)\n            else:\n                model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)\n\n            if os.path.isfile(model_shortcut_name):\n                model_shortcut_name = \"converted_model\"\n\n            convert_pt_checkpoint_to_tf(\n                model_type=model_type,\n                pytorch_checkpoint_path=model_file,\n                config_file=config_file,\n                tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + \"-tf_model.h5\"),\n                compare_with_pt_model=compare_with_pt_model,\n            )\n            if remove_cached_files:\n                os.remove(config_file)\n                os.remove(model_file)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_dump_path\", default=None, type=str, required=True, help=\"Path to the output Tensorflow dump file.\"\n    )\n    parser.add_argument(\n        \"--model_type\",\n        default=None,\n        type=str,\n        help=\"Model type selected in the list of {}. If not given, will download and convert all the models from AWS.\".format(\n            list(MODEL_CLASSES.keys())\n        ),\n    )\n    parser.add_argument(\n        \"--pytorch_checkpoint_path\",\n        default=None,\n        type=str,\n        help=\"Path to the PyTorch checkpoint path or shortcut name to download from AWS. \"\n        \"If not given, will download and convert all the checkpoints from AWS.\",\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture. 
If not given and \"\n        \"--pytorch_checkpoint_path is not given or is a shortcut name\"\n        \"use the configuration associated to the shortcut name on the AWS\",\n    )\n    parser.add_argument(\n        \"--compare_with_pt_model\", action=\"store_true\", help=\"Compare Tensorflow and PyTorch model predictions.\"\n    )\n    parser.add_argument(\n        \"--use_cached_models\",\n        action=\"store_true\",\n        help=\"Use cached models if possible instead of updating to latest checkpoint versions.\",\n    )\n    parser.add_argument(\n        \"--remove_cached_files\",\n        action=\"store_true\",\n        help=\"Remove pytorch models after conversion (save memory when converting in batches).\",\n    )\n    parser.add_argument(\"--only_convert_finetuned_models\", action=\"store_true\", help=\"Only convert finetuned models.\")\n    args = parser.parse_args()\n\n    # if args.pytorch_checkpoint_path is not None:\n    #     convert_pt_checkpoint_to_tf(args.model_type.lower(),\n    #                                 args.pytorch_checkpoint_path,\n    #                                 args.config_file if args.config_file is not None else args.pytorch_checkpoint_path,\n    #                                 args.tf_dump_path,\n    #                                 compare_with_pt_model=args.compare_with_pt_model,\n    #                                 use_cached_models=args.use_cached_models)\n    # else:\n    convert_all_pt_checkpoints_to_tf(\n        args.model_type.lower() if args.model_type is not None else None,\n        args.tf_dump_path,\n        model_shortcut_names_or_path=[args.pytorch_checkpoint_path]\n        if args.pytorch_checkpoint_path is not None\n        else None,\n        config_shortcut_names_or_path=[args.config_file] if args.config_file is not None else None,\n        compare_with_pt_model=args.compare_with_pt_model,\n        use_cached_models=args.use_cached_models,\n        remove_cached_files=args.remove_cached_files,\n        only_convert_finetuned_models=args.only_convert_finetuned_models,\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_reformer_trax_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Reformer checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pickle\n\nimport numpy as np\nimport torch\n\nfrom transformers import ReformerConfig, ReformerModelWithLMHead\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef set_param(torch_layer, weight, bias=None):\n    # set parameter of one layer\n    assert torch_layer.weight.shape == weight.shape, \"{} layer.weight does not match\".format(torch_layer)\n    torch_layer.weight = torch.nn.Parameter(weight)\n    if bias is not None:\n        assert torch_layer.bias.shape == bias.shape, \"{} layer.bias does not match\".format(torch_layer)\n        torch_layer.bias = torch.nn.Parameter(bias)\n\n\ndef set_layer_weights_in_torch_lsh(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query_key = np.asarray(weights[0])\n    np_value = np.asarray(weights[1])\n    np_dense = np.asarray(weights[2])\n\n    set_param(\n        torch_layer.self_attention.query_key,\n        torch.tensor(np_query_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_layer_weights_in_torch_local(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query = np.asarray(weights[0])\n    np_key = np.asarray(weights[1])\n    np_value = np.asarray(weights[2])\n    np_dense = np.asarray(weights[3])\n\n    set_param(\n        torch_layer.self_attention.query, torch.tensor(np_query).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.key, torch.tensor(np_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_block_weights_in_torch(weights, torch_block, hidden_size):\n    # layernorm 1\n    layer_norm_1 = weights[0][0][0]\n    layer_norm_1_weight = np.asarray(layer_norm_1[0])\n    layer_norm_1_bias = np.asarray(layer_norm_1[1])\n    set_param(\n        torch_block.attention.layer_norm, torch.tensor(layer_norm_1_weight), torch.tensor(layer_norm_1_bias),\n    )\n\n    # lsh weights + output\n    attn_weights = weights[0][1]\n    if len(attn_weights) < 4:\n        set_layer_weights_in_torch_lsh(attn_weights, torch_block.attention, hidden_size)\n    else:\n        set_layer_weights_in_torch_local(attn_weights, torch_block.attention, hidden_size)\n\n    # intermediate weighs\n    intermediate_weights = 
weights[2][0][1][2]\n\n    # Chunked Feed Forward\n    if len(intermediate_weights) == 4:\n        intermediate_weights = intermediate_weights[2]\n\n    # layernorm 2\n    layer_norm_2_weight = np.asarray(intermediate_weights[0][0])\n    layer_norm_2_bias = np.asarray(intermediate_weights[0][1])\n    set_param(\n        torch_block.feed_forward.layer_norm, torch.tensor(layer_norm_2_weight), torch.tensor(layer_norm_2_bias),\n    )\n\n    # intermediate dense\n    inter_dense_weight = np.asarray(intermediate_weights[1][0])\n    inter_dense_bias = np.asarray(intermediate_weights[1][1])\n    set_param(\n        torch_block.feed_forward.dense.dense,\n        torch.tensor(inter_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(inter_dense_bias),\n    )\n\n    # intermediate out\n    out_dense_weight = np.asarray(intermediate_weights[4][0])\n    out_dense_bias = np.asarray(intermediate_weights[4][1])\n    set_param(\n        torch_block.feed_forward.output.dense,\n        torch.tensor(out_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(out_dense_bias),\n    )\n\n\ndef set_model_weights_in_torch(weights, torch_model, hidden_size):\n    # reformer model\n    torch_model_reformer = torch_model.reformer\n\n    # word embeds\n    word_embeddings = np.asarray(weights[1])\n    set_param(\n        torch_model_reformer.embeddings.word_embeddings, torch.tensor(word_embeddings),\n    )\n\n    if isinstance(weights[3], tuple):\n        position_embeddings = torch_model_reformer.embeddings.position_embeddings\n        for emb_idx in range(len(position_embeddings.weights)):\n            emb_weights = np.asarray(weights[3][emb_idx][0])\n            assert position_embeddings.weights[emb_idx].shape == emb_weights.shape, \"{} emb does not match\".format(\n                position_embeddings[emb_idx]\n            )\n            position_embeddings.weights[emb_idx] = torch.nn.Parameter(torch.tensor(emb_weights))\n\n    trax_layer_weights = weights[5]\n    assert len(torch_model_reformer.encoder.layers) * 4 == len(\n        trax_layer_weights\n    ), \"HF and trax model do not have the same number of layers\"\n    for layer_idx, layer in enumerate(torch_model_reformer.encoder.layers):\n        block_weights = trax_layer_weights[4 * layer_idx : 4 * (layer_idx + 1)]\n        set_block_weights_in_torch(block_weights, layer, hidden_size)\n\n    # output layer norm\n    layer_norm_out_weight = np.asarray(weights[7][0])\n    layer_norm_out_bias = np.asarray(weights[7][1])\n    set_param(\n        torch_model_reformer.encoder.layer_norm,\n        torch.tensor(layer_norm_out_weight),\n        torch.tensor(layer_norm_out_bias),\n    )\n\n    # output embeddings\n    output_embed_weights = np.asarray(weights[9][0])\n    output_embed_bias = np.asarray(weights[9][1])\n    set_param(\n        torch_model.lm_head.decoder,\n        torch.tensor(output_embed_weights).transpose(0, 1).contiguous(),\n        torch.tensor(output_embed_bias),\n    )\n\n\ndef convert_trax_checkpoint_to_pytorch(trax_model_pkl_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = ReformerConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = ReformerModelWithLMHead(config)\n\n    with open(trax_model_pkl_path, \"rb\") as f:\n        model_weights = pickle.load(f)[\"weights\"]\n\n    set_model_weights_in_torch(model_weights, model, config.hidden_size)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to 
{}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--trax_model_pkl_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained Reformer model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_trax_checkpoint_to_pytorch(args.trax_model_pkl_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_roberta_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pathlib\n\nimport fairseq\nimport torch\nfrom fairseq.models.roberta import RobertaModel as FairseqRobertaModel\nfrom fairseq.modules import TransformerSentenceEncoderLayer\nfrom packaging import version\n\nfrom transformers.modeling_bert import BertIntermediate, BertLayer, BertOutput, BertSelfAttention, BertSelfOutput\nfrom transformers.modeling_roberta import RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification\n\n\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \"Hello world! cécé herlolip\"\n\n\ndef convert_roberta_checkpoint_to_pytorch(\n    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool\n):\n    \"\"\"\n    Copy/paste/tweak roberta's weights to our BERT structure.\n    \"\"\"\n    roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)\n    roberta.eval()  # disable dropout\n    roberta_sent_encoder = roberta.model.decoder.sentence_encoder\n    config = RobertaConfig(\n        vocab_size=roberta_sent_encoder.embed_tokens.num_embeddings,\n        hidden_size=roberta.args.encoder_embed_dim,\n        num_hidden_layers=roberta.args.encoder_layers,\n        num_attention_heads=roberta.args.encoder_attention_heads,\n        intermediate_size=roberta.args.encoder_ffn_embed_dim,\n        max_position_embeddings=514,\n        type_vocab_size=1,\n        layer_norm_eps=1e-5,  # PyTorch default used in fairseq\n    )\n    if classification_head:\n        config.num_labels = roberta.args.num_classes\n    print(\"Our BERT config:\", config)\n\n    model = RobertaForSequenceClassification(config) if classification_head else RobertaForMaskedLM(config)\n    model.eval()\n\n    # Now let's copy all the weights.\n    # Embeddings\n    model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight\n    model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight\n    model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(\n        model.roberta.embeddings.token_type_embeddings.weight\n    )  # just zero them out b/c RoBERTa doesn't use them.\n    model.roberta.embeddings.LayerNorm.weight = roberta_sent_encoder.emb_layer_norm.weight\n    model.roberta.embeddings.LayerNorm.bias = roberta_sent_encoder.emb_layer_norm.bias\n\n    for i in range(config.num_hidden_layers):\n        # Encoder: start of layer\n        layer: BertLayer = model.roberta.encoder.layer[i]\n        roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i]\n\n        # self attention\n        self_attn: BertSelfAttention = layer.attention.self\n        assert (\n            
roberta_layer.self_attn.k_proj.weight.data.shape\n            == roberta_layer.self_attn.q_proj.weight.data.shape\n            == roberta_layer.self_attn.v_proj.weight.data.shape\n            == torch.Size((config.hidden_size, config.hidden_size))\n        )\n\n        self_attn.query.weight.data = roberta_layer.self_attn.q_proj.weight\n        self_attn.query.bias.data = roberta_layer.self_attn.q_proj.bias\n        self_attn.key.weight.data = roberta_layer.self_attn.k_proj.weight\n        self_attn.key.bias.data = roberta_layer.self_attn.k_proj.bias\n        self_attn.value.weight.data = roberta_layer.self_attn.v_proj.weight\n        self_attn.value.bias.data = roberta_layer.self_attn.v_proj.bias\n\n        # self-attention output\n        self_output: BertSelfOutput = layer.attention.output\n        assert self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape\n        self_output.dense.weight = roberta_layer.self_attn.out_proj.weight\n        self_output.dense.bias = roberta_layer.self_attn.out_proj.bias\n        self_output.LayerNorm.weight = roberta_layer.self_attn_layer_norm.weight\n        self_output.LayerNorm.bias = roberta_layer.self_attn_layer_norm.bias\n\n        # intermediate\n        intermediate: BertIntermediate = layer.intermediate\n        assert intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape\n        intermediate.dense.weight = roberta_layer.fc1.weight\n        intermediate.dense.bias = roberta_layer.fc1.bias\n\n        # output\n        bert_output: BertOutput = layer.output\n        assert bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape\n        bert_output.dense.weight = roberta_layer.fc2.weight\n        bert_output.dense.bias = roberta_layer.fc2.bias\n        bert_output.LayerNorm.weight = roberta_layer.final_layer_norm.weight\n        bert_output.LayerNorm.bias = roberta_layer.final_layer_norm.bias\n        # end of layer\n\n    if classification_head:\n        model.classifier.dense.weight = roberta.model.classification_heads[\"mnli\"].dense.weight\n        model.classifier.dense.bias = roberta.model.classification_heads[\"mnli\"].dense.bias\n        model.classifier.out_proj.weight = roberta.model.classification_heads[\"mnli\"].out_proj.weight\n        model.classifier.out_proj.bias = roberta.model.classification_heads[\"mnli\"].out_proj.bias\n    else:\n        # LM Head\n        model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight\n        model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias\n        model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight\n        model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias\n        model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight\n        model.lm_head.decoder.bias = roberta.model.decoder.lm_head.bias\n\n    # Let's check that we get the same results.\n    input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0)  # batch of size 1\n\n    our_output = model(input_ids)[0]\n    if classification_head:\n        their_output = roberta.model.classification_heads[\"mnli\"](roberta.extract_features(input_ids))\n    else:\n        their_output = roberta.model(input_ids)[0]\n    print(our_output.shape, their_output.shape)\n    max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()\n    print(f\"max_absolute_diff = {max_absolute_diff}\")  # ~ 1e-7\n    success = torch.allclose(our_output, their_output, atol=1e-3)\n    print(\"Do both 
models output the same tensors?\", \"🔥\" if success else \"💩\")\n    if not success:\n        raise Exception(\"Something went wRoNg\")\n\n    pathlib.Path(pytorch_dump_folder_path).mkdir(parents=True, exist_ok=True)\n    print(f\"Saving model to {pytorch_dump_folder_path}\")\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--roberta_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--classification_head\", action=\"store_true\", help=\"Whether to convert a final classification head.\"\n    )\n    args = parser.parse_args()\n    convert_roberta_checkpoint_to_pytorch(\n        args.roberta_checkpoint_path, args.pytorch_dump_folder_path, args.classification_head\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_t5_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The T5 authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert T5 checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import T5Config, T5Model, load_tf_weights_in_t5\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = T5Config.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = T5Model(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_t5(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained T5 model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Transformer XL checkpoint and datasets.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nimport pickle\nimport sys\n\nimport torch\n\nimport transformers.tokenization_transfo_xl as data_utils\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    TransfoXLConfig,\n    TransfoXLLMHeadModel,\n    load_tf_weights_in_transfo_xl,\n)\nfrom transformers.tokenization_transfo_xl import CORPUS_NAME, VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n# We do this to be able to load python 2 datasets pickles\n# See e.g. https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory/2121918#2121918\ndata_utils.Vocab = data_utils.TransfoXLTokenizer\ndata_utils.Corpus = data_utils.TransfoXLCorpus\nsys.modules[\"data_utils\"] = data_utils\nsys.modules[\"vocabulary\"] = data_utils\n\n\ndef convert_transfo_xl_checkpoint_to_pytorch(\n    tf_checkpoint_path, transfo_xl_config_file, pytorch_dump_folder_path, transfo_xl_dataset_file\n):\n    if transfo_xl_dataset_file:\n        # Convert a pre-processed corpus (see original TensorFlow repo)\n        with open(transfo_xl_dataset_file, \"rb\") as fp:\n            corpus = pickle.load(fp, encoding=\"latin1\")\n        # Save vocabulary and dataset cache as Dictionaries (should be better than pickles for the long-term)\n        pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"pretrained_vocab_file\"]\n        print(\"Save vocabulary to {}\".format(pytorch_vocab_dump_path))\n        corpus_vocab_dict = corpus.vocab.__dict__\n        torch.save(corpus_vocab_dict, pytorch_vocab_dump_path)\n\n        corpus_dict_no_vocab = corpus.__dict__\n        corpus_dict_no_vocab.pop(\"vocab\", None)\n        pytorch_dataset_dump_path = pytorch_dump_folder_path + \"/\" + CORPUS_NAME\n        print(\"Save dataset to {}\".format(pytorch_dataset_dump_path))\n        torch.save(corpus_dict_no_vocab, pytorch_dataset_dump_path)\n\n    if tf_checkpoint_path:\n        # Convert a pre-trained TensorFlow model\n        config_path = os.path.abspath(transfo_xl_config_file)\n        tf_path = os.path.abspath(tf_checkpoint_path)\n\n        print(\"Converting Transformer XL checkpoint from {} with config at {}\".format(tf_path, config_path))\n        # Initialise PyTorch model\n        if transfo_xl_config_file == \"\":\n            config = TransfoXLConfig()\n        else:\n            config = TransfoXLConfig.from_json_file(transfo_xl_config_file)\n        print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n        model = TransfoXLLMHeadModel(config)\n\n        model = load_tf_weights_in_transfo_xl(model, config, tf_path)\n        # Save pytorch-model\n        pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n        pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n 
       print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n        torch.save(model.state_dict(), pytorch_weights_dump_path)\n        print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n        with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n            f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--tf_checkpoint_path\",\n        default=\"\",\n        type=str,\n        help=\"An optional path to a TensorFlow checkpoint path to be converted.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_dataset_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional dataset file to be converted in a vocabulary.\",\n    )\n    args = parser.parse_args()\n    convert_transfo_xl_checkpoint_to_pytorch(\n        args.tf_checkpoint_path,\n        args.transfo_xl_config_file,\n        args.pytorch_dump_folder_path,\n        args.transfo_xl_dataset_file,\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_xlm_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport json\nimport logging\n\nimport numpy\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME\nfrom transformers.tokenization_xlm import VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlm_checkpoint_to_pytorch(xlm_checkpoint_path, pytorch_dump_folder_path):\n    # Load checkpoint\n    chkpt = torch.load(xlm_checkpoint_path, map_location=\"cpu\")\n\n    state_dict = chkpt[\"model\"]\n\n    # We have the base model one level deeper than the original XLM repository\n    two_levels_state_dict = {}\n    for k, v in state_dict.items():\n        if \"pred_layer\" in k:\n            two_levels_state_dict[k] = v\n        else:\n            two_levels_state_dict[\"transformer.\" + k] = v\n\n    config = chkpt[\"params\"]\n    config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.FloatTensor, numpy.ndarray)))\n\n    vocab = chkpt[\"dico_word2id\"]\n    vocab = dict((s + \"</w>\" if s.find(\"@@\") == -1 and i > 13 else s.replace(\"@@\", \"\"), i) for s, i in vocab.items())\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"vocab_file\"]\n\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(two_levels_state_dict, pytorch_weights_dump_path)\n\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(config, indent=2) + \"\\n\")\n\n    print(\"Save vocab file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_vocab_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(vocab, indent=2) + \"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--xlm_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_xlm_checkpoint_to_pytorch(args.xlm_checkpoint_path, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/convert_xlnet_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nimport torch\n\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    XLNetConfig,\n    XLNetForQuestionAnswering,\n    XLNetForSequenceClassification,\n    XLNetLMHeadModel,\n    load_tf_weights_in_xlnet,\n)\n\n\nGLUE_TASKS_NUM_LABELS = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlnet_checkpoint_to_pytorch(\n    tf_checkpoint_path, bert_config_file, pytorch_dump_folder_path, finetuning_task=None\n):\n    # Initialise PyTorch model\n    config = XLNetConfig.from_json_file(bert_config_file)\n\n    finetuning_task = finetuning_task.lower() if finetuning_task is not None else \"\"\n    if finetuning_task in GLUE_TASKS_NUM_LABELS:\n        print(\"Building PyTorch XLNetForSequenceClassification model from configuration: {}\".format(str(config)))\n        config.finetuning_task = finetuning_task\n        config.num_labels = GLUE_TASKS_NUM_LABELS[finetuning_task]\n        model = XLNetForSequenceClassification(config)\n    elif \"squad\" in finetuning_task:\n        config.finetuning_task = finetuning_task\n        model = XLNetForQuestionAnswering(config)\n    else:\n        model = XLNetLMHeadModel(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_xlnet(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n    pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n    print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--xlnet_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained XLNet model. 
\\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--finetuning_task\",\n        default=None,\n        type=str,\n        help=\"Name of a task on which the XLNet TensorFloaw model was fine-tuned\",\n    )\n    args = parser.parse_args()\n    print(args)\n\n    convert_xlnet_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.xlnet_config_file, args.pytorch_dump_folder_path, args.finetuning_task\n    )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .metrics import is_sklearn_available\nfrom .processors import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n\nif is_sklearn_available():\n    from .metrics import glue_compute_metrics, xnli_compute_metrics\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/data_collator.py",
    "content": "from abc import ABC, abstractmethod\nfrom dataclasses import dataclass\nfrom typing import Any, Dict, List, NewType, Tuple\n\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nimport random\nimport numpy as np\nfrom ..tokenization_utils import PreTrainedTokenizer\n\n\nclass DataCollator(ABC):\n    \"\"\"\n    A `DataCollator` is responsible for batching\n    and pre-processing samples of data as requested by the training loop.\n    \"\"\"\n\n    @abstractmethod\n    def collate_batch(self) -> Dict[str, torch.Tensor]:\n        \"\"\"\n        Take a list of samples from a Dataset and collate them into a batch.\n\n        Returns:\n            A dictionary of tensors\n        \"\"\"\n        pass\n\n\nInputDataClass = NewType(\"InputDataClass\", Any)\n\n\n@dataclass\nclass DefaultDataCollator(DataCollator):\n    \"\"\"\n    Very simple data collator that:\n    - simply collates batches of dict-like objects\n    - Performs special handling for potential keys named:\n        - `label`: handles a single value (int or float) per object\n        - `label_ids`: handles a list of values per object\n    - does not do any additional preprocessing\n\n    i.e., Property names of the input object will be used as corresponding inputs to the model.\n    See glue and ner for example of how it's useful.\n    \"\"\"\n\n    def collate_batch(self, features: List[InputDataClass]) -> Dict[str, torch.Tensor]:\n        # In this method we'll make the assumption that all `features` in the batch\n        # have the same attributes.\n        # So we will look at the first element as a proxy for what attributes exist\n        # on the whole batch.\n        first = features[0]\n\n        # Special handling for labels.\n        # Ensure that tensor is created with the correct type\n        # (it should be automatically the case, but let's make sure of it.)\n        if hasattr(first, \"label\") and first.label is not None:\n            if type(first.label) is int:\n                labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        elif hasattr(first, \"label_ids\") and first.label_ids is not None:\n            if type(first.label_ids[0]) is int:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        else:\n            batch = {}\n\n        # Handling of all other possible attributes.\n        # Again, we will use the first element to figure out which key/values are not None for this model.\n        for k, v in vars(first).items():\n            if k not in (\"label\", \"label_ids\") and v is not None and not isinstance(v, str):\n                batch[k] = torch.tensor([getattr(f, k) for f in features], dtype=torch.long)\n        return batch\n\n\n@dataclass\nclass DataCollatorForLanguageModeling(DataCollator):\n    \"\"\"\n    Data collator used for language modeling.\n    - collates batches of tensors, honoring their tokenizer's pad_token\n    - preprocesses batches for masked language modeling\n    \"\"\"\n\n    tokenizer: PreTrainedTokenizer\n    mlm: bool = True\n    mlm_probability: float = 0.15\n\n    def collate_batch(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:\n        batch = 
self._tensorize_batch(examples)\n        if self.mlm:\n            inputs, labels = self.mask_tokens7(batch)\n            return {\"input_ids\": inputs, \"labels\": labels}\n        else:\n            return {\"input_ids\": batch, \"labels\": batch}\n\n    def _tensorize_batch(self, examples: List[torch.Tensor]) -> torch.Tensor:\n        length_of_first = examples[0].size(0)\n        are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)\n        if are_tensors_same_length:\n            return torch.stack(examples, dim=0)\n        else:\n            if self.tokenizer._pad_token is None:\n                raise ValueError(\n                    \"You are attempting to pad samples but the tokenizer you are using\"\n                    f\" ({self.tokenizer.__class__.__name__}) does not have one.\"\n                )\n            return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)\n\n    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        masked_indices = torch.bernoulli(probability_matrix).bool()\n        labels[~masked_indices] = -100  # We only compute loss on masked tokens\n\n        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])\n        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices\n        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)\n\n        # 10% of the time, we replace masked input tokens with random word\n        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced\n        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)\n        inputs[indices_random] = random_words[indices_random]\n\n        # The rest of the time (10% of the time) we keep the masked input tokens unchanged\n        return inputs, labels\n\n    def mask_tokens2(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            inputs[i][j] = self.tokenizer.mask_token_id\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens3(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.85:\n                                for k in range(j,min(j+5,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.7647:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.5384:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.42857:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens4(self, inputs: torch.Tensor) -> 
Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        inputs = inputs.numpy()\n        ids = [i for i in range(len(inputs))]\n        random.shuffle(ids)\n        inputs = inputs[ids]\n        inputs = torch.from_numpy(inputs)\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        total_token = 0\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n\n        cur_token = 0\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        ngramFlag = True\n        for i in range(len(probability_matrix)):\n            if cur_token > total_token * 0.03:\n                ngramFlag = False\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9 and ngramFlag:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.222 and ngramFlag:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857 and ngramFlag:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                     
   cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens5(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        pvals = [0.4, 0.3, 0.2, 0.1]\n        ngrams = np.arange(1, 5, dtype=np.int64)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            choose = random.randint(0, 1)\n            if choose == 0:\n                startIndex = 0\n                endIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n            elif choose == 1:\n                startIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n                endIndex = np.argwhere(inputs[i] == np.float32(3))[-1][0]\n\n            valid_j = [index for index in range(startIndex, endIndex + 1)]\n\n            for j in range(len(probability_matrix[0])):\n                if cur_token < total_token * 0.15:\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        n = np.random.choice(ngrams, p=pvals)\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if j+k in valid_j:\n                                if random.random() > 0.7:\n                                    if 
random.random() > 0.2:\n                                        if probability_matrix[i][j+k] == np.float32(0.15):\n                                            inputs[i][j+k] = self.tokenizer.mask_token_id\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    elif random.random() > 0.5:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    else:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                else:\n                                    labels[i][j] = np.float32(-100)\n                            else:\n                                labels[i][j] = np.float32(-100)\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens6(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token > total_token*0.15:\n                    break\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9:\n                                for k in range(j, min(j + 4, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.222:\n                                for k in range(j, min(j + 3, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857:\n                                for k in range(j, min(j + 2, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i, j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            cur_token += 1\n\n                    else:\n                        labels[i][j] = np.float32(-100)\n\n\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        
inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens7(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        ngrams = np.arange(1, 3 + 1, dtype=np.int64)\n        pvals = 1. / np.arange(1, 3 + 1)\n        pvals /= pvals.sum(keepdims=True)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token <= total_token * 0.15:\n                    n = np.random.choice(ngrams, p=pvals)\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if random.random() > 0.85:\n                                if random.random() > 0.2:\n                                    if probability_matrix[i][j+k] == np.float32(0.15):\n                                        inputs[i][j+k] = self.tokenizer.mask_token_id\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                elif random.random() > 0.5:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                else:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                            else:\n                                labels[i][j] = np.float32(-100)\n\n                    else:\n                     
   labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/datasets/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import GlueDataset, GlueDataTrainingArguments\nfrom .language_modeling import LineByLineTextDataset, TextDataset\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/datasets/glue.py",
    "content": "import logging\nimport os\nimport time\nfrom dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom ...tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors\nfrom ..processors.utils import InputFeatures\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass GlueDataTrainingArguments:\n    \"\"\"\n    Arguments pertaining to what data we are going to input our model for training and eval.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    task_name: str = field(metadata={\"help\": \"The name of the task to train on: \" + \", \".join(glue_processors.keys())})\n    data_dir: str = field(\n        metadata={\"help\": \"The input data dir. Should contain the .tsv files (or other data files) for the task.\"}\n    )\n    max_seq_length: int = field(\n        default=128,\n        metadata={\n            \"help\": \"The maximum total input sequence length after tokenization. Sequences longer \"\n            \"than this will be truncated, sequences shorter will be padded.\"\n        },\n    )\n    overwrite_cache: bool = field(\n        default=False, metadata={\"help\": \"Overwrite the cached training and evaluation sets\"}\n    )\n\n    def __post_init__(self):\n        self.task_name = self.task_name.lower()\n\n\nclass Split(Enum):\n    train = \"train\"\n    dev = \"dev\"\n    test = \"test\"\n\n\nclass GlueDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    args: GlueDataTrainingArguments\n    output_mode: str\n    features: List[InputFeatures]\n\n    def __init__(\n        self,\n        args: GlueDataTrainingArguments,\n        tokenizer: PreTrainedTokenizer,\n        limit_length: Optional[int] = None,\n        mode: Union[str, Split] = Split.train,\n    ):\n        self.args = args\n        self.processor = glue_processors[args.task_name]()\n        self.output_mode = glue_output_modes[args.task_name]\n        if isinstance(mode, str):\n            try:\n                mode = Split[mode]\n            except KeyError:\n                raise KeyError(\"mode is not a valid split name\")\n        # Load data features from cache or dataset file\n        cached_features_file = os.path.join(\n            args.data_dir,\n            \"cached_{}_{}_{}_{}\".format(\n                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,\n            ),\n        )\n        label_list = self.processor.get_labels()\n        if args.task_name in [\"mnli\", \"mnli-mm\"] and tokenizer.__class__ in (\n            RobertaTokenizer,\n            RobertaTokenizerFast,\n            XLMRobertaTokenizer,\n        ):\n            # HACK(label indices are swapped in RoBERTa pretrained model)\n            label_list[1], label_list[2] = label_list[2], label_list[1]\n        self.label_list = label_list\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with 
FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not args.overwrite_cache:\n                start = time.time()\n                self.features = torch.load(cached_features_file)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n            else:\n                logger.info(f\"Creating features from dataset file at {args.data_dir}\")\n\n                if mode == Split.dev:\n                    examples = self.processor.get_dev_examples(args.data_dir)\n                elif mode == Split.test:\n                    examples = self.processor.get_test_examples(args.data_dir)\n                else:\n                    examples = self.processor.get_train_examples(args.data_dir)\n                if limit_length is not None:\n                    examples = examples[:limit_length]\n                self.features = glue_convert_examples_to_features(\n                    examples,\n                    tokenizer,\n                    max_length=args.max_seq_length,\n                    label_list=label_list,\n                    output_mode=self.output_mode,\n                )\n                start = time.time()\n                torch.save(self.features, cached_features_file)\n                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.features)\n\n    def __getitem__(self, i) -> InputFeatures:\n        return self.features[i]\n\n    def get_labels(self):\n        return self.label_list\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/datasets/language_modeling.py",
    "content": "import logging\nimport os\nimport pickle\nimport time\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(\n        self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,\n    ):\n        assert os.path.isfile(file_path)\n\n        block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)\n\n        directory, filename = os.path.split(file_path)\n        cached_features_file = os.path.join(\n            directory, \"cached_lm_{}_{}_{}\".format(tokenizer.__class__.__name__, str(block_size), filename,),\n        )\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not overwrite_cache:\n                start = time.time()\n                with open(cached_features_file, \"rb\") as handle:\n                    self.examples = pickle.load(handle)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n\n            else:\n                logger.info(f\"Creating features from dataset file at {directory}\")\n\n                self.examples = []\n                with open(file_path, encoding=\"utf-8\") as f:\n                    text = f.read()\n\n                tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))\n\n                for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size\n                    self.examples.append(\n                        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])\n                    )\n                # Note that we are losing the last truncated example here for the sake of simplicity (no padding)\n                # If your dataset is small, first you should loook for a bigger one :-) and second you\n                # can change this behavior by adding (model specific) padding.\n\n                start = time.time()\n                with open(cached_features_file, \"wb\") as handle:\n                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n\n\nclass LineByLineTextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):\n        assert os.path.isfile(file_path)\n        # Here, we do not cache the features, operating under the assumption\n        # that we will soon use fast multithreaded tokenizers from the\n        # `tokenizers` repo everywhere =)\n        logger.info(\"Creating features from dataset file at %s\", file_path)\n\n        with 
open(file_path, encoding=\"utf-8\") as f:\n            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]\n\n        batch_encoding = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)\n        self.examples = batch_encoding[\"input_ids\"]\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/metrics/__init__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\ntry:\n    from scipy.stats import pearsonr, spearmanr\n    from sklearn.metrics import matthews_corrcoef, f1_score\n\n    _has_sklearn = True\nexcept (AttributeError, ImportError):\n    _has_sklearn = False\n\n\ndef is_sklearn_available():\n    return _has_sklearn\n\n\nif _has_sklearn:\n\n    def simple_accuracy(preds, labels):\n        return (preds == labels).mean()\n\n    def acc_and_f1(preds, labels):\n        acc = simple_accuracy(preds, labels)\n        f1 = f1_score(y_true=labels, y_pred=preds)\n        return {\n            \"acc\": acc,\n            \"f1\": f1,\n            \"acc_and_f1\": (acc + f1) / 2,\n        }\n\n    def pearson_and_spearman(preds, labels):\n        pearson_corr = pearsonr(preds, labels)[0]\n        spearman_corr = spearmanr(preds, labels)[0]\n        return {\n            \"pearson\": pearson_corr,\n            \"spearmanr\": spearman_corr,\n            \"corr\": (pearson_corr + spearman_corr) / 2,\n        }\n\n    def glue_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"cola\":\n            return {\"mcc\": matthews_corrcoef(labels, preds)}\n        elif task_name == \"sst-2\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mrpc\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"sts-b\":\n            return pearson_and_spearman(preds, labels)\n        elif task_name == \"qqp\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"mnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mnli-mm\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"qnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"rte\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"wnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"hans\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n\n    def xnli_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"xnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/metrics/squad_metrics.py",
    "content": "\"\"\" Very heavily inspired by the official evaluation script for SQuAD version 2.0 which was\nmodified by XLNet authors to update `find_best_threshold` scripts for SQuAD V2.0\n\nIn addition to basic functionality, we also compute additional statistics and\nplot precision-recall curves if an additional na_prob.json file is provided.\nThis file is expected to map question ID's to the model's predicted probability\nthat a question is unanswerable.\n\"\"\"\n\n\nimport collections\nimport json\nimport logging\nimport math\nimport re\nimport string\n\nfrom transformers.tokenization_bert import BasicTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef normalize_answer(s):\n    \"\"\"Lower text and remove punctuation, articles and extra whitespace.\"\"\"\n\n    def remove_articles(text):\n        regex = re.compile(r\"\\b(a|an|the)\\b\", re.UNICODE)\n        return re.sub(regex, \" \", text)\n\n    def white_space_fix(text):\n        return \" \".join(text.split())\n\n    def remove_punc(text):\n        exclude = set(string.punctuation)\n        return \"\".join(ch for ch in text if ch not in exclude)\n\n    def lower(text):\n        return text.lower()\n\n    return white_space_fix(remove_articles(remove_punc(lower(s))))\n\n\ndef get_tokens(s):\n    if not s:\n        return []\n    return normalize_answer(s).split()\n\n\ndef compute_exact(a_gold, a_pred):\n    return int(normalize_answer(a_gold) == normalize_answer(a_pred))\n\n\ndef compute_f1(a_gold, a_pred):\n    gold_toks = get_tokens(a_gold)\n    pred_toks = get_tokens(a_pred)\n    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)\n    num_same = sum(common.values())\n    if len(gold_toks) == 0 or len(pred_toks) == 0:\n        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise\n        return int(gold_toks == pred_toks)\n    if num_same == 0:\n        return 0\n    precision = 1.0 * num_same / len(pred_toks)\n    recall = 1.0 * num_same / len(gold_toks)\n    f1 = (2 * precision * recall) / (precision + recall)\n    return f1\n\n\ndef get_raw_scores(examples, preds):\n    \"\"\"\n    Computes the exact and f1 scores from the examples and the model predictions\n    \"\"\"\n    exact_scores = {}\n    f1_scores = {}\n\n    for example in examples:\n        qas_id = example.qas_id\n        gold_answers = [answer[\"text\"] for answer in example.answers if normalize_answer(answer[\"text\"])]\n\n        if not gold_answers:\n            # For unanswerable questions, only correct answer is empty string\n            gold_answers = [\"\"]\n\n        if qas_id not in preds:\n            print(\"Missing prediction for %s\" % qas_id)\n            continue\n\n        prediction = preds[qas_id]\n        exact_scores[qas_id] = max(compute_exact(a, prediction) for a in gold_answers)\n        f1_scores[qas_id] = max(compute_f1(a, prediction) for a in gold_answers)\n\n    return exact_scores, f1_scores\n\n\ndef apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):\n    new_scores = {}\n    for qid, s in scores.items():\n        pred_na = na_probs[qid] > na_prob_thresh\n        if pred_na:\n            new_scores[qid] = float(not qid_to_has_ans[qid])\n        else:\n            new_scores[qid] = s\n    return new_scores\n\n\ndef make_eval_dict(exact_scores, f1_scores, qid_list=None):\n    if not qid_list:\n        total = len(exact_scores)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores.values()) / total),\n            
    (\"f1\", 100.0 * sum(f1_scores.values()) / total),\n                (\"total\", total),\n            ]\n        )\n    else:\n        total = len(qid_list)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores[k] for k in qid_list) / total),\n                (\"f1\", 100.0 * sum(f1_scores[k] for k in qid_list) / total),\n                (\"total\", total),\n            ]\n        )\n\n\ndef merge_eval(main_eval, new_eval, prefix):\n    for k in new_eval:\n        main_eval[\"%s_%s\" % (prefix, k)] = new_eval[k]\n\n\ndef find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for i, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n\n    has_ans_score, has_ans_cnt = 0, 0\n    for qid in qid_list:\n        if not qid_to_has_ans[qid]:\n            continue\n        has_ans_cnt += 1\n\n        if qid not in scores:\n            continue\n        has_ans_score += scores[qid]\n\n    return 100.0 * best_score / len(scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt\n\n\ndef find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(preds, f1_raw, na_probs, qid_to_has_ans)\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n    main_eval[\"has_ans_exact\"] = has_ans_exact\n    main_eval[\"has_ans_f1\"] = has_ans_f1\n\n\ndef find_best_thresh(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for _, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n    return 100.0 * best_score / len(scores), best_thresh\n\n\ndef find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)\n\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n\n\ndef squad_evaluate(examples, preds, no_answer_probs=None, no_answer_probability_threshold=1.0):\n    qas_id_to_has_answer = {example.qas_id: 
bool(example.answers) for example in examples}\n    has_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if has_answer]\n    no_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if not has_answer]\n\n    if no_answer_probs is None:\n        no_answer_probs = {k: 0.0 for k in preds}\n\n    exact, f1 = get_raw_scores(examples, preds)\n\n    exact_threshold = apply_no_ans_threshold(\n        exact, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold\n    )\n    f1_threshold = apply_no_ans_threshold(f1, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold)\n\n    evaluation = make_eval_dict(exact_threshold, f1_threshold)\n\n    if has_answer_qids:\n        has_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=has_answer_qids)\n        merge_eval(evaluation, has_ans_eval, \"HasAns\")\n\n    if no_answer_qids:\n        no_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=no_answer_qids)\n        merge_eval(evaluation, no_ans_eval, \"NoAns\")\n\n    if no_answer_probs:\n        find_all_best_thresh(evaluation, preds, exact, f1, no_answer_probs, qas_id_to_has_answer)\n\n    return evaluation\n\n\ndef get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):\n    \"\"\"Project the tokenized prediction back to the original text.\"\"\"\n\n    # When we created the data, we kept track of the alignment between original\n    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So\n    # now `orig_text` contains the span of our original text corresponding to the\n    # span that we predicted.\n    #\n    # However, `orig_text` may contain extra characters that we don't want in\n    # our prediction.\n    #\n    # For example, let's say:\n    #   pred_text = steve smith\n    #   orig_text = Steve Smith's\n    #\n    # We don't want to return `orig_text` because it contains the extra \"'s\".\n    #\n    # We don't want to return `pred_text` because it's already been normalized\n    # (the SQuAD eval script also does punctuation stripping/lower casing but\n    # our tokenizer does additional normalization like stripping accent\n    # characters).\n    #\n    # What we really want to return is \"Steve Smith\".\n    #\n    # Therefore, we have to apply a semi-complicated alignment heuristic between\n    # `pred_text` and `orig_text` to get a character-to-character alignment. This\n    # can fail in certain cases in which case we just return `orig_text`.\n\n    def _strip_spaces(text):\n        ns_chars = []\n        ns_to_s_map = collections.OrderedDict()\n        for (i, c) in enumerate(text):\n            if c == \" \":\n                continue\n            ns_to_s_map[len(ns_chars)] = i\n            ns_chars.append(c)\n        ns_text = \"\".join(ns_chars)\n        return (ns_text, ns_to_s_map)\n\n    # We first tokenize `orig_text`, strip whitespace from the result\n    # and `pred_text`, and check if they are the same length. If they are\n    # NOT the same length, the heuristic has failed. 
If they are the same\n    # length, we assume the characters are one-to-one aligned.\n    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)\n\n    tok_text = \" \".join(tokenizer.tokenize(orig_text))\n\n    start_position = tok_text.find(pred_text)\n    if start_position == -1:\n        if verbose_logging:\n            logger.info(\"Unable to find text: '%s' in '%s'\" % (pred_text, orig_text))\n        return orig_text\n    end_position = start_position + len(pred_text) - 1\n\n    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)\n    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)\n\n    if len(orig_ns_text) != len(tok_ns_text):\n        if verbose_logging:\n            logger.info(\"Length not equal after stripping spaces: '%s' vs '%s'\", orig_ns_text, tok_ns_text)\n        return orig_text\n\n    # We then project the characters in `pred_text` back to `orig_text` using\n    # the character-to-character alignment.\n    tok_s_to_ns_map = {}\n    for (i, tok_index) in tok_ns_to_s_map.items():\n        tok_s_to_ns_map[tok_index] = i\n\n    orig_start_position = None\n    if start_position in tok_s_to_ns_map:\n        ns_start_position = tok_s_to_ns_map[start_position]\n        if ns_start_position in orig_ns_to_s_map:\n            orig_start_position = orig_ns_to_s_map[ns_start_position]\n\n    if orig_start_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map start position\")\n        return orig_text\n\n    orig_end_position = None\n    if end_position in tok_s_to_ns_map:\n        ns_end_position = tok_s_to_ns_map[end_position]\n        if ns_end_position in orig_ns_to_s_map:\n            orig_end_position = orig_ns_to_s_map[ns_end_position]\n\n    if orig_end_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map end position\")\n        return orig_text\n\n    output_text = orig_text[orig_start_position : (orig_end_position + 1)]\n    return output_text\n\n\ndef _get_best_indexes(logits, n_best_size):\n    \"\"\"Get the n-best logits from a list.\"\"\"\n    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)\n\n    best_indexes = []\n    for i in range(len(index_and_score)):\n        if i >= n_best_size:\n            break\n        best_indexes.append(index_and_score[i][0])\n    return best_indexes\n\n\ndef _compute_softmax(scores):\n    \"\"\"Compute softmax probability over raw logits.\"\"\"\n    if not scores:\n        return []\n\n    max_score = None\n    for score in scores:\n        if max_score is None or score > max_score:\n            max_score = score\n\n    exp_scores = []\n    total_sum = 0.0\n    for score in scores:\n        x = math.exp(score - max_score)\n        exp_scores.append(x)\n        total_sum += x\n\n    probs = []\n    for score in exp_scores:\n        probs.append(score / total_sum)\n    return probs\n\n\ndef compute_predictions_logits(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    do_lower_case,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    verbose_logging,\n    version_2_with_negative,\n    null_score_diff_threshold,\n    tokenizer,\n):\n    \"\"\"Write final predictions to the json file and log-odds of null if needed.\"\"\"\n    if output_prediction_file:\n        logger.info(f\"Writing predictions to: {output_prediction_file}\")\n    if output_nbest_file:\n        logger.info(f\"Writing nbest to: {output_nbest_file}\")\n    if 
output_null_log_odds_file and version_2_with_negative:\n        logger.info(f\"Writing null_log_odds to: {output_null_log_odds_file}\")\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_logit\", \"end_logit\"]\n    )\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n        min_null_feature_index = 0  # the paragraph slice with min null score\n        null_start_logit = 0  # the start logit at the slice with min null score\n        null_end_logit = 0  # the end logit at the slice with min null score\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n            start_indexes = _get_best_indexes(result.start_logits, n_best_size)\n            end_indexes = _get_best_indexes(result.end_logits, n_best_size)\n            # if we could have irrelevant answers, get the min score of irrelevant\n            if version_2_with_negative:\n                feature_null_score = result.start_logits[0] + result.end_logits[0]\n                if feature_null_score < score_null:\n                    score_null = feature_null_score\n                    min_null_feature_index = feature_index\n                    null_start_logit = result.start_logits[0]\n                    null_end_logit = result.end_logits[0]\n            for start_index in start_indexes:\n                for end_index in end_indexes:\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= len(feature.tokens):\n                        continue\n                    if end_index >= len(feature.tokens):\n                        continue\n                    if start_index not in feature.token_to_orig_map:\n                        continue\n                    if end_index not in feature.token_to_orig_map:\n                        continue\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_logit=result.start_logits[start_index],\n                            end_logit=result.end_logits[end_index],\n                        )\n                    )\n        if version_2_with_negative:\n            prelim_predictions.append(\n                _PrelimPrediction(\n                    feature_index=min_null_feature_index,\n                    start_index=0,\n                    end_index=0,\n                    start_logit=null_start_logit,\n                    end_logit=null_end_logit,\n                )\n            )\n        prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True)\n\n        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n            \"NbestPrediction\", [\"text\", \"start_logit\", \"end_logit\"]\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n            if pred.start_index > 0:  # this is a non-null prediction\n                tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n                orig_doc_start = feature.token_to_orig_map[pred.start_index]\n                orig_doc_end = feature.token_to_orig_map[pred.end_index]\n                orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n\n                tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n                # tok_text = \" \".join(tok_tokens)\n                #\n                # # De-tokenize WordPieces that have been split off.\n                # tok_text = tok_text.replace(\" ##\", \"\")\n                # tok_text = tok_text.replace(\"##\", \"\")\n\n                # Clean whitespace\n                tok_text = tok_text.strip()\n                tok_text = \" \".join(tok_text.split())\n                orig_text = \" \".join(orig_tokens)\n\n                final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n                if final_text in seen_predictions:\n                    continue\n\n                seen_predictions[final_text] = True\n            else:\n                final_text = \"\"\n                seen_predictions[final_text] = True\n\n            nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit))\n        # if we didn't include the 
empty option in the n-best, include it\n        if version_2_with_negative:\n            if \"\" not in seen_predictions:\n                nbest.append(_NbestPrediction(text=\"\", start_logit=null_start_logit, end_logit=null_end_logit))\n\n            # In very rare edge cases we could only have single null prediction.\n            # So we just create a nonce prediction in this case to avoid failure.\n            if len(nbest) == 1:\n                nbest.insert(0, _NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        # In very rare edge cases we could have no valid predictions. So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        assert len(nbest) >= 1\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_logit + entry.end_logit)\n            if not best_non_null_entry:\n                if entry.text:\n                    best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_logit\"] = entry.start_logit\n            output[\"end_logit\"] = entry.end_logit\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n\n        if not version_2_with_negative:\n            all_predictions[example.qas_id] = nbest_json[0][\"text\"]\n        else:\n            # predict \"\" iff the null score - the score of best non-null > threshold\n            score_diff = score_null - best_non_null_entry.start_logit - (best_non_null_entry.end_logit)\n            scores_diff_json[example.qas_id] = score_diff\n            if score_diff > null_score_diff_threshold:\n                all_predictions[example.qas_id] = \"\"\n            else:\n                all_predictions[example.qas_id] = best_non_null_entry.text\n        all_nbest_json[example.qas_id] = nbest_json\n\n    if output_prediction_file:\n        with open(output_prediction_file, \"w\") as writer:\n            writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    if output_nbest_file:\n        with open(output_nbest_file, \"w\") as writer:\n            writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if output_null_log_odds_file and version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n\n\ndef compute_predictions_log_probs(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    start_n_top,\n    end_n_top,\n    version_2_with_negative,\n    tokenizer,\n    verbose_logging,\n):\n    \"\"\" XLNet write prediction logic (more complex than Bert's).\n        Write final predictions to the json file and log-odds of null if needed.\n\n        Requires utils_squad_evaluate.py\n    \"\"\"\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    _NbestPrediction = 
collections.namedtuple(  # pylint: disable=invalid-name\n        \"NbestPrediction\", [\"text\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    logger.info(\"Writing predictions to: %s\", output_prediction_file)\n    # logger.info(\"Writing nbest to: %s\" % (output_nbest_file))\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n\n            cur_null_score = result.cls_logits\n\n            # if we could have irrelevant answers, get the min score of irrelevant\n            score_null = min(score_null, cur_null_score)\n\n            for i in range(start_n_top):\n                for j in range(end_n_top):\n                    start_log_prob = result.start_logits[i]\n                    start_index = result.start_top_index[i]\n\n                    j_index = i * end_n_top + j\n\n                    end_log_prob = result.end_logits[j_index]\n                    end_index = result.end_top_index[j_index]\n\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= feature.paragraph_len - 1:\n                        continue\n                    if end_index >= feature.paragraph_len - 1:\n                        continue\n\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_log_prob=start_log_prob,\n                            end_log_prob=end_log_prob,\n                        )\n                    )\n\n        prelim_predictions = sorted(\n            prelim_predictions, key=lambda x: (x.start_log_prob + x.end_log_prob), reverse=True\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n\n            # XLNet un-tokenizer\n            # Let's keep it simple for now and see if we need all this later.\n            #\n            # tok_start_to_orig_index = feature.tok_start_to_orig_index\n            # tok_end_to_orig_index = feature.tok_end_to_orig_index\n            # start_orig_pos = tok_start_to_orig_index[pred.start_index]\n            # end_orig_pos = tok_end_to_orig_index[pred.end_index]\n            # paragraph_text = example.paragraph_text\n            # final_text = paragraph_text[start_orig_pos: end_orig_pos + 1].strip()\n\n            # Previously used Bert untokenizer\n            tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n            orig_doc_start = feature.token_to_orig_map[pred.start_index]\n            orig_doc_end = feature.token_to_orig_map[pred.end_index]\n            orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n            tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n            # Clean whitespace\n            tok_text = tok_text.strip()\n            tok_text = \" \".join(tok_text.split())\n            orig_text = \" \".join(orig_tokens)\n\n            if hasattr(tokenizer, \"do_lower_case\"):\n                do_lower_case = tokenizer.do_lower_case\n            else:\n                do_lower_case = tokenizer.do_lowercase_and_remove_accent\n\n            final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n\n            if final_text in seen_predictions:\n                continue\n\n            seen_predictions[final_text] = True\n\n            nbest.append(\n                _NbestPrediction(text=final_text, start_log_prob=pred.start_log_prob, end_log_prob=pred.end_log_prob)\n            )\n\n        # In very rare edge cases we could have no valid predictions. 
So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"\", start_log_prob=-1e6, end_log_prob=-1e6))\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_log_prob + entry.end_log_prob)\n            if not best_non_null_entry:\n                best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_log_prob\"] = entry.start_log_prob\n            output[\"end_log_prob\"] = entry.end_log_prob\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n        assert best_non_null_entry is not None\n\n        score_diff = score_null\n        scores_diff_json[example.qas_id] = score_diff\n        # note(zhiliny): always predict best_non_null_entry\n        # and the evaluation script will search for the best threshold\n        all_predictions[example.qas_id] = best_non_null_entry.text\n\n        all_nbest_json[example.qas_id] = nbest_json\n\n    with open(output_prediction_file, \"w\") as writer:\n        writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    with open(output_nbest_file, \"w\") as writer:\n        writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/processors/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels\nfrom .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features\nfrom .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor\nfrom .xnli import xnli_output_modes, xnli_processors, xnli_tasks_num_labels\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/processors/glue.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" GLUE processors and helpers \"\"\"\n\nimport logging\nimport os\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom .utils import DataProcessor, InputExample, InputFeatures\n\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef glue_convert_examples_to_features(\n    examples: Union[List[InputExample], \"tf.data.Dataset\"],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    \"\"\"\n    Loads a data file into a list of ``InputFeatures``\n\n    Args:\n        examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.\n        tokenizer: Instance of a tokenizer that will tokenize the examples\n        max_length: Maximum example length. Defaults to the tokenizer's max_len\n        task: GLUE task\n        label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n        output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n\n    Returns:\n        If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n        containing the task-specific features. 
If the input is a list of ``InputExamples``, will return\n        a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n    \"\"\"\n    if is_tf_available() and isinstance(examples, tf.data.Dataset):\n        if task is None:\n            raise ValueError(\"When calling glue_convert_examples_to_features from TF, the task parameter is required.\")\n        return _tf_glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n    return _glue_convert_examples_to_features(\n        examples, tokenizer, max_length=max_length, task=task, label_list=label_list, output_mode=output_mode\n    )\n\n\nif is_tf_available():\n\n    def _tf_glue_convert_examples_to_features(\n        examples: tf.data.Dataset, tokenizer: PreTrainedTokenizer, task=str, max_length: Optional[int] = None,\n    ) -> tf.data.Dataset:\n        \"\"\"\n        Returns:\n            A ``tf.data.Dataset`` containing the task-specific features.\n\n        \"\"\"\n        processor = glue_processors[task]()\n        examples = [processor.tfds_map(processor.get_example_from_tensor_dict(example)) for example in examples]\n        features = glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n\n        def gen():\n            for ex in features:\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                    },\n                    ex.label,\n                )\n\n        return tf.data.Dataset.from_generator(\n            gen,\n            ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32, \"token_type_ids\": tf.int32}, tf.int64),\n            (\n                {\n                    \"input_ids\": tf.TensorShape([None]),\n                    \"attention_mask\": tf.TensorShape([None]),\n                    \"token_type_ids\": tf.TensorShape([None]),\n                },\n                tf.TensorShape([]),\n            ),\n        )\n\n\ndef _glue_convert_examples_to_features(\n    examples: List[InputExample],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    if max_length is None:\n        max_length = tokenizer.max_len\n\n    if task is not None:\n        processor = glue_processors[task]()\n        if label_list is None:\n            label_list = processor.get_labels()\n            logger.info(\"Using label list %s for task %s\" % (label_list, task))\n        if output_mode is None:\n            output_mode = glue_output_modes[task]\n            logger.info(\"Using output mode %s for task %s\" % (output_mode, task))\n\n    label_map = {label: i for i, label in enumerate(label_list)}\n\n    def label_from_example(example: InputExample) -> Union[int, float, None]:\n        if example.label is None:\n            return None\n        if output_mode == \"classification\":\n            return label_map[example.label]\n        elif output_mode == \"regression\":\n            return float(example.label)\n        raise KeyError(output_mode)\n\n    labels = [label_from_example(example) for example in examples]\n\n    batch_encoding = tokenizer.batch_encode_plus(\n        [(example.text_a, example.text_b) for example in examples], max_length=max_length, pad_to_max_length=True,\n    )\n\n    features = []\n    for i in range(len(examples)):\n        inputs = {k: 
batch_encoding[k][i] for k in batch_encoding}\n\n        feature = InputFeatures(**inputs, label=labels[i])\n        features.append(feature)\n\n    for i, example in enumerate(examples[:5]):\n        logger.info(\"*** Example ***\")\n        logger.info(\"guid: %s\" % (example.guid))\n        logger.info(\"features: %s\" % features[i])\n\n    return features\n\n\nclass OutputMode(Enum):\n    classification = \"classification\"\n    regression = \"regression\"\n\n\nclass MrpcProcessor(DataProcessor):\n    \"\"\"Processor for the MRPC data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        logger.info(\"LOOKING AT {}\".format(os.path.join(data_dir, \"train.tsv\")))\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[3]\n            text_b = line[4]\n            label = None if set_type == \"test\" else line[0]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliProcessor(DataProcessor):\n    \"\"\"Processor for the MultiNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"premise\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"hypothesis\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_matched.tsv\")), \"dev_matched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_matched.tsv\")), \"test_matched\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n   
     for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[8]\n            text_b = line[9]\n            label = None if set_type.startswith(\"test\") else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliMismatchedProcessor(MnliProcessor):\n    \"\"\"Processor for the MultiNLI Mismatched data set (GLUE version).\"\"\"\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_mismatched.tsv\")), \"dev_mismatched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_mismatched.tsv\")), \"test_mismatched\")\n\n\nclass ColaProcessor(DataProcessor):\n    \"\"\"Processor for the CoLA data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        if test_mode:\n            lines = lines[1:]\n        text_index = 1 if test_mode else 3\n        examples = []\n        for (i, line) in enumerate(lines):\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if test_mode else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass Sst2Processor(DataProcessor):\n    \"\"\"Processor for the SST-2 data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return 
self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        text_index = 1 if set_type == \"test\" else 0\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if set_type == \"test\" else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass StsbProcessor(DataProcessor):\n    \"\"\"Processor for the STS-B data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [None]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[7]\n            text_b = line[8]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QqpProcessor(DataProcessor):\n    \"\"\"Processor for the QQP data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"question2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def 
_create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        q1_index = 1 if test_mode else 3\n        q2_index = 2 if test_mode else 4\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            try:\n                text_a = line[q1_index]\n                text_b = line[q2_index]\n                label = None if test_mode else line[5]\n            except IndexError:\n                continue\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QnliProcessor(DataProcessor):\n    \"\"\"Processor for the QNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", \"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass RteProcessor(DataProcessor):\n    \"\"\"Processor for the RTE data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", 
\"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass WnliProcessor(DataProcessor):\n    \"\"\"Processor for the WNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nglue_tasks_num_labels = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\nglue_processors = {\n    \"cola\": ColaProcessor,\n    \"mnli\": MnliProcessor,\n    \"mnli-mm\": MnliMismatchedProcessor,\n    \"mrpc\": MrpcProcessor,\n    \"sst-2\": Sst2Processor,\n    \"sts-b\": StsbProcessor,\n    \"qqp\": QqpProcessor,\n    \"qnli\": QnliProcessor,\n    \"rte\": RteProcessor,\n    \"wnli\": WnliProcessor,\n}\n\nglue_output_modes = {\n    \"cola\": \"classification\",\n    \"mnli\": \"classification\",\n    \"mnli-mm\": \"classification\",\n    \"mrpc\": \"classification\",\n    \"sst-2\": \"classification\",\n    \"sts-b\": \"regression\",\n    \"qqp\": \"classification\",\n    \"qnli\": \"classification\",\n    \"rte\": \"classification\",\n    \"wnli\": \"classification\",\n}\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/processors/squad.py",
    "content": "import json\nimport logging\nimport os\nfrom functools import partial\nfrom multiprocessing import Pool, cpu_count\n\nimport numpy as np\nfrom tqdm import tqdm\n\nfrom ...file_utils import is_tf_available, is_torch_available\nfrom ...tokenization_bert import whitespace_tokenize\nfrom .utils import DataProcessor\n\n\nif is_torch_available():\n    import torch\n    from torch.utils.data import TensorDataset\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):\n    \"\"\"Returns tokenized answer spans that better match the annotated answer.\"\"\"\n    tok_answer_text = \" \".join(tokenizer.tokenize(orig_answer_text))\n\n    for new_start in range(input_start, input_end + 1):\n        for new_end in range(input_end, new_start - 1, -1):\n            text_span = \" \".join(doc_tokens[new_start : (new_end + 1)])\n            if text_span == tok_answer_text:\n                return (new_start, new_end)\n\n    return (input_start, input_end)\n\n\ndef _check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span.start + doc_span.length - 1\n        if position < doc_span.start:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span.start\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _new_check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    # if len(doc_spans) == 1:\n    # return True\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span[\"start\"] + doc_span[\"length\"] - 1\n        if position < doc_span[\"start\"]:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span[\"start\"]\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span[\"length\"]\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _is_whitespace(c):\n    if c == \" \" or c == \"\\t\" or c == \"\\r\" or c == \"\\n\" or ord(c) == 0x202F:\n        return True\n    return False\n\n\ndef squad_convert_example_to_features(example, max_seq_length, doc_stride, max_query_length, is_training):\n    features = []\n    if is_training and not example.is_impossible:\n        # Get start and end position\n        start_position = example.start_position\n        end_position = example.end_position\n\n        # If the answer cannot be found in the text, then skip this example.\n        actual_text = \" \".join(example.doc_tokens[start_position : (end_position + 1)])\n        cleaned_answer_text = \" \".join(whitespace_tokenize(example.answer_text))\n        if actual_text.find(cleaned_answer_text) == -1:\n            logger.warning(\"Could not find 
answer: '%s' vs. '%s'\", actual_text, cleaned_answer_text)\n            return []\n\n    tok_to_orig_index = []\n    orig_to_tok_index = []\n    all_doc_tokens = []\n    for (i, token) in enumerate(example.doc_tokens):\n        orig_to_tok_index.append(len(all_doc_tokens))\n        sub_tokens = tokenizer.tokenize(token)\n        for sub_token in sub_tokens:\n            tok_to_orig_index.append(i)\n            all_doc_tokens.append(sub_token)\n\n    if is_training and not example.is_impossible:\n        tok_start_position = orig_to_tok_index[example.start_position]\n        if example.end_position < len(example.doc_tokens) - 1:\n            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1\n        else:\n            tok_end_position = len(all_doc_tokens) - 1\n\n        (tok_start_position, tok_end_position) = _improve_answer_span(\n            all_doc_tokens, tok_start_position, tok_end_position, tokenizer, example.answer_text\n        )\n\n    spans = []\n\n    truncated_query = tokenizer.encode(example.question_text, add_special_tokens=False, max_length=max_query_length)\n    sequence_added_tokens = (\n        tokenizer.max_len - tokenizer.max_len_single_sentence + 1\n        if \"roberta\" in str(type(tokenizer)) or \"camembert\" in str(type(tokenizer))\n        else tokenizer.max_len - tokenizer.max_len_single_sentence\n    )\n    sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair\n\n    span_doc_tokens = all_doc_tokens\n    while len(spans) * doc_stride < len(all_doc_tokens):\n\n        encoded_dict = tokenizer.encode_plus(\n            truncated_query if tokenizer.padding_side == \"right\" else span_doc_tokens,\n            span_doc_tokens if tokenizer.padding_side == \"right\" else truncated_query,\n            max_length=max_seq_length,\n            return_overflowing_tokens=True,\n            pad_to_max_length=True,\n            stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,\n            truncation_strategy=\"only_second\" if tokenizer.padding_side == \"right\" else \"only_first\",\n            return_token_type_ids=True,\n        )\n\n        paragraph_len = min(\n            len(all_doc_tokens) - len(spans) * doc_stride,\n            max_seq_length - len(truncated_query) - sequence_pair_added_tokens,\n        )\n\n        if tokenizer.pad_token_id in encoded_dict[\"input_ids\"]:\n            if tokenizer.padding_side == \"right\":\n                non_padded_ids = encoded_dict[\"input_ids\"][: encoded_dict[\"input_ids\"].index(tokenizer.pad_token_id)]\n            else:\n                last_padding_id_position = (\n                    len(encoded_dict[\"input_ids\"]) - 1 - encoded_dict[\"input_ids\"][::-1].index(tokenizer.pad_token_id)\n                )\n                non_padded_ids = encoded_dict[\"input_ids\"][last_padding_id_position + 1 :]\n\n        else:\n            non_padded_ids = encoded_dict[\"input_ids\"]\n\n        tokens = tokenizer.convert_ids_to_tokens(non_padded_ids)\n\n        token_to_orig_map = {}\n        for i in range(paragraph_len):\n            index = len(truncated_query) + sequence_added_tokens + i if tokenizer.padding_side == \"right\" else i\n            token_to_orig_map[index] = tok_to_orig_index[len(spans) * doc_stride + i]\n\n        encoded_dict[\"paragraph_len\"] = paragraph_len\n        encoded_dict[\"tokens\"] = tokens\n        encoded_dict[\"token_to_orig_map\"] = token_to_orig_map\n        
encoded_dict[\"truncated_query_with_special_tokens_length\"] = len(truncated_query) + sequence_added_tokens\n        encoded_dict[\"token_is_max_context\"] = {}\n        encoded_dict[\"start\"] = len(spans) * doc_stride\n        encoded_dict[\"length\"] = paragraph_len\n\n        spans.append(encoded_dict)\n\n        if \"overflowing_tokens\" not in encoded_dict:\n            break\n        span_doc_tokens = encoded_dict[\"overflowing_tokens\"]\n\n    for doc_span_index in range(len(spans)):\n        for j in range(spans[doc_span_index][\"paragraph_len\"]):\n            is_max_context = _new_check_is_max_context(spans, doc_span_index, doc_span_index * doc_stride + j)\n            index = (\n                j\n                if tokenizer.padding_side == \"left\"\n                else spans[doc_span_index][\"truncated_query_with_special_tokens_length\"] + j\n            )\n            spans[doc_span_index][\"token_is_max_context\"][index] = is_max_context\n\n    for span in spans:\n        # Identify the position of the CLS token\n        cls_index = span[\"input_ids\"].index(tokenizer.cls_token_id)\n\n        # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)\n        # Original TF implem also keep the classification token (set to 0)\n        p_mask = np.ones_like(span[\"token_type_ids\"])\n        if tokenizer.padding_side == \"right\":\n            p_mask[len(truncated_query) + sequence_added_tokens :] = 0\n        else:\n            p_mask[-len(span[\"tokens\"]) : -(len(truncated_query) + sequence_added_tokens)] = 0\n\n        pad_token_indices = np.where(span[\"input_ids\"] == tokenizer.pad_token_id)\n        special_token_indices = np.asarray(\n            tokenizer.get_special_tokens_mask(span[\"input_ids\"], already_has_special_tokens=True)\n        ).nonzero()\n\n        p_mask[pad_token_indices] = 1\n        p_mask[special_token_indices] = 1\n\n        # Set the cls index to 0: the CLS index can be used for impossible answers\n        p_mask[cls_index] = 0\n\n        span_is_impossible = example.is_impossible\n        start_position = 0\n        end_position = 0\n        if is_training and not span_is_impossible:\n            # For training, if our document chunk does not contain an annotation\n            # we throw it out, since there is nothing to predict.\n            doc_start = span[\"start\"]\n            doc_end = span[\"start\"] + span[\"length\"] - 1\n            out_of_span = False\n\n            if not (tok_start_position >= doc_start and tok_end_position <= doc_end):\n                out_of_span = True\n\n            if out_of_span:\n                start_position = cls_index\n                end_position = cls_index\n                span_is_impossible = True\n            else:\n                if tokenizer.padding_side == \"left\":\n                    doc_offset = 0\n                else:\n                    doc_offset = len(truncated_query) + sequence_added_tokens\n\n                start_position = tok_start_position - doc_start + doc_offset\n                end_position = tok_end_position - doc_start + doc_offset\n\n        features.append(\n            SquadFeatures(\n                span[\"input_ids\"],\n                span[\"attention_mask\"],\n                span[\"token_type_ids\"],\n                cls_index,\n                p_mask.tolist(),\n                example_index=0,  # Can not set unique_id and example_index here. 
They will be set after multiple processing.\n                unique_id=0,\n                paragraph_len=span[\"paragraph_len\"],\n                token_is_max_context=span[\"token_is_max_context\"],\n                tokens=span[\"tokens\"],\n                token_to_orig_map=span[\"token_to_orig_map\"],\n                start_position=start_position,\n                end_position=end_position,\n                is_impossible=span_is_impossible,\n                qas_id=example.qas_id,\n            )\n        )\n    return features\n\n\ndef squad_convert_example_to_features_init(tokenizer_for_convert):\n    global tokenizer\n    tokenizer = tokenizer_for_convert\n\n\ndef squad_convert_examples_to_features(\n    examples,\n    tokenizer,\n    max_seq_length,\n    doc_stride,\n    max_query_length,\n    is_training,\n    return_dataset=False,\n    threads=1,\n    tqdm_enabled=True,\n):\n    \"\"\"\n    Converts a list of examples into a list of features that can be directly given as input to a model.\n    It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.\n\n    Args:\n        examples: list of :class:`~transformers1.data.processors.squad.SquadExample`\n        tokenizer: an instance of a child of :class:`~transformers1.PreTrainedTokenizer`\n        max_seq_length: The maximum sequence length of the inputs.\n        doc_stride: The stride used when the context is too large and is split across several features.\n        max_query_length: The maximum length of the query.\n        is_training: whether to create features for model evaluation or model training.\n        return_dataset: Default False. Either 'pt' or 'tf'.\n            if 'pt': returns a torch.data.TensorDataset,\n            if 'tf': returns a tf.data.Dataset\n        threads: multiple processing threads.\n\n\n    Returns:\n        list of :class:`~transformers1.data.processors.squad.SquadFeatures`\n\n    Example::\n\n        processor = SquadV2Processor()\n        examples = processor.get_dev_examples(data_dir)\n\n        features = squad_convert_examples_to_features(\n            examples=examples,\n            tokenizer=tokenizer,\n            max_seq_length=args.max_seq_length,\n            doc_stride=args.doc_stride,\n            max_query_length=args.max_query_length,\n            is_training=not evaluate,\n        )\n    \"\"\"\n\n    # Defining helper methods\n    features = []\n    threads = min(threads, cpu_count())\n    with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:\n        annotate_ = partial(\n            squad_convert_example_to_features,\n            max_seq_length=max_seq_length,\n            doc_stride=doc_stride,\n            max_query_length=max_query_length,\n            is_training=is_training,\n        )\n        features = list(\n            tqdm(\n                p.imap(annotate_, examples, chunksize=32),\n                total=len(examples),\n                desc=\"convert squad examples to features\",\n                disable=not tqdm_enabled,\n            )\n        )\n    new_features = []\n    unique_id = 1000000000\n    example_index = 0\n    for example_features in tqdm(\n        features, total=len(features), desc=\"add example index and unique id\", disable=not tqdm_enabled\n    ):\n        if not example_features:\n            continue\n        for example_feature in example_features:\n            example_feature.example_index = example_index\n            example_feature.unique_id = 
unique_id\n            new_features.append(example_feature)\n            unique_id += 1\n        example_index += 1\n    features = new_features\n    del new_features\n    if return_dataset == \"pt\":\n        if not is_torch_available():\n            raise RuntimeError(\"PyTorch must be installed to return a PyTorch dataset.\")\n\n        # Convert to Tensors and build dataset\n        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n        all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n        all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)\n        all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)\n        all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)\n        all_is_impossible = torch.tensor([f.is_impossible for f in features], dtype=torch.float)\n\n        if not is_training:\n            all_feature_index = torch.arange(all_input_ids.size(0), dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids, all_attention_masks, all_token_type_ids, all_feature_index, all_cls_index, all_p_mask\n            )\n        else:\n            all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)\n            all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids,\n                all_attention_masks,\n                all_token_type_ids,\n                all_start_positions,\n                all_end_positions,\n                all_cls_index,\n                all_p_mask,\n                all_is_impossible,\n            )\n\n        return features, dataset\n    elif return_dataset == \"tf\":\n        if not is_tf_available():\n            raise RuntimeError(\"TensorFlow must be installed to return a TensorFlow dataset.\")\n\n        def gen():\n            for i, ex in enumerate(features):\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                        \"feature_index\": i,\n                        \"qas_id\": ex.qas_id,\n                    },\n                    {\n                        \"start_position\": ex.start_position,\n                        \"end_position\": ex.end_position,\n                        \"cls_index\": ex.cls_index,\n                        \"p_mask\": ex.p_mask,\n                        \"is_impossible\": ex.is_impossible,\n                    },\n                )\n\n        # Why have we split the batch into a tuple? 
PyTorch just has a list of tensors.\n        train_types = (\n            {\n                \"input_ids\": tf.int32,\n                \"attention_mask\": tf.int32,\n                \"token_type_ids\": tf.int32,\n                \"feature_index\": tf.int64,\n                \"qas_id\": tf.string,\n            },\n            {\n                \"start_position\": tf.int64,\n                \"end_position\": tf.int64,\n                \"cls_index\": tf.int64,\n                \"p_mask\": tf.int32,\n                \"is_impossible\": tf.int32,\n            },\n        )\n\n        train_shapes = (\n            {\n                \"input_ids\": tf.TensorShape([None]),\n                \"attention_mask\": tf.TensorShape([None]),\n                \"token_type_ids\": tf.TensorShape([None]),\n                \"feature_index\": tf.TensorShape([]),\n                \"qas_id\": tf.TensorShape([]),\n            },\n            {\n                \"start_position\": tf.TensorShape([]),\n                \"end_position\": tf.TensorShape([]),\n                \"cls_index\": tf.TensorShape([]),\n                \"p_mask\": tf.TensorShape([None]),\n                \"is_impossible\": tf.TensorShape([]),\n            },\n        )\n\n        return tf.data.Dataset.from_generator(gen, train_types, train_shapes)\n    else:\n        return features\n\n\nclass SquadProcessor(DataProcessor):\n    \"\"\"\n    Processor for the SQuAD data set.\n    Overriden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.\n    \"\"\"\n\n    train_file = None\n    dev_file = None\n\n    def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False):\n        if not evaluate:\n            answer = tensor_dict[\"answers\"][\"text\"][0].numpy().decode(\"utf-8\")\n            answer_start = tensor_dict[\"answers\"][\"answer_start\"][0].numpy()\n            answers = []\n        else:\n            answers = [\n                {\"answer_start\": start.numpy(), \"text\": text.numpy().decode(\"utf-8\")}\n                for start, text in zip(tensor_dict[\"answers\"][\"answer_start\"], tensor_dict[\"answers\"][\"text\"])\n            ]\n\n            answer = None\n            answer_start = None\n\n        return SquadExample(\n            qas_id=tensor_dict[\"id\"].numpy().decode(\"utf-8\"),\n            question_text=tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            context_text=tensor_dict[\"context\"].numpy().decode(\"utf-8\"),\n            answer_text=answer,\n            start_position_character=answer_start,\n            title=tensor_dict[\"title\"].numpy().decode(\"utf-8\"),\n            answers=answers,\n        )\n\n    def get_examples_from_dataset(self, dataset, evaluate=False):\n        \"\"\"\n        Creates a list of :class:`~transformers1.data.processors.squad.SquadExample` using a TFDS dataset.\n\n        Args:\n            dataset: The tfds dataset loaded from `tensorflow_datasets.load(\"squad\")`\n            evaluate: boolean specifying if in evaluation mode or in training mode\n\n        Returns:\n            List of SquadExample\n\n        Examples::\n\n            import tensorflow_datasets as tfds\n            dataset = tfds.load(\"squad\")\n\n            training_examples = get_examples_from_dataset(dataset, evaluate=False)\n            evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)\n        \"\"\"\n\n        if evaluate:\n            dataset = dataset[\"validation\"]\n        else:\n            dataset = 
dataset[\"train\"]\n\n        examples = []\n        for tensor_dict in tqdm(dataset):\n            examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))\n\n        return examples\n\n    def get_train_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the training examples from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the training file has a different name than the original one\n                which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.train_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.train_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"train\")\n\n    def get_dev_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the evaluation examples from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the evaluation file has a different name than the original one\n                which is `dev-v1.1.json` and `dev-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.dev_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.dev_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"dev\")\n\n    def _create_examples(self, input_data, set_type):\n        is_training = set_type == \"train\"\n        examples = []\n        for entry in tqdm(input_data):\n            title = entry[\"title\"]\n            for paragraph in entry[\"paragraphs\"]:\n                context_text = paragraph[\"context\"]\n                for qa in paragraph[\"qas\"]:\n                    qas_id = qa[\"id\"]\n                    question_text = qa[\"question\"]\n                    start_position_character = None\n                    answer_text = None\n                    answers = []\n\n                    if \"is_impossible\" in qa:\n                        is_impossible = qa[\"is_impossible\"]\n                    else:\n                        is_impossible = False\n\n                    if not is_impossible:\n                        if is_training:\n                            answer = qa[\"answers\"][0]\n                            answer_text = answer[\"text\"]\n                            start_position_character = answer[\"answer_start\"]\n                        else:\n                            answers = qa[\"answers\"]\n\n                    example = SquadExample(\n                        qas_id=qas_id,\n                        question_text=question_text,\n                        context_text=context_text,\n                        answer_text=answer_text,\n   
                     start_position_character=start_position_character,\n                        title=title,\n                        is_impossible=is_impossible,\n                        answers=answers,\n                    )\n\n                    examples.append(example)\n        return examples\n\n\nclass SquadV1Processor(SquadProcessor):\n    train_file = \"train-v1.1.json\"\n    dev_file = \"dev-v1.1.json\"\n\n\nclass SquadV2Processor(SquadProcessor):\n    train_file = \"train-v2.0.json\"\n    dev_file = \"dev-v2.0.json\"\n\n\nclass SquadExample(object):\n    \"\"\"\n    A single training/test example for the Squad dataset, as loaded from disk.\n\n    Args:\n        qas_id: The example's unique identifier\n        question_text: The question string\n        context_text: The context string\n        answer_text: The answer string\n        start_position_character: The character position of the start of the answer\n        title: The title of the example\n        answers: None by default, this is used during evaluation. Holds answers as well as their start positions.\n        is_impossible: False by default, set to True if the example has no possible answer.\n    \"\"\"\n\n    def __init__(\n        self,\n        qas_id,\n        question_text,\n        context_text,\n        answer_text,\n        start_position_character,\n        title,\n        answers=[],\n        is_impossible=False,\n    ):\n        self.qas_id = qas_id\n        self.question_text = question_text\n        self.context_text = context_text\n        self.answer_text = answer_text\n        self.title = title\n        self.is_impossible = is_impossible\n        self.answers = answers\n\n        self.start_position, self.end_position = 0, 0\n\n        doc_tokens = []\n        char_to_word_offset = []\n        prev_is_whitespace = True\n\n        # Split on whitespace so that different tokens may be attributed to their original position.\n        for c in self.context_text:\n            if _is_whitespace(c):\n                prev_is_whitespace = True\n            else:\n                if prev_is_whitespace:\n                    doc_tokens.append(c)\n                else:\n                    doc_tokens[-1] += c\n                prev_is_whitespace = False\n            char_to_word_offset.append(len(doc_tokens) - 1)\n\n        self.doc_tokens = doc_tokens\n        self.char_to_word_offset = char_to_word_offset\n\n        # Start and end positions only has a value during evaluation.\n        if start_position_character is not None and not is_impossible:\n            self.start_position = char_to_word_offset[start_position_character]\n            self.end_position = char_to_word_offset[\n                min(start_position_character + len(answer_text) - 1, len(char_to_word_offset) - 1)\n            ]\n\n\nclass SquadFeatures(object):\n    \"\"\"\n    Single squad example features to be fed to a model.\n    Those features are model-specific and can be crafted from :class:`~transformers1.data.processors.squad.SquadExample`\n    using the :method:`~transformers1.data.processors.squad.squad_convert_examples_to_features` method.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n        token_type_ids: Segment token indices to indicate first and second portions of the inputs.\n        cls_index: the index of the CLS token.\n        p_mask: Mask identifying tokens that can be answers vs. 
tokens that cannot.\n            Mask with 1 for tokens than cannot be in the answer and 0 for token that can be in an answer\n        example_index: the index of the example\n        unique_id: The unique Feature identifier\n        paragraph_len: The length of the context\n        token_is_max_context: List of booleans identifying which tokens have their maximum context in this feature object.\n            If a token does not have their maximum context in this feature object, it means that another feature object\n            has more information related to that token and should be prioritized over this feature for that token.\n        tokens: list of tokens corresponding to the input ids\n        token_to_orig_map: mapping between the tokens and the original text, needed in order to identify the answer.\n        start_position: start of the answer token index\n        end_position: end of the answer token index\n    \"\"\"\n\n    def __init__(\n        self,\n        input_ids,\n        attention_mask,\n        token_type_ids,\n        cls_index,\n        p_mask,\n        example_index,\n        unique_id,\n        paragraph_len,\n        token_is_max_context,\n        tokens,\n        token_to_orig_map,\n        start_position,\n        end_position,\n        is_impossible,\n        qas_id: str = None,\n    ):\n        self.input_ids = input_ids\n        self.attention_mask = attention_mask\n        self.token_type_ids = token_type_ids\n        self.cls_index = cls_index\n        self.p_mask = p_mask\n\n        self.example_index = example_index\n        self.unique_id = unique_id\n        self.paragraph_len = paragraph_len\n        self.token_is_max_context = token_is_max_context\n        self.tokens = tokens\n        self.token_to_orig_map = token_to_orig_map\n\n        self.start_position = start_position\n        self.end_position = end_position\n        self.is_impossible = is_impossible\n        self.qas_id = qas_id\n\n\nclass SquadResult(object):\n    \"\"\"\n    Constructs a SquadResult which can be used to evaluate a model's output on the SQuAD dataset.\n\n    Args:\n        unique_id: The unique identifier corresponding to that example.\n        start_logits: The logits corresponding to the start of the answer\n        end_logits: The logits corresponding to the end of the answer\n    \"\"\"\n\n    def __init__(self, unique_id, start_logits, end_logits, start_top_index=None, end_top_index=None, cls_logits=None):\n        self.start_logits = start_logits\n        self.end_logits = end_logits\n        self.unique_id = unique_id\n\n        if start_top_index:\n            self.start_top_index = start_top_index\n            self.end_top_index = end_top_index\n            self.cls_logits = cls_logits\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/processors/utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport csv\nimport dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available, is_torch_available\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass InputExample:\n    \"\"\"\n    A single training/test example for simple sequence classification.\n\n    Args:\n        guid: Unique id for the example.\n        text_a: string. The untokenized text of the first sequence. For single\n            sequence tasks, only this sequence must be specified.\n        text_b: (Optional) string. The untokenized text of the second sequence.\n            Only must be specified for sequence pair tasks.\n        label: (Optional) string. The label of the example. This should be\n            specified for train and dev examples, but not for test examples.\n    \"\"\"\n\n    guid: str\n    text_a: str\n    text_b: Optional[str] = None\n    label: Optional[str] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2) + \"\\n\"\n\n\n@dataclass(frozen=True)\nclass InputFeatures:\n    \"\"\"\n    A single set of features of data.\n    Property names are the same names as the corresponding inputs to a model.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            Usually  ``1`` for tokens that are NOT MASKED, ``0`` for MASKED (padded) tokens.\n        token_type_ids: (Optional) Segment token indices to indicate first and second\n            portions of the inputs. Only some models use them.\n        label: (Optional) Label corresponding to the input. 
Int for classification problems,\n            float for regression problems.\n    \"\"\"\n\n    input_ids: List[int]\n    attention_mask: Optional[List[int]] = None\n    token_type_ids: Optional[List[int]] = None\n    label: Optional[Union[int, float]] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self)) + \"\\n\"\n\n\nclass DataProcessor:\n    \"\"\"Base class for data converters for sequence classification data sets.\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"Gets an example from a dict with tensorflow tensors\n        Args:\n            tensor_dict: Keys and values should match the corresponding Glue\n                tensorflow_dataset examples.\n        \"\"\"\n        raise NotImplementedError()\n\n    def get_train_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the train set.\"\"\"\n        raise NotImplementedError()\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the dev set.\"\"\"\n        raise NotImplementedError()\n\n    def get_test_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the test set.\"\"\"\n        raise NotImplementedError()\n\n    def get_labels(self):\n        \"\"\"Gets the list of labels for this data set.\"\"\"\n        raise NotImplementedError()\n\n    def tfds_map(self, example):\n        \"\"\"Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are.\n        This method converts examples to the correct format.\"\"\"\n        if len(self.get_labels()) > 1:\n            example.label = self.get_labels()[int(example.label)]\n        return example\n\n    @classmethod\n    def _read_tsv(cls, input_file, quotechar=None):\n        \"\"\"Reads a tab separated value file.\"\"\"\n        with open(input_file, \"r\", encoding=\"utf-8-sig\") as f:\n            return list(csv.reader(f, delimiter=\"\\t\", quotechar=quotechar))\n\n\nclass SingleSentenceClassificationProcessor(DataProcessor):\n    \"\"\" Generic processor for a single sentence classification data set.\"\"\"\n\n    def __init__(self, labels=None, examples=None, mode=\"classification\", verbose=False):\n        self.labels = [] if labels is None else labels\n        self.examples = [] if examples is None else examples\n        self.mode = mode\n        self.verbose = verbose\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, idx):\n        if isinstance(idx, slice):\n            return SingleSentenceClassificationProcessor(labels=self.labels, examples=self.examples[idx])\n        return self.examples[idx]\n\n    @classmethod\n    def create_from_csv(\n        cls, file_name, split_name=\"\", column_label=0, column_text=1, column_id=None, skip_first_row=False, **kwargs\n    ):\n        processor = cls(**kwargs)\n        processor.add_examples_from_csv(\n            file_name,\n            split_name=split_name,\n            column_label=column_label,\n            column_text=column_text,\n            column_id=column_id,\n            skip_first_row=skip_first_row,\n            overwrite_labels=True,\n            overwrite_examples=True,\n        )\n        return processor\n\n    @classmethod\n    def create_from_examples(cls, texts_or_text_and_labels, labels=None, **kwargs):\n        processor = cls(**kwargs)\n        processor.add_examples(texts_or_text_and_labels, 
labels=labels)\n        return processor\n\n    def add_examples_from_csv(\n        self,\n        file_name,\n        split_name=\"\",\n        column_label=0,\n        column_text=1,\n        column_id=None,\n        skip_first_row=False,\n        overwrite_labels=False,\n        overwrite_examples=False,\n    ):\n        lines = self._read_tsv(file_name)\n        if skip_first_row:\n            lines = lines[1:]\n        texts = []\n        labels = []\n        ids = []\n        for (i, line) in enumerate(lines):\n            texts.append(line[column_text])\n            labels.append(line[column_label])\n            if column_id is not None:\n                ids.append(line[column_id])\n            else:\n                guid = \"%s-%s\" % (split_name, i) if split_name else \"%s\" % i\n                ids.append(guid)\n\n        return self.add_examples(\n            texts, labels, ids, overwrite_labels=overwrite_labels, overwrite_examples=overwrite_examples\n        )\n\n    def add_examples(\n        self, texts_or_text_and_labels, labels=None, ids=None, overwrite_labels=False, overwrite_examples=False\n    ):\n        assert labels is None or len(texts_or_text_and_labels) == len(labels)\n        assert ids is None or len(texts_or_text_and_labels) == len(ids)\n        if ids is None:\n            ids = [None] * len(texts_or_text_and_labels)\n        if labels is None:\n            labels = [None] * len(texts_or_text_and_labels)\n        examples = []\n        added_labels = set()\n        for (text_or_text_and_label, label, guid) in zip(texts_or_text_and_labels, labels, ids):\n            if isinstance(text_or_text_and_label, (tuple, list)) and label is None:\n                text, label = text_or_text_and_label\n            else:\n                text = text_or_text_and_label\n            added_labels.add(label)\n            examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))\n\n        # Update examples\n        if overwrite_examples:\n            self.examples = examples\n        else:\n            self.examples.extend(examples)\n\n        # Update labels\n        if overwrite_labels:\n            self.labels = list(added_labels)\n        else:\n            self.labels = list(set(self.labels).union(added_labels))\n\n        return self.examples\n\n    def get_features(\n        self,\n        tokenizer,\n        max_length=None,\n        pad_on_left=False,\n        pad_token=0,\n        mask_padding_with_zero=True,\n        return_tensors=None,\n    ):\n        \"\"\"\n        Convert examples in a list of ``InputFeatures``\n\n        Args:\n            tokenizer: Instance of a tokenizer that will tokenize the examples\n            max_length: Maximum example length\n            task: GLUE task\n            label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n            output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n            pad_on_left: If set to ``True``, the examples will be padded on the left rather than on the right (default)\n            pad_token: Padding token\n            mask_padding_with_zero: If set to ``True``, the attention mask will be filled by ``1`` for actual values\n                and by ``0`` for padded values. 
If set to ``False``, inverts it (``1`` for padded values, ``0`` for\n                actual values)\n\n        Returns:\n            If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n            containing the task-specific features. If the input is a list of ``InputExamples``, will return\n            a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n        \"\"\"\n        if max_length is None:\n            max_length = tokenizer.max_len\n\n        label_map = {label: i for i, label in enumerate(self.labels)}\n\n        all_input_ids = []\n        for (ex_index, example) in enumerate(self.examples):\n            if ex_index % 10000 == 0:\n                logger.info(\"Tokenizing example %d\", ex_index)\n\n            input_ids = tokenizer.encode(\n                example.text_a, add_special_tokens=True, max_length=min(max_length, tokenizer.max_len),\n            )\n            all_input_ids.append(input_ids)\n\n        batch_length = max(len(input_ids) for input_ids in all_input_ids)\n\n        features = []\n        for (ex_index, (input_ids, example)) in enumerate(zip(all_input_ids, self.examples)):\n            if ex_index % 10000 == 0:\n                logger.info(\"Writing example %d/%d\" % (ex_index, len(self.examples)))\n            # The mask has 1 for real tokens and 0 for padding tokens. Only real\n            # tokens are attended to.\n            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)\n\n            # Zero-pad up to the sequence length.\n            padding_length = batch_length - len(input_ids)\n            if pad_on_left:\n                input_ids = ([pad_token] * padding_length) + input_ids\n                attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask\n            else:\n                input_ids = input_ids + ([pad_token] * padding_length)\n                attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)\n\n            assert len(input_ids) == batch_length, \"Error with input length {} vs {}\".format(\n                len(input_ids), batch_length\n            )\n            assert len(attention_mask) == batch_length, \"Error with input length {} vs {}\".format(\n                len(attention_mask), batch_length\n            )\n\n            if self.mode == \"classification\":\n                label = label_map[example.label]\n            elif self.mode == \"regression\":\n                label = float(example.label)\n            else:\n                raise ValueError(self.mode)\n\n            if ex_index < 5 and self.verbose:\n                logger.info(\"*** Example ***\")\n                logger.info(\"guid: %s\" % (example.guid))\n                logger.info(\"input_ids: %s\" % \" \".join([str(x) for x in input_ids]))\n                logger.info(\"attention_mask: %s\" % \" \".join([str(x) for x in attention_mask]))\n                logger.info(\"label: %s (id = %d)\" % (example.label, label))\n\n            features.append(InputFeatures(input_ids=input_ids, attention_mask=attention_mask, label=label))\n\n        if return_tensors is None:\n            return features\n        elif return_tensors == \"tf\":\n            if not is_tf_available():\n                raise RuntimeError(\"return_tensors set to 'tf' but TensorFlow 2.0 can't be imported\")\n            import tensorflow as tf\n\n            def gen():\n                for ex in features:\n                    yield 
({\"input_ids\": ex.input_ids, \"attention_mask\": ex.attention_mask}, ex.label)\n\n            dataset = tf.data.Dataset.from_generator(\n                gen,\n                ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32}, tf.int64),\n                ({\"input_ids\": tf.TensorShape([None]), \"attention_mask\": tf.TensorShape([None])}, tf.TensorShape([])),\n            )\n            return dataset\n        elif return_tensors == \"pt\":\n            if not is_torch_available():\n                raise RuntimeError(\"return_tensors set to 'pt' but PyTorch can't be imported\")\n            import torch\n            from torch.utils.data import TensorDataset\n\n            all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n            all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n            if self.mode == \"classification\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            elif self.mode == \"regression\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.float)\n\n            dataset = TensorDataset(all_input_ids, all_attention_mask, all_labels)\n            return dataset\n        else:\n            raise ValueError(\"return_tensors should be one of 'tf' or 'pt'\")\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/data/processors/xnli.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XNLI utils (dataset loading and evaluation) \"\"\"\n\n\nimport logging\nimport os\n\nfrom .utils import DataProcessor, InputExample\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass XnliProcessor(DataProcessor):\n    \"\"\"Processor for the XNLI dataset.\n    Adapted from https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/run_classifier.py#L207\"\"\"\n\n    def __init__(self, language, train_language=None):\n        self.language = language\n        self.train_language = train_language\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lg = self.language if self.train_language is None else self.train_language\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-MT-1.0/multinli/multinli.train.{}.tsv\".format(lg)))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (\"train\", i)\n            text_a = line[0]\n            text_b = line[1]\n            label = \"contradiction\" if line[2] == \"contradictory\" else line[2]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-1.0/xnli.test.tsv\"))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            language = line[0]\n            if language != self.language:\n                continue\n            guid = \"%s-%s\" % (\"test\", i)\n            text_a = line[6]\n            text_b = line[7]\n            label = line[1]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n\nxnli_processors = {\n    \"xnli\": XnliProcessor,\n}\n\nxnli_output_modes = {\n    \"xnli\": \"classification\",\n}\n\nxnli_tasks_num_labels = {\n    \"xnli\": 3,\n}\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/file.py",
    "content": ""
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/file_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport fnmatch\nimport json\nimport logging\nimport os\nimport shutil\nimport sys\nimport tarfile\nimport tempfile\nfrom contextlib import contextmanager\nfrom functools import partial, wraps\nfrom hashlib import sha256\nfrom pathlib import Path\nfrom typing import Optional\nfrom urllib.parse import urlparse\nfrom zipfile import ZipFile, is_zipfile\n\nimport requests\nfrom filelock import FileLock\nfrom tqdm.auto import tqdm\n\nfrom . import __version__\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n    if USE_TORCH in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TF not in (\"1\", \"ON\", \"YES\"):\n        import torch\n\n        _torch_available = True  # pylint: disable=invalid-name\n        logger.info(\"PyTorch version {} available.\".format(torch.__version__))\n    else:\n        logger.info(\"Disabling PyTorch because USE_TF is set\")\n        _torch_available = False\nexcept ImportError:\n    _torch_available = False  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n\n    if USE_TF in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TORCH not in (\"1\", \"ON\", \"YES\"):\n        import tensorflow as tf\n\n        assert hasattr(tf, \"__version__\") and int(tf.__version__[0]) >= 2\n        _tf_available = True  # pylint: disable=invalid-name\n        logger.info(\"TensorFlow version {} available.\".format(tf.__version__))\n    else:\n        logger.info(\"Disabling Tensorflow because USE_TORCH is set\")\n        _tf_available = False\nexcept (ImportError, AssertionError):\n    _tf_available = False  # pylint: disable=invalid-name\n\n\ntry:\n    from torch.hub import _get_torch_home\n\n    torch_cache_home = _get_torch_home()\nexcept ImportError:\n    torch_cache_home = os.path.expanduser(\n        os.getenv(\"TORCH_HOME\", os.path.join(os.getenv(\"XDG_CACHE_HOME\", \"~/.cache\"), \"torch\"))\n    )\ndefault_cache_path = os.path.join(torch_cache_home, \"transformers1\")\n\n\nPYTORCH_PRETRAINED_BERT_CACHE = os.getenv(\"PYTORCH_PRETRAINED_BERT_CACHE\", default_cache_path)\nPYTORCH_TRANSFORMERS_CACHE = os.getenv(\"PYTORCH_TRANSFORMERS_CACHE\", PYTORCH_PRETRAINED_BERT_CACHE)\nTRANSFORMERS_CACHE = os.getenv(\"TRANSFORMERS_CACHE\", PYTORCH_TRANSFORMERS_CACHE)\n\nWEIGHTS_NAME = \"pytorch_model.bin\"\nTF2_WEIGHTS_NAME = \"tf_model.h5\"\nTF_WEIGHTS_NAME = \"model.ckpt\"\nCONFIG_NAME = \"config.json\"\nMODEL_CARD_NAME = \"modelcard.json\"\n\n\nMULTIPLE_CHOICE_DUMMY_INPUTS = [[[0], [1]], [[0], [1]]]\nDUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]\nDUMMY_MASK = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]\n\nS3_BUCKET_PREFIX = \"https://s3.amazonaws.com/models.huggingface.co/bert\"\nCLOUDFRONT_DISTRIB_PREFIX = \"https://cdn.huggingface.co\"\n\n\ndef is_torch_available():\n    return _torch_available\n\n\ndef is_tf_available():\n    return _tf_available\n\n\ndef add_start_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef 
add_start_docstrings_to_callable(*docstr):\n    def docstring_decorator(fn):\n        class_name = \":class:`~transformers1.{}`\".format(fn.__qualname__.split(\".\")[0])\n        intro = \"   The {} forward method, overrides the :func:`__call__` special method.\".format(class_name)\n        note = r\"\"\"\n\n    .. note::\n        Although the recipe for forward pass needs to be defined within\n        this function, one should call the :class:`Module` instance afterwards\n        instead of this since the former takes care of running the\n        pre and post processing steps while the latter silently ignores them.\n        \"\"\"\n        fn.__doc__ = intro + note + \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef add_end_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = fn.__doc__ + \"\".join(docstr)\n        return fn\n\n    return docstring_decorator\n\n\ndef is_remote_url(url_or_filename):\n    parsed = urlparse(url_or_filename)\n    return parsed.scheme in (\"http\", \"https\")\n\n\ndef hf_bucket_url(model_id: str, filename: str, use_cdn=True) -> str:\n    \"\"\"\n    Resolve a model identifier, and a file name, to a HF-hosted url\n    on either S3 or Cloudfront (a Content Delivery Network, or CDN).\n\n    Cloudfront is replicated over the globe so downloads are way faster\n    for the end user (and it also lowers our bandwidth costs). However, it\n    is more aggressively cached by default, so may not always reflect the\n    latest changes to the underlying file (default TTL is 24 hours).\n\n    In terms of client-side caching from this library, even though\n    Cloudfront relays the ETags from S3, using one or the other\n    (or switching from one to the other) will affect caching: cached files\n    are not shared between the two because the cached file's name contains\n    a hash of the url.\n    \"\"\"\n    endpoint = CLOUDFRONT_DISTRIB_PREFIX if use_cdn else S3_BUCKET_PREFIX\n    legacy_format = \"/\" not in model_id\n    if legacy_format:\n        return f\"{endpoint}/{model_id}-{filename}\"\n    else:\n        return f\"{endpoint}/{model_id}/{filename}\"\n\n\ndef url_to_filename(url, etag=None):\n    \"\"\"\n    Convert `url` into a hashed filename in a repeatable way.\n    If `etag` is specified, append its hash to the url's, delimited\n    by a period.\n    If the url ends with .h5 (Keras HDF5 weights) adds '.h5' to the name\n    so that TF 2.0 can identify it as a HDF5 file\n    (see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1380)\n    \"\"\"\n    url_bytes = url.encode(\"utf-8\")\n    url_hash = sha256(url_bytes)\n    filename = url_hash.hexdigest()\n\n    if etag:\n        etag_bytes = etag.encode(\"utf-8\")\n        etag_hash = sha256(etag_bytes)\n        filename += \".\" + etag_hash.hexdigest()\n\n    if url.endswith(\".h5\"):\n        filename += \".h5\"\n\n    return filename\n\n\ndef filename_to_url(filename, cache_dir=None):\n    \"\"\"\n    Return the url and etag (which may be ``None``) stored for `filename`.\n    Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    cache_path = os.path.join(cache_dir, filename)\n    if not os.path.exists(cache_path):\n        raise 
EnvironmentError(\"file {} not found\".format(cache_path))\n\n    meta_path = cache_path + \".json\"\n    if not os.path.exists(meta_path):\n        raise EnvironmentError(\"file {} not found\".format(meta_path))\n\n    with open(meta_path, encoding=\"utf-8\") as meta_file:\n        metadata = json.load(meta_file)\n    url = metadata[\"url\"]\n    etag = metadata[\"etag\"]\n\n    return url, etag\n\n\ndef cached_path(\n    url_or_filename,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    resume_download=False,\n    user_agent=None,\n    extract_compressed_file=False,\n    force_extract=False,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given something that might be a URL (or might be a local path),\n    determine which. If it's a URL, download the file and cache it, and\n    return the path to the cached file. If it's already a local path,\n    make sure the file exists and then return the path.\n    Args:\n        cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).\n        force_download: if True, re-dowload the file even if it's already cached in the cache dir.\n        resume_download: if True, resume the download if incompletly recieved file is found.\n        user_agent: Optional string or dict that will be appended to the user-agent on remote requests.\n        extract_compressed_file: if True and the path point to a zip or tar file, extract the compressed\n            file in a folder along the archive.\n        force_extract: if True when extract_compressed_file is True and the archive was already extracted,\n            re-extract the archive and overide the folder where it was extracted.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(url_or_filename, Path):\n        url_or_filename = str(url_or_filename)\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    if is_remote_url(url_or_filename):\n        # URL, so get it from the cache (downloading if necessary)\n        output_path = get_from_cache(\n            url_or_filename,\n            cache_dir=cache_dir,\n            force_download=force_download,\n            proxies=proxies,\n            resume_download=resume_download,\n            user_agent=user_agent,\n            local_files_only=local_files_only,\n        )\n    elif os.path.exists(url_or_filename):\n        # File, and it exists.\n        output_path = url_or_filename\n    elif urlparse(url_or_filename).scheme == \"\":\n        # File, but it doesn't exist.\n        raise EnvironmentError(\"file {} not found\".format(url_or_filename))\n    else:\n        # Something unknown\n        raise ValueError(\"unable to parse {} as a URL or as a local path\".format(url_or_filename))\n\n    if extract_compressed_file:\n        if not is_zipfile(output_path) and not tarfile.is_tarfile(output_path):\n            return output_path\n\n        # Path where we extract compressed archives\n        # We avoid '.' 
in dir name and add \"-extracted\" at the end: \"./model.zip\" => \"./model-zip-extracted/\"\n        output_dir, output_file = os.path.split(output_path)\n        output_extract_dir_name = output_file.replace(\".\", \"-\") + \"-extracted\"\n        output_path_extracted = os.path.join(output_dir, output_extract_dir_name)\n\n        if os.path.isdir(output_path_extracted) and os.listdir(output_path_extracted) and not force_extract:\n            return output_path_extracted\n\n        # Prevent parallel extractions\n        lock_path = output_path + \".lock\"\n        with FileLock(lock_path):\n            shutil.rmtree(output_path_extracted, ignore_errors=True)\n            os.makedirs(output_path_extracted)\n            if is_zipfile(output_path):\n                with ZipFile(output_path, \"r\") as zip_file:\n                    zip_file.extractall(output_path_extracted)\n                    zip_file.close()\n            elif tarfile.is_tarfile(output_path):\n                tar_file = tarfile.open(output_path)\n                tar_file.extractall(output_path_extracted)\n                tar_file.close()\n            else:\n                raise EnvironmentError(\"Archive format of {} could not be identified\".format(output_path))\n\n        return output_path_extracted\n\n    return output_path\n\n\ndef http_get(url, temp_file, proxies=None, resume_size=0, user_agent=None):\n    ua = \"transformers1/{}; python/{}\".format(__version__, sys.version.split()[0])\n    if is_torch_available():\n        ua += \"; torch/{}\".format(torch.__version__)\n    if is_tf_available():\n        ua += \"; tensorflow/{}\".format(tf.__version__)\n    if isinstance(user_agent, dict):\n        ua += \"; \" + \"; \".join(\"{}/{}\".format(k, v) for k, v in user_agent.items())\n    elif isinstance(user_agent, str):\n        ua += \"; \" + user_agent\n    headers = {\"user-agent\": ua}\n    if resume_size > 0:\n        headers[\"Range\"] = \"bytes=%d-\" % (resume_size,)\n    response = requests.get(url, stream=True, proxies=proxies, headers=headers)\n    if response.status_code == 416:  # Range not satisfiable\n        return\n    content_length = response.headers.get(\"Content-Length\")\n    total = resume_size + int(content_length) if content_length is not None else None\n    progress = tqdm(\n        unit=\"B\",\n        unit_scale=True,\n        total=total,\n        initial=resume_size,\n        desc=\"Downloading\",\n        disable=bool(logger.getEffectiveLevel() == logging.NOTSET),\n    )\n    for chunk in response.iter_content(chunk_size=1024):\n        if chunk:  # filter out keep-alive new chunks\n            progress.update(len(chunk))\n            temp_file.write(chunk)\n    progress.close()\n\n\ndef get_from_cache(\n    url,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    etag_timeout=10,\n    resume_download=False,\n    user_agent=None,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given a URL, look for the corresponding file in the local cache.\n    If it's not there, download it. 
Then return the path to the cached file.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    os.makedirs(cache_dir, exist_ok=True)\n\n    etag = None\n    if not local_files_only:\n        try:\n            response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)\n            if response.status_code == 200:\n                etag = response.headers.get(\"ETag\")\n        except (EnvironmentError, requests.exceptions.Timeout):\n            # etag is already None\n            pass\n\n    filename = url_to_filename(url, etag)\n\n    # get cache path to put the file\n    cache_path = os.path.join(cache_dir, filename)\n\n    # etag is None = we don't have a connection, or url doesn't exist, or is otherwise inaccessible.\n    # try to get the last downloaded one\n    if etag is None:\n        if os.path.exists(cache_path):\n            return cache_path\n        else:\n            matching_files = [\n                file\n                for file in fnmatch.filter(os.listdir(cache_dir), filename + \".*\")\n                if not file.endswith(\".json\") and not file.endswith(\".lock\")\n            ]\n            if len(matching_files) > 0:\n                return os.path.join(cache_dir, matching_files[-1])\n            else:\n                # If files cannot be found and local_files_only=True,\n                # the models might've been found if local_files_only=False\n                # Notify the user about that\n                if local_files_only:\n                    raise ValueError(\n                        \"Cannot find the requested files in the cached path and outgoing traffic has been\"\n                        \" disabled. 
To enable model look-ups and downloads online, set 'local_files_only'\"\n                        \" to False.\"\n                    )\n                return None\n\n    # From now on, etag is not None.\n    if os.path.exists(cache_path) and not force_download:\n        return cache_path\n\n    # Prevent parallel downloads of the same file with a lock.\n    lock_path = cache_path + \".lock\"\n    with FileLock(lock_path):\n\n        # If the download just completed while the lock was activated.\n        if os.path.exists(cache_path) and not force_download:\n            # Even if returning early like here, the lock will be released.\n            return cache_path\n\n        if resume_download:\n            incomplete_path = cache_path + \".incomplete\"\n\n            @contextmanager\n            def _resumable_file_manager():\n                with open(incomplete_path, \"a+b\") as f:\n                    yield f\n\n            temp_file_manager = _resumable_file_manager\n            if os.path.exists(incomplete_path):\n                resume_size = os.stat(incomplete_path).st_size\n            else:\n                resume_size = 0\n        else:\n            temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False)\n            resume_size = 0\n\n        # Download to temporary file, then copy to cache dir once finished.\n        # Otherwise you get corrupt cache entries if the download gets interrupted.\n        with temp_file_manager() as temp_file:\n            logger.info(\"%s not found in cache or force_download set to True, downloading to %s\", url, temp_file.name)\n\n            http_get(url, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)\n\n        logger.info(\"storing %s in cache at %s\", url, cache_path)\n        os.replace(temp_file.name, cache_path)\n\n        logger.info(\"creating metadata file for %s\", cache_path)\n        meta = {\"url\": url, \"etag\": etag}\n        meta_path = cache_path + \".json\"\n        with open(meta_path, \"w\") as meta_file:\n            json.dump(meta, meta_file)\n\n    return cache_path\n\n\nclass cached_property(property):\n    \"\"\"\n    Descriptor that mimics @property but caches output in member variable.\n\n    From tensorflow_datasets\n\n    Built-in in functools from Python 3.8.\n    \"\"\"\n\n    def __get__(self, obj, objtype=None):\n        # See docs.python.org/3/howto/descriptor.html#properties\n        if obj is None:\n            return self\n        if self.fget is None:\n            raise AttributeError(\"unreadable attribute\")\n        attr = \"__cached_\" + self.fget.__name__\n        cached = getattr(obj, attr, None)\n        if cached is None:\n            cached = self.fget(obj)\n            setattr(obj, attr, cached)\n        return cached\n\n\ndef torch_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_torch_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires PyTorch.\")\n\n    return wrapper\n\n\ndef tf_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_tf_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires TF.\")\n\n    return wrapper\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/filep.py",
    "content": "from transformers import GPT2LMHeadModel, GPT2Tokenizer\nimport torch\n\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\nmodel = GPT2LMHeadModel.from_pretrained('gpt2')\n\ngenerated = tokenizer.encode(\"The Manhattan bridge\")\ncontext = torch.tensor([generated])\npast = None\n\nfor i in range(15):\n    output, past = model(context, past=past)\n\n    distribution = output[0, :]\n\n    # Get the top 10 values' indices and cast them to a list\n    top_values = distribution[-1].topk(10).indices.tolist()\n\n    # Decode those into words\n    top_words = [tokenizer.decode([x]) for x in top_values.indices.tolist()]\n\n    # select words (only arbitrarily select the first three)\n    words = words[0:3]\n\n    # Cast them back to tokens which can be used as an added token\n    selected_tokens = [tokenizer.encode(word) for word in words]\n\n    generated += [argmax_token.tolist()]\n    context = argmax_token.unsqueeze(0)\n\n    print(tokenizer.decode([argmax_token.tolist()]))\n\nsequence = tokenizer.decode(generated)\n\nprint(sequence)"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/hf_api.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport io\nimport os\nfrom os.path import expanduser\nfrom typing import Dict, List, Optional, Tuple\n\nimport requests\nfrom tqdm import tqdm\n\n\nENDPOINT = \"https://huggingface.co\"\n\n\nclass S3Obj:\n    \"\"\"\n    Data structure that represents a file belonging to the current user.\n    \"\"\"\n\n    def __init__(self, filename: str, LastModified: str, ETag: str, Size: int, **kwargs):\n        self.filename = filename\n        self.LastModified = LastModified\n        self.ETag = ETag\n        self.Size = Size\n\n\nclass PresignedUrl:\n    def __init__(self, write: str, access: str, type: str, **kwargs):\n        self.write = write\n        self.access = access\n        self.type = type  # mime-type to send to S3.\n\n\nclass S3Object:\n    \"\"\"\n    Data structure that represents a public file accessible on our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        key: str,  # S3 object key\n        etag: str,\n        lastModified: str,\n        size: int,\n        rfilename: str,  # filename relative to config.json\n        **kwargs\n    ):\n        self.key = key\n        self.etag = etag\n        self.lastModified = lastModified\n        self.size = size\n        self.rfilename = rfilename\n\n\nclass ModelInfo:\n    \"\"\"\n    Info about a public model accessible from our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        modelId: str,  # id of model\n        key: str,  # S3 object key of config.json\n        author: Optional[str] = None,\n        downloads: Optional[int] = None,\n        tags: List[str] = [],\n        siblings: List[Dict] = [],  # list of files that constitute the model\n        **kwargs\n    ):\n        self.modelId = modelId\n        self.key = key\n        self.author = author\n        self.downloads = downloads\n        self.tags = tags\n        self.siblings = [S3Object(**x) for x in siblings]\n\n\nclass HfApi:\n    def __init__(self, endpoint=None):\n        self.endpoint = endpoint if endpoint is not None else ENDPOINT\n\n    def login(self, username: str, password: str) -> str:\n        \"\"\"\n        Call HF API to sign in a user and get a token if credentials are valid.\n\n        Outputs:\n            token if credentials are valid\n\n        Throws:\n            requests.exceptions.HTTPError if credentials are invalid\n        \"\"\"\n        path = \"{}/api/login\".format(self.endpoint)\n        r = requests.post(path, json={\"username\": username, \"password\": password})\n        r.raise_for_status()\n        d = r.json()\n        return d[\"token\"]\n\n    def whoami(self, token: str) -> Tuple[str, List[str]]:\n        \"\"\"\n        Call HF API to know \"whoami\"\n        \"\"\"\n        path = \"{}/api/whoami\".format(self.endpoint)\n        r = requests.get(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        
return d[\"user\"], d[\"orgs\"]\n\n    def logout(self, token: str) -> None:\n        \"\"\"\n        Call HF API to log out.\n        \"\"\"\n        path = \"{}/api/logout\".format(self.endpoint)\n        r = requests.post(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n\n    def presign(self, token: str, filename: str, organization: Optional[str] = None) -> PresignedUrl:\n        \"\"\"\n        Call HF API to get a presigned url to upload `filename` to S3.\n        \"\"\"\n        path = \"{}/api/presign\".format(self.endpoint)\n        r = requests.post(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n        d = r.json()\n        return PresignedUrl(**d)\n\n    def presign_and_upload(self, token: str, filename: str, filepath: str, organization: Optional[str] = None) -> str:\n        \"\"\"\n        Get a presigned url, then upload file to S3.\n\n        Outputs:\n            url: Read-only url for the stored file on S3.\n        \"\"\"\n        urls = self.presign(token, filename=filename, organization=organization)\n        # streaming upload:\n        # https://2.python-requests.org/en/master/user/advanced/#streaming-uploads\n        #\n        # Even though we presign with the correct content-type,\n        # the client still has to specify it when uploading the file.\n        with open(filepath, \"rb\") as f:\n            pf = TqdmProgressFileReader(f)\n            data = f if pf.total_size > 0 else \"\"\n\n            r = requests.put(urls.write, data=data, headers={\"content-type\": urls.type})\n            r.raise_for_status()\n            pf.close()\n        return urls.access\n\n    def list_objs(self, token: str, organization: Optional[str] = None) -> List[S3Obj]:\n        \"\"\"\n        Call HF API to list all stored files for user (or one of their organizations).\n        \"\"\"\n        path = \"{}/api/listObjs\".format(self.endpoint)\n        params = {\"organization\": organization} if organization is not None else None\n        r = requests.get(path, params=params, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        return [S3Obj(**x) for x in d]\n\n    def delete_obj(self, token: str, filename: str, organization: Optional[str] = None):\n        \"\"\"\n        Call HF API to delete a file stored by user\n        \"\"\"\n        path = \"{}/api/deleteObj\".format(self.endpoint)\n        r = requests.delete(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n\n    def model_list(self) -> List[ModelInfo]:\n        \"\"\"\n        Get the public list of all the models on huggingface, including the community models\n        \"\"\"\n        path = \"{}/api/models\".format(self.endpoint)\n        r = requests.get(path)\n        r.raise_for_status()\n        d = r.json()\n        return [ModelInfo(**x) for x in d]\n\n\nclass TqdmProgressFileReader:\n    \"\"\"\n    Wrap an io.BufferedReader `f` (such as the output of `open(…, \"rb\")`)\n    and override `f.read()` so as to display a tqdm progress bar.\n\n    see github.com/huggingface/transformers1/pull/2078#discussion_r354739608\n    for implementation details.\n    \"\"\"\n\n    def __init__(self, 
f: io.BufferedReader):\n        self.f = f\n        self.total_size = os.fstat(f.fileno()).st_size\n        self.pbar = tqdm(total=self.total_size, leave=False)\n        self.read = f.read\n        f.read = self._read\n\n    def _read(self, n=-1):\n        self.pbar.update(n)\n        return self.read(n)\n\n    def close(self):\n        self.pbar.close()\n\n\nclass HfFolder:\n    path_token = expanduser(\"~/.huggingface/token\")\n\n    @classmethod\n    def save_token(cls, token):\n        \"\"\"\n        Save token, creating folder as needed.\n        \"\"\"\n        os.makedirs(os.path.dirname(cls.path_token), exist_ok=True)\n        with open(cls.path_token, \"w+\") as f:\n            f.write(token)\n\n    @classmethod\n    def get_token(cls):\n        \"\"\"\n        Get token or None if not existent.\n        \"\"\"\n        try:\n            with open(cls.path_token, \"r\") as f:\n                return f.read()\n        except FileNotFoundError:\n            pass\n\n    @classmethod\n    def delete_token(cls):\n        \"\"\"\n        Delete token.\n        Do not fail if token does not exist.\n        \"\"\"\n        try:\n            os.remove(cls.path_token)\n        except FileNotFoundError:\n            pass\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/hf_argparser.py",
    "content": "import dataclasses\nimport json\nimport sys\nfrom argparse import ArgumentParser\nfrom enum import Enum\nfrom pathlib import Path\nfrom typing import Any, Iterable, List, NewType, Tuple, Union\n\n\nDataClass = NewType(\"DataClass\", Any)\nDataClassType = NewType(\"DataClassType\", Any)\n\n\nclass HfArgumentParser(ArgumentParser):\n    \"\"\"\n    This subclass of `argparse.ArgumentParser` uses type hints on dataclasses\n    to generate arguments.\n\n    The class is designed to play well with the native argparse. In particular,\n    you can add more (non-dataclass backed) arguments to the parser after initialization\n    and you'll get the output back after parsing as an additional namespace.\n    \"\"\"\n\n    dataclass_types: Iterable[DataClassType]\n\n    def __init__(self, dataclass_types: Union[DataClassType, Iterable[DataClassType]], **kwargs):\n        \"\"\"\n        Args:\n            dataclass_types:\n                Dataclass type, or list of dataclass types for which we will \"fill\" instances\n                with the parsed args.\n            kwargs:\n                (Optional) Passed to `argparse.ArgumentParser()` in the regular way.\n        \"\"\"\n        super().__init__(**kwargs)\n        if dataclasses.is_dataclass(dataclass_types):\n            dataclass_types = [dataclass_types]\n        self.dataclass_types = dataclass_types\n        for dtype in self.dataclass_types:\n            self._add_dataclass_arguments(dtype)\n\n    def _add_dataclass_arguments(self, dtype: DataClassType):\n        for field in dataclasses.fields(dtype):\n            field_name = f\"--{field.name}\"\n            kwargs = field.metadata.copy()\n            # field.metadata is not used at all by Data Classes,\n            # it is provided as a third-party extension mechanism.\n            if isinstance(field.type, str):\n                raise ImportError(\n                    \"This implementation is not compatible with Postponed Evaluation of Annotations (PEP 563),\"\n                    \"which can be opted in from Python 3.7 with `from __future__ import annotations`.\"\n                    \"We will add compatibility when Python 3.9 is released.\"\n                )\n            typestring = str(field.type)\n            for prim_type in (int, float, str):\n                for collection in (List,):\n                    if typestring == f\"typing.Union[{collection[prim_type]}, NoneType]\":\n                        field.type = collection[prim_type]\n                if typestring == f\"typing.Union[{prim_type.__name__}, NoneType]\":\n                    field.type = prim_type\n\n            if isinstance(field.type, type) and issubclass(field.type, Enum):\n                kwargs[\"choices\"] = list(field.type)\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n            elif field.type is bool:\n                kwargs[\"action\"] = \"store_false\" if field.default is True else \"store_true\"\n                if field.default is True:\n                    field_name = f\"--no-{field.name}\"\n                    kwargs[\"dest\"] = field.name\n            elif hasattr(field.type, \"__origin__\") and issubclass(field.type.__origin__, List):\n                kwargs[\"nargs\"] = \"+\"\n                kwargs[\"type\"] = field.type.__args__[0]\n                assert all(\n                    x == kwargs[\"type\"] for x in field.type.__args__\n                ), \"{} 
cannot be a List of mixed types\".format(field.name)\n                if field.default_factory is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default_factory()\n            else:\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n                else:\n                    kwargs[\"required\"] = True\n            self.add_argument(field_name, **kwargs)\n\n    def parse_args_into_dataclasses(\n        self, args=None, return_remaining_strings=False, look_for_args_file=True\n    ) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Parse command-line args into instances of the specified dataclass types.\n\n        This relies on argparse's `ArgumentParser.parse_known_args`.\n        See the doc at:\n        docs.python.org/3.7/library/argparse.html#argparse.ArgumentParser.parse_args\n\n        Args:\n            args:\n                List of strings to parse. The default is taken from sys.argv.\n                (same as argparse.ArgumentParser)\n            return_remaining_strings:\n                If true, also return a list of remaining argument strings.\n            look_for_args_file:\n                If true, will look for a \".args\" file with the same base name\n                as the entry point script for this process, and will append its\n                potential content to the command line args.\n\n        Returns:\n            Tuple consisting of:\n                - the dataclass instances in the same order as they\n                  were passed to the initializer.abspath\n                - if applicable, an additional namespace for more\n                  (non-dataclass backed) arguments added to the parser\n                  after initialization.\n                - The potential list of remaining argument strings.\n                  (same as argparse.ArgumentParser.parse_known_args)\n        \"\"\"\n        if look_for_args_file and len(sys.argv):\n            args_file = Path(sys.argv[0]).with_suffix(\".args\")\n            if args_file.exists():\n                fargs = args_file.read_text().split()\n                args = fargs + args if args is not None else fargs + sys.argv[1:]\n                # in case of duplicate arguments the first one has precedence\n                # so we append rather than prepend.\n        namespace, remaining_args = self.parse_known_args(args=args)\n        outputs = []\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in vars(namespace).items() if k in keys}\n            for k in keys:\n                delattr(namespace, k)\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        if len(namespace.__dict__) > 0:\n            # additional namespace.\n            outputs.append(namespace)\n        if return_remaining_strings:\n            return (*outputs, remaining_args)\n        else:\n            if remaining_args:\n                raise ValueError(f\"Some specified arguments are not used by the HfArgumentParser: {remaining_args}\")\n\n            return (*outputs,)\n\n    def parse_json_file(self, json_file: str) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Alternative helper method that does not use `argparse` at all,\n        instead loading a json file and populating the dataclass types.\n        \"\"\"\n        data = json.loads(Path(json_file).read_text())\n        outputs = 
[]\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in data.items() if k in keys}\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        return (*outputs,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modelcard.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    TF2_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModelCard:\n    r\"\"\" Structured Model Card class.\n        Store model card as well as methods for loading/downloading/saving model cards.\n\n        Please read the following paper for details and explanation on the sections:\n            \"Model Cards for Model Reporting\"\n                by Margaret Mitchell, Simone Wu,\n                Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer,\n                Inioluwa Deborah Raji and Timnit Gebru for the proposal behind model cards.\n            Link: https://arxiv.org/abs/1810.03993\n\n        Note:\n            A model card can be loaded and saved to disk.\n\n        Parameters:\n    \"\"\"\n\n    def __init__(self, **kwargs):\n        # Recomended attributes from https://arxiv.org/abs/1810.03993 (see papers)\n        self.model_details = kwargs.pop(\"model_details\", {})\n        self.intended_use = kwargs.pop(\"intended_use\", {})\n        self.factors = kwargs.pop(\"factors\", {})\n        self.metrics = kwargs.pop(\"metrics\", {})\n        self.evaluation_data = kwargs.pop(\"evaluation_data\", {})\n        self.training_data = kwargs.pop(\"training_data\", {})\n        self.quantitative_analyses = kwargs.pop(\"quantitative_analyses\", {})\n        self.ethical_considerations = kwargs.pop(\"ethical_considerations\", {})\n        self.caveats_and_recommendations = kwargs.pop(\"caveats_and_recommendations\", {})\n\n        # Open additional attributes\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    def save_pretrained(self, save_directory_or_file):\n        \"\"\" Save a model card object to the directory or file `save_directory_or_file`.\n        \"\"\"\n        if os.path.isdir(save_directory_or_file):\n            # If we save using the predefined names, we can load using `from_pretrained`\n            output_model_card_file = os.path.join(save_directory_or_file, MODEL_CARD_NAME)\n        else:\n            output_model_card_file = save_directory_or_file\n\n        self.to_json_file(output_model_card_file)\n        logger.info(\"Model card saved in {}\".format(output_model_card_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiate a :class:`~transformers1.ModelCard` from a pre-trained 
model model card.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model card to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model card that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing a model card file saved using the :func:`~transformers1.ModelCard.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - a path or url to a saved model card JSON `file`, e.g.: ``./my_model_directory/modelcard.json``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                card should be cached if the standard cache should not be used.\n\n            kwargs: (`optional`) dict: key/value pairs with which to update the ModelCard object after loading.\n\n                - The values in kwargs of any keys which are model card attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* model card attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            find_from_standard_name: (`optional`) boolean, default True:\n                If the pretrained_model_name_or_path ends with our standard model or config filenames, replace them with our standard modelcard filename.\n                Can be used to directly feed a model/config url and access the colocated modelcard.\n\n            return_unused_kwargs: (`optional`) bool:\n\n                - If False, then this function returns just the final model card object.\n                - If True, then this functions returns a tuple `(model card, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not model card attributes: ie the part of kwargs which has not been used to update `ModelCard` and is otherwise ignored.\n\n        Examples::\n\n            modelcard = ModelCard.from_pretrained('bert-base-uncased')    # Download model card from S3 and cache.\n            modelcard = ModelCard.from_pretrained('./test/saved_model/')  # E.g. model card was saved using `save_pretrained('./test/saved_model/')`\n            modelcard = ModelCard.from_pretrained('./test/saved_model/modelcard.json')\n            modelcard = ModelCard.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        proxies = kwargs.pop(\"proxies\", None)\n        find_from_standard_name = kwargs.pop(\"find_from_standard_name\", True)\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        if pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            # For simplicity we use the same pretrained url than the configuration files\n            # but with a different suffix (modelcard.json). 
This suffix is replaced below.\n            model_card_file = ALL_PRETRAINED_CONFIG_ARCHIVE_MAP[pretrained_model_name_or_path]\n        elif os.path.isdir(pretrained_model_name_or_path):\n            model_card_file = os.path.join(pretrained_model_name_or_path, MODEL_CARD_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            model_card_file = pretrained_model_name_or_path\n        else:\n            model_card_file = hf_bucket_url(pretrained_model_name_or_path, filename=MODEL_CARD_NAME, use_cdn=False)\n\n        if find_from_standard_name or pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            model_card_file = model_card_file.replace(CONFIG_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(WEIGHTS_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(TF2_WEIGHTS_NAME, MODEL_CARD_NAME)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_model_card_file = cached_path(\n                model_card_file, cache_dir=cache_dir, force_download=True, proxies=proxies, resume_download=False\n            )\n            if resolved_model_card_file is None:\n                raise EnvironmentError\n            if resolved_model_card_file == model_card_file:\n                logger.info(\"loading model card file {}\".format(model_card_file))\n            else:\n                logger.info(\n                    \"loading model card file {} from cache at {}\".format(model_card_file, resolved_model_card_file)\n                )\n            # Load model card\n            modelcard = cls.from_json_file(resolved_model_card_file)\n\n        except (EnvironmentError, json.JSONDecodeError):\n            # We fall back on creating an empty model card\n            modelcard = cls()\n\n        # Update model card with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(modelcard, key):\n                setattr(modelcard, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model card: %s\", str(modelcard))\n        if return_unused_kwargs:\n            return modelcard, kwargs\n        else:\n            return modelcard\n\n    @classmethod\n    def from_dict(cls, json_object):\n        \"\"\"Constructs a `ModelCard` from a Python dictionary of parameters.\"\"\"\n        return cls(**json_object)\n\n    @classmethod\n    def from_json_file(cls, json_file):\n        \"\"\"Constructs a `ModelCard` from a json file of parameters.\"\"\"\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        dict_obj = json.loads(text)\n        return cls(**dict_obj)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return str(self.to_json_string())\n\n    def to_dict(self):\n        \"\"\"Serializes this instance to a Python dictionary.\"\"\"\n        output = copy.deepcopy(self.__dict__)\n        return output\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path):\n        \"\"\" Save this instance to a json file.\"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            
writer.write(self.to_json_string())\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch ALBERT model. \"\"\"\n\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import ACT2FN, BertEmbeddings, BertSelfAttention, prune_linear_layer\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\ndef load_tf_weights_in_albert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        print(name)\n\n    for name, array in zip(names, arrays):\n        original_name = name\n\n        # If saved from the TF HUB module\n        name = name.replace(\"module/\", \"\")\n\n        # Renaming and simplifying\n        name = name.replace(\"ffn_1\", \"ffn\")\n        name = name.replace(\"bert/\", \"albert/\")\n        name = name.replace(\"attention_1\", \"attention\")\n        name = name.replace(\"transform/\", \"\")\n        name = name.replace(\"LayerNorm_1\", \"full_layer_layer_norm\")\n        name = name.replace(\"LayerNorm\", \"attention/LayerNorm\")\n        name = name.replace(\"transformer/\", \"\")\n\n        # The feed forward layer had an 'intermediate' step which has been abstracted away\n        name = name.replace(\"intermediate/dense/\", \"\")\n        name = name.replace(\"ffn/intermediate/output/dense/\", \"ffn_output/\")\n\n        # ALBERT attention was split between self and output which have been abstracted away\n        name = name.replace(\"/output/\", \"/\")\n        name = name.replace(\"/self/\", \"/\")\n\n        # The pooler is a linear layer\n        name = name.replace(\"pooler/dense\", \"pooler\")\n\n        # The classifier was simplified to predictions from cls/predictions\n        name = name.replace(\"cls/predictions\", \"predictions\")\n        name = name.replace(\"predictions/attention\", \"predictions\")\n\n        # Naming was changed to be more explicit\n        name = name.replace(\"embeddings/attention\", \"embeddings\")\n        name = name.replace(\"inner_group_\", \"albert_layers/\")\n        name = name.replace(\"group_\", \"albert_layer_groups/\")\n\n        # Classifier\n        if len(name.split(\"/\")) == 1 and (\"output_bias\" in name or \"output_weights\" in name):\n            name = \"classifier/\" + name\n\n        # No ALBERT model currently handles the next sentence prediction task\n        if \"seq_relationship\" in name:\n            name = name.replace(\"seq_relationship/output_\", \"sop_classifier/classifier/\")\n            name = name.replace(\"weights\", \"weight\")\n\n        name = name.split(\"/\")\n\n        # Ignore the gradients applied by the LAMB/ADAM optimizers.\n        if (\n            \"adam_m\" in name\n            or \"adam_v\" in name\n            or \"AdamWeightDecayOptimizer\" in name\n            or \"AdamWeightDecayOptimizer_1\" in name\n            or \"global_step\" in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            
elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        print(\"Initialize PyTorch weight {} from {}\".format(name, original_name))\n        pointer.data = torch.from_numpy(array)\n\n    return model\n\n\nclass AlbertEmbeddings(BertEmbeddings):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n        self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass AlbertAttention(BertSelfAttention):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.output_attentions = config.output_attentions\n        self.num_attention_heads = config.num_attention_heads\n        self.hidden_size = config.hidden_size\n        self.attention_head_size = config.hidden_size // config.num_attention_heads\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.num_attention_heads, self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.query = prune_linear_layer(self.query, index)\n        self.key = prune_linear_layer(self.key, index)\n        self.value = prune_linear_layer(self.value, index)\n        self.dense = prune_linear_layer(self.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.num_attention_heads = self.num_attention_heads - len(heads)\n        self.all_head_size = self.attention_head_size * 
self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input_ids, attention_mask=None, head_mask=None):\n        mixed_query_layer = self.query(input_ids)\n        mixed_key_layer = self.key(input_ids)\n        mixed_value_layer = self.value(input_ids)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n\n        # Should find a better way to do this\n        w = (\n            self.dense.weight.t()\n            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)\n            .to(context_layer.dtype)\n        )\n        b = self.dense.bias.to(context_layer.dtype)\n\n        projected_context_layer = torch.einsum(\"bfnd,ndh->bfh\", context_layer, w) + b\n        projected_context_layer_dropout = self.dropout(projected_context_layer)\n        layernormed_context_layer = self.LayerNorm(input_ids + projected_context_layer_dropout)\n        return (layernormed_context_layer, attention_probs) if self.output_attentions else (layernormed_context_layer,)\n\n\nclass AlbertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.full_layer_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.attention = AlbertAttention(config)\n        self.ffn = nn.Linear(config.hidden_size, config.intermediate_size)\n        self.ffn_output = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        attention_output = self.attention(hidden_states, attention_mask, head_mask)\n        ffn_output = self.ffn(attention_output[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_output[0])\n\n        return (hidden_states,) + attention_output[1:]  # add attentions if we output them\n\n\nclass AlbertLayerGroup(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = 
nn.ModuleList([AlbertLayer(config) for _ in range(config.inner_group_num)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer(hidden_states, attention_mask, head_mask[layer_index])\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        return outputs  # last-layer hidden state, (layer hidden states), (layer attentions)\n\n\nclass AlbertTransformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = nn.Linear(config.embedding_size, config.hidden_size)\n        self.albert_layer_groups = nn.ModuleList([AlbertLayerGroup(config) for _ in range(config.num_hidden_groups)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                hidden_states,\n                attention_mask,\n                head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass AlbertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare ALBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertModel(AlbertPreTrainedModel):\n\n    config_class = AlbertConfig\n    load_tf_weights = load_tf_weights_in_albert\n    base_model_prefix = \"albert\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.config = config\n        self.embeddings = AlbertEmbeddings(config)\n        self.encoder = AlbertTransformer(config)\n        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)\n        self.pooler_activation = nn.Tanh()\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.embeddings.word_embeddings\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.embeddings.word_embeddings = new_embeddings\n        return self.embeddings.word_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            ALBERT has a different architecture in that its layers are shared across groups, which then has inner groups.\n            If an ALBERT model has 12 hidden layers and 2 hidden groups, with two inner groups, there\n            is a total of 4 different layers.\n\n            These layers are flattened: the indices [0,1] correspond to the two inner groups of the first hidden layer,\n            while [2,3] correspond to the two inner groups of the second hidden layer.\n\n            Any layer with in index other than [0,1,2,3] will result in an error.\n            See base class PreTrainedModel for more information about head pruning\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            group_idx = int(layer / self.config.inner_group_num)\n            inner_group_idx = int(layer - group_idx * self.config.inner_group_num)\n            self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` 
comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertModel, AlbertTokenizer\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertModel.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids, position_ids=position_ids, 
token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(embedding_output, extended_attention_mask, head_mask=head_mask)\n\n        sequence_output = encoder_outputs[0]\n\n        pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))\n\n        outputs = (sequence_output, pooled_output) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `sentence order prediction (classification)` head. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForPreTraining(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n        self.sop_classifier = AlbertSOPHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        sentence_order_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates original order (sequence A, then sequence B),\n            ``1`` indicates switched order (sequence B, then sequence A).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForPreTraining\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForPreTraining.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, sop_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output)\n\n        outputs = (prediction_scores, sop_scores,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and sentence_order_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            sentence_order_loss = loss_fct(sop_scores.view(-1, 2), sentence_order_label.view(-1))\n            total_loss = masked_lm_loss + sentence_order_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, 
sop_scores, (hidden_states), (attentions)\n\n\nclass AlbertMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = nn.LayerNorm(config.embedding_size)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n        self.decoder = nn.Linear(config.embedding_size, config.vocab_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n\n        prediction_scores = hidden_states\n\n        return prediction_scores\n\n\nclass AlbertSOPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, pooled_output):\n        dropout_pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\n    \"Albert Model with a `language modeling` head on top.\", ALBERT_START_DOCSTRING,\n)\nclass AlbertForMaskedLM(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with\n            labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the 
embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertTokenizer, AlbertForMaskedLM\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_outputs = outputs[0]\n\n        prediction_scores = self.predictions(sequence_outputs)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForSequenceClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification (or regression if config.num_labels==1) loss.\n        logits ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import AlbertTokenizer, AlbertForSequenceClassification\n            import torch\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n            labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, labels=labels)\n            loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = 
self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForTokenClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForTokenClassification\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForTokenClassification.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForQuestionAnswering(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Positions outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Positions outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        
start_scores ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-start scores (before SoftMax).\n        end_scores: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import AlbertTokenizer, AlbertForQuestionAnswering\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_dict = tokenizer.encode_plus(question, text, return_tensors='pt')\n        start_scores, end_scores = model(**input_dict)\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    EncoderDecoderConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_albert import (\n    AlbertForMaskedLM,\n    AlbertForPreTraining,\n    AlbertForQuestionAnswering,\n    AlbertForSequenceClassification,\n    AlbertForTokenClassification,\n    AlbertModel,\n)\nfrom .modeling_bart import BartForConditionalGeneration, BartForSequenceClassification, BartModel\nfrom .modeling_bert import (\n    BertForMaskedLM,\n    BertForMultipleChoice,\n    BertForPreTraining,\n    BertForQuestionAnswering,\n    BertForSequenceClassification,\n    BertForTokenClassification,\n    BertModel,\n)\nfrom .modeling_camembert import (\n    CamembertForMaskedLM,\n    CamembertForMultipleChoice,\n    CamembertForSequenceClassification,\n    CamembertForTokenClassification,\n    CamembertModel,\n)\nfrom .modeling_ctrl import CTRLLMHeadModel, CTRLModel\nfrom .modeling_distilbert import (\n    DistilBertForMaskedLM,\n    DistilBertForQuestionAnswering,\n    DistilBertForSequenceClassification,\n    DistilBertForTokenClassification,\n    DistilBertModel,\n)\nfrom .modeling_electra import (\n    ElectraForMaskedLM,\n    ElectraForPreTraining,\n    ElectraForSequenceClassification,\n    ElectraForTokenClassification,\n    ElectraModel,\n)\nfrom .modeling_encoder_decoder import EncoderDecoderModel\nfrom .modeling_flaubert import (\n    FlaubertForQuestionAnsweringSimple,\n    FlaubertForSequenceClassification,\n    FlaubertModel,\n    FlaubertWithLMHeadModel,\n)\nfrom .modeling_gpt2 import GPT2LMHeadModel, GPT2Model\nfrom .modeling_longformer import (\n    LongformerForMaskedLM,\n    LongformerForMultipleChoice,\n    LongformerForQuestionAnswering,\n    LongformerForSequenceClassification,\n    LongformerForTokenClassification,\n    LongformerModel,\n)\nfrom .modeling_marian import MarianMTModel\nfrom .modeling_openai import OpenAIGPTLMHeadModel, OpenAIGPTModel\nfrom .modeling_reformer import ReformerModel, ReformerModelWithLMHead\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\nfrom .modeling_t5 import T5ForConditionalGeneration, T5Model\nfrom .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel\nfrom .modeling_xlm import (\n    
XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMForTokenClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n)\nfrom .modeling_xlm_roberta import (\n    XLMRobertaForMaskedLM,\n    XLMRobertaForMultipleChoice,\n    XLMRobertaForSequenceClassification,\n    XLMRobertaForTokenClassification,\n    XLMRobertaModel,\n)\nfrom .modeling_xlnet import (\n    XLNetForMultipleChoice,\n    XLNetForQuestionAnsweringSimple,\n    XLNetForSequenceClassification,\n    XLNetForTokenClassification,\n    XLNetLMHeadModel,\n    XLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nMODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, T5Model),\n        (DistilBertConfig, DistilBertModel),\n        (AlbertConfig, AlbertModel),\n        (CamembertConfig, CamembertModel),\n        (XLMRobertaConfig, XLMRobertaModel),\n        (BartConfig, BartModel),\n        (LongformerConfig, LongformerModel),\n        (RobertaConfig, RobertaModel),\n        (BertConfig, BertModel),\n        (OpenAIGPTConfig, OpenAIGPTModel),\n        (GPT2Config, GPT2Model),\n        (TransfoXLConfig, TransfoXLModel),\n        (XLNetConfig, XLNetModel),\n        (FlaubertConfig, FlaubertModel),\n        (XLMConfig, XLMModel),\n        (CTRLConfig, CTRLModel),\n        (ElectraConfig, ElectraModel),\n        (ReformerConfig, ReformerModel),\n    ]\n)\n\nMODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForPreTraining),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForPreTraining),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForPreTraining),\n    ]\n)\n\nMODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForMaskedLM),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (MarianConfig, MarianMTModel),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForMaskedLM),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForMaskedLM),\n        (EncoderDecoderConfig, EncoderDecoderModel),\n        (ReformerConfig, ReformerModelWithLMHead),\n    ]\n)\n\nMODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForSequenceClassification),\n        (AlbertConfig, AlbertForSequenceClassification),\n        (CamembertConfig, CamembertForSequenceClassification),\n        (XLMRobertaConfig, 
XLMRobertaForSequenceClassification),\n        (BartConfig, BartForSequenceClassification),\n        (LongformerConfig, LongformerForSequenceClassification),\n        (RobertaConfig, RobertaForSequenceClassification),\n        (BertConfig, BertForSequenceClassification),\n        (XLNetConfig, XLNetForSequenceClassification),\n        (FlaubertConfig, FlaubertForSequenceClassification),\n        (XLMConfig, XLMForSequenceClassification),\n        (ElectraConfig, ElectraForSequenceClassification),\n    ]\n)\n\nMODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForQuestionAnswering),\n        (AlbertConfig, AlbertForQuestionAnswering),\n        (LongformerConfig, LongformerForQuestionAnswering),\n        (RobertaConfig, RobertaForQuestionAnswering),\n        (BertConfig, BertForQuestionAnswering),\n        (XLNetConfig, XLNetForQuestionAnsweringSimple),\n        (FlaubertConfig, FlaubertForQuestionAnsweringSimple),\n        (XLMConfig, XLMForQuestionAnsweringSimple),\n    ]\n)\n\nMODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForTokenClassification),\n        (CamembertConfig, CamembertForTokenClassification),\n        (XLMConfig, XLMForTokenClassification),\n        (XLMRobertaConfig, XLMRobertaForTokenClassification),\n        (LongformerConfig, LongformerForTokenClassification),\n        (RobertaConfig, RobertaForTokenClassification),\n        (BertConfig, BertForTokenClassification),\n        (XLNetConfig, XLNetForTokenClassification),\n        (AlbertConfig, AlbertForTokenClassification),\n        (ElectraConfig, ElectraForTokenClassification),\n    ]\n)\n\n\nMODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [\n        (CamembertConfig, CamembertForMultipleChoice),\n        (XLMRobertaConfig, XLMRobertaForMultipleChoice),\n        (LongformerConfig, LongformerForMultipleChoice),\n        (RobertaConfig, RobertaForMultipleChoice),\n        (BertConfig, BertForMultipleChoice),\n        (XLNetConfig, XLNetForMultipleChoice),\n    ]\n)\n\n\nclass AutoModel:\n    r\"\"\"\n        :class:`~transformers1.AutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        or the `AutoModel.from_config(config)` class methods.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModel is designed to be instantiated \"\n            \"using the `AutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerModel` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaModel` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModel` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraModel` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Model` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertModel` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertModel` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaModel` (XLM-RoBERTa model)\n            - `longformer` :class:`~transformers1.LongformerModel` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaModel` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertModel` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n            - `flaubert`: :class:`~transformers1.FlaubertModel` (Flaubert  model)\n            - `electra`: :class:`~transformers1.ElectraModel` (Electra  model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForPreTraining:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForPreTraining is designed to be instantiated \"\n            \"using the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelWithLMHead:\n    r\"\"\"\n        :class:`~transformers1.AutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelWithLMHead is designed to be instantiated \"\n            \"using the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForMaskedLM` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForMaskedLM` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelWithLMHead.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForSequenceClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForSequenceClassification` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n            - `roberta`: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n            - `xlnet`: 
:class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n            - `flaubert`: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForSequenceClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForQuestionAnswering:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForQuestionAnswering` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForQuestionAnswering` (Flaubert model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n            - `bert`: :class:`~transformers1.BertForQuestionAnswering` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n            - `flaubert`: :class:`~transformers1.FlaubertForQuestionAnswering` (Flaubert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a 
`directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForQuestionAnswering.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForTokenClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForTokenClassification` is a generic model class\n        that will be instantiated as one of the token classification model classes of the library\n        when created with the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForTokenClassification` (DistilBERT model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaForTokenClassification` (XLMRoberta model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForTokenClassification` (Bert model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForTokenClassification` (ALBERT model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForTokenClassification` (XLNet model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertForTokenClassification` (Camembert model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForTokenClassification` (Roberta model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the token classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForTokenClassification` (DistilBERT model)\n            - `xlm`: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForTokenClassification` (XLM-RoBERTa model)\n            - `camembert`: :class:`~transformers1.CamembertForTokenClassification` (Camembert model)\n            - `bert`: :class:`~transformers1.BertForTokenClassification` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForTokenClassification` (XLNet model)\n            - `roberta`: :class:`~transformers1.RobertaForTokenClassification` (Roberta model)\n      
      - `electra`: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n      
      model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: 
{}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BART model, ported from the fairseq repo.\"\"\"\nimport logging\nimport math\nimport random\nfrom typing import Dict, List, Optional, Tuple\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor, nn\n\nfrom .activations import ACT2FN\nfrom .configuration_bart import BartConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\n\nBART_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n    \"facebook/mbart-large-en-ro\",\n    # See all BART models at https://huggingface.co/models?filter=bart\n]\n\n\nBART_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matters related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BartConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\n\"\"\"\nBART_GENERATION_EXAMPLE = r\"\"\"\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForConditionalGeneration, BartConfig\n        # see ``examples/summarization/bart/evaluate_cnn.py`` for a longer example\n        model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')\n        tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')\n        ARTICLE_TO_SUMMARIZE = \"My friends are cool but they eat too many carbs.\"\n        inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')\n        # Generate Summary\n        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)\n        print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])\n\n\"\"\"\n\nBART_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n               Indices of input sequence tokens in the vocabulary. 
Use BartTokenizer.encode to produce them.\n            Padding will be ignored by default should you provide it.\n            Indices can be obtained using :class:`transformers1.BartTokenizer.encode(text)`.\n        attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices in input_ids.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            If you want to change padding behavior, you should read :func:`~transformers1.modeling_bart._prepare_decoder_inputs` and modify.\n            See diagram 1 in the paper for more info on the default strategy\n\"\"\"\n\n\ndef invert_mask(attention_mask):\n    assert attention_mask.dim() == 2\n    return attention_mask.eq(0)\n\n\ndef _prepare_bart_decoder_inputs(\n    config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32\n):\n    \"\"\"Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if\n    none are provided. This mimics the default behavior in fairseq. 
To override it pass in masks.\n    Note: this is not called during generation\n    \"\"\"\n    pad_token_id = config.pad_token_id\n    if decoder_input_ids is None:\n        decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)\n    bsz, tgt_len = decoder_input_ids.size()\n    if decoder_padding_mask is None:\n        decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)\n    else:\n        decoder_padding_mask = invert_mask(decoder_padding_mask)\n    causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(\n        dtype=causal_mask_dtype, device=decoder_input_ids.device\n    )\n    return decoder_input_ids, decoder_padding_mask, causal_mask\n\n\nclass PretrainedBartModel(PreTrainedModel):\n    config_class = BartConfig\n    base_model_prefix = \"model\"\n\n    def _init_weights(self, module):\n        std = self.config.init_std\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, SinusoidalPositionalEmbedding):\n            pass\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n\n    @property\n    def dummy_inputs(self):\n        pad_token = self.config.pad_token_id\n        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)\n        dummy_inputs = {\n            \"attention_mask\": input_ids.ne(pad_token),\n            \"input_ids\": input_ids,\n        }\n        return dummy_inputs\n\n\ndef _make_linear_from_emb(emb):\n    vocab_size, emb_size = emb.weight.shape\n    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)\n    lin_layer.weight.data = emb.weight.data\n    return lin_layer\n\n\n# Helper Functions, mostly for making masks\ndef _check_shapes(shape_1, shape2):\n    if shape_1 != shape2:\n        raise AssertionError(\"shape mismatch: {} != {}\".format(shape_1, shape2))\n\n\ndef shift_tokens_right(input_ids, pad_token_id):\n    \"\"\"Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).\"\"\"\n    prev_output_tokens = input_ids.clone()\n    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)\n    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()\n    prev_output_tokens[:, 1:] = input_ids[:, :-1]\n    return prev_output_tokens\n\n\ndef make_padding_mask(input_ids, padding_idx=1):\n    \"\"\"True for pad tokens\"\"\"\n    padding_mask = input_ids.eq(padding_idx)\n    if not padding_mask.any():\n        padding_mask = None\n    return padding_mask\n\n\n# Helper Modules\n\n\nclass EncoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.normalize_before = config.normalize_before\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)\n        
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(self, x, encoder_padding_mask):\n        \"\"\"\n        Args:\n            x (Tensor): input to the layer of shape `(seq_len, batch, embed_dim)`\n            encoder_padding_mask (ByteTensor): binary ByteTensor of shape\n                `(batch, src_len)` where padding elements are indicated by ``1``.\n            for t_tgt, t_src is excluded (or masked out), =0 means it is\n            included in attention\n\n        Returns:\n            encoded output of shape `(seq_len, batch, embed_dim)`\n        \"\"\"\n        residual = x\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        x, attn_weights = self.self_attn(\n            query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return x, attn_weights\n\n\nclass BartEncoder(nn.Module):\n    \"\"\"\n    Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer\n    is a :class:`EncoderLayer`.\n\n    Args:\n        config: BartConfig\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens):\n        super().__init__()\n\n        self.dropout = config.dropout\n        self.layerdrop = config.encoder_layerdrop\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        embed_dim = embed_tokens.embedding_dim\n        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_source_positions = config.max_position_embeddings\n\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx,\n            )\n        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.encoder_layers)])\n        self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity()\n        # mbart has one extra layer_norm\n        self.layer_norm = LayerNorm(config.d_model) if config.normalize_before else None\n\n    def forward(\n        self, input_ids, attention_mask=None,\n    ):\n        \"\"\"\n        Args:\n            input_ids (LongTensor): tokens in the source language of shape\n                `(batch, src_len)`\n            attention_mask (torch.LongTensor): indicating which indices are padding tokens.\n        Returns:\n            Tuple comprised of:\n                - **x** (Tensor): the last encoder layer's output of\n                  shape 
`(src_len, batch, embed_dim)`\n                - **encoder_states** (List[Tensor]): all intermediate\n                  hidden states of shape `(src_len, batch, embed_dim)`.\n                  Only populated if *self.output_hidden_states:* is True.\n                - **all_attentions** (List[Tensor]): Attention weights for each layer.\n                During training might not be of length n_layers because of layer dropout.\n        \"\"\"\n        # check attention mask and invert\n        if attention_mask is not None:\n            attention_mask = invert_mask(attention_mask)\n\n        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale\n        embed_pos = self.embed_positions(input_ids)\n        x = inputs_embeds + embed_pos\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # B x T x C -> T x B x C\n        x = x.transpose(0, 1)\n\n        encoder_states, all_attentions = [], []\n        for encoder_layer in self.layers:\n            if self.output_hidden_states:\n                encoder_states.append(x)\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):  # skip the layer\n                attn = None\n            else:\n                x, attn = encoder_layer(x, attention_mask)\n\n            if self.output_attentions:\n                all_attentions.append(attn)\n\n        if self.layer_norm:\n            x = self.layer_norm(x)\n        if self.output_hidden_states:\n            encoder_states.append(x)\n\n        # T x B x C -> B x T x C\n        encoder_states = [hidden_state.transpose(0, 1) for hidden_state in encoder_states]\n        x = x.transpose(0, 1)\n\n        return x, encoder_states, all_attentions\n\n\nclass DecoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.normalize_before = config.normalize_before\n\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.encoder_attn = SelfAttention(\n            self.embed_dim,\n            config.decoder_attention_heads,\n            dropout=config.attention_dropout,\n            encoder_decoder_attention=True,\n        )\n        self.encoder_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)\n        self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(\n        self,\n        x,\n        encoder_hidden_states,\n        encoder_attn_mask=None,\n        layer_state=None,\n        causal_mask=None,\n        decoder_padding_mask=None,\n    ):\n        residual = x\n\n        if layer_state is None:\n            layer_state = {}\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        # Self Attention\n\n        x, self_attn_weights = self.self_attn(\n            query=x,\n            key=x,\n            
layer_state=layer_state,  # adds keys to layer state\n            key_padding_mask=decoder_padding_mask,\n            attn_mask=causal_mask,\n            need_weights=self.output_attentions,\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        # Cross attention\n        residual = x\n        assert self.encoder_attn.cache_key != self.self_attn.cache_key\n        if self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n        x, _ = self.encoder_attn(\n            query=x,\n            key=encoder_hidden_states,\n            key_padding_mask=encoder_attn_mask,\n            layer_state=layer_state,  # mutates layer state\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n\n        # Fully Connected\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return (\n            x,\n            self_attn_weights,\n            layer_state,\n        )  # just self_attn weights for now, following t5, layer_state = cache for decoding\n\n\nclass BartDecoder(nn.Module):\n    \"\"\"\n    Transformer decoder consisting of *config.decoder_layers* layers. Each layer\n    is a :class:`DecoderLayer`.\n    Args:\n        config: BartConfig\n        embed_tokens (torch.nn.Embedding): output embedding\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens: nn.Embedding):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.dropout = config.dropout\n        self.layerdrop = config.decoder_layerdrop\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_target_positions = config.max_position_embeddings\n        self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, config.pad_token_id\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, self.padding_idx,\n            )\n        self.layers = nn.ModuleList(\n            [DecoderLayer(config) for _ in range(config.decoder_layers)]\n        )  # type: List[DecoderLayer]\n        self.layernorm_embedding = LayerNorm(config.d_model) if config.normalize_embedding else nn.Identity()\n        self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else None\n\n    def forward(\n        self,\n        input_ids,\n        encoder_hidden_states,\n        encoder_padding_mask,\n        decoder_padding_mask,\n        decoder_causal_mask,\n        decoder_cached_states=None,\n        use_cache=False,\n        **unused\n    ):\n        \"\"\"\n        Includes several features from \"Jointly Learning to 
Align and\n        Translate with Transformer Models\" (Garg et al., EMNLP 2019).\n\n        Args:\n            input_ids (LongTensor): previous decoder outputs of shape\n                `(batch, tgt_len)`, for teacher forcing\n            encoder_hidden_states: output from the encoder, used for\n                encoder-side attention\n            encoder_padding_mask: for ignoring pad tokens\n            decoder_cached_states (dict or None): dictionary used for storing state during generation\n\n        Returns:\n            tuple:\n                - the decoder's features of shape `(batch, tgt_len, embed_dim)`\n                - hidden states\n                - attentions\n        \"\"\"\n        # check attention mask and invert\n        if encoder_padding_mask is not None:\n            encoder_padding_mask = invert_mask(encoder_padding_mask)\n\n        # embed positions\n        positions = self.embed_positions(input_ids, use_cache=use_cache)\n\n        if use_cache:\n            input_ids = input_ids[:, -1:]\n            positions = positions[:, -1:]  # happens after we embed them\n            # assert input_ids.ne(self.padding_idx).any()\n\n        x = self.embed_tokens(input_ids) * self.embed_scale\n        x += positions\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # Convert to Bart output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        # decoder layers\n        all_hidden_states = ()\n        all_self_attns = ()\n        next_decoder_cache = []\n        for idx, decoder_layer in enumerate(self.layers):\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            if self.output_hidden_states:\n                all_hidden_states += (x,)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            layer_state = decoder_cached_states[idx] if decoder_cached_states is not None else None\n\n            x, layer_self_attn, layer_past = decoder_layer(\n                x,\n                encoder_hidden_states,\n                encoder_attn_mask=encoder_padding_mask,\n                decoder_padding_mask=decoder_padding_mask,\n                layer_state=layer_state,\n                causal_mask=decoder_causal_mask,\n            )\n\n            if use_cache:\n                next_decoder_cache.append(layer_past.copy())\n\n            if self.layer_norm and (idx == len(self.layers) - 1):  # last layer of mbart\n                x = self.layer_norm(x)\n            if self.output_attentions:\n                all_self_attns += (layer_self_attn,)\n\n        # Convert to standard output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        all_hidden_states = [hidden_state.transpose(0, 1) for hidden_state in all_hidden_states]\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        if use_cache:\n            next_cache = ((encoder_hidden_states, encoder_padding_mask), next_decoder_cache)\n        else:\n            next_cache = None\n        return x, next_cache, all_hidden_states, list(all_self_attns)\n\n\ndef _reorder_buffer(attn_cache, new_order):\n    for k, input_buffer_k in attn_cache.items():\n        if input_buffer_k is not None:\n            attn_cache[k] = 
input_buffer_k.index_select(0, new_order)\n    return attn_cache\n\n\nclass SelfAttention(nn.Module):\n    \"\"\"Multi-headed attention from 'Attention Is All You Need' paper\"\"\"\n\n    def __init__(\n        self,\n        embed_dim,\n        num_heads,\n        dropout=0.0,\n        bias=True,\n        encoder_decoder_attention=False,  # otherwise self_attention\n    ):\n        super().__init__()\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.dropout = dropout\n        self.head_dim = embed_dim // num_heads\n        assert self.head_dim * num_heads == self.embed_dim, \"embed_dim must be divisible by num_heads\"\n        self.scaling = self.head_dim ** -0.5\n\n        self.encoder_decoder_attention = encoder_decoder_attention\n        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.cache_key = \"encoder_decoder\" if self.encoder_decoder_attention else \"self\"\n\n    def _shape(self, tensor, dim_0, bsz):\n        return tensor.contiguous().view(dim_0, bsz * self.num_heads, self.head_dim).transpose(0, 1)\n\n    def forward(\n        self,\n        query,\n        key: Optional[Tensor],\n        key_padding_mask: Optional[Tensor] = None,\n        layer_state: Optional[Dict[str, Optional[Tensor]]] = None,\n        attn_mask: Optional[Tensor] = None,\n        need_weights=False,\n    ) -> Tuple[Tensor, Optional[Tensor]]:\n        \"\"\"Input shape: Time(SeqLen) x Batch x Channel\"\"\"\n        static_kv: bool = self.encoder_decoder_attention\n        tgt_len, bsz, embed_dim = query.size()\n        assert embed_dim == self.embed_dim\n        assert list(query.size()) == [tgt_len, bsz, embed_dim]\n        # get here for encoder decoder cause of static_kv\n        if layer_state is not None:  # reuse k,v and encoder_padding_mask\n            saved_state = layer_state.get(self.cache_key, {})\n            if \"prev_key\" in saved_state:\n                # previous time steps are cached - no need to recompute key and value if they are static\n                if static_kv:\n                    key = None\n        else:\n            saved_state = None\n            layer_state = {}\n\n        q = self.q_proj(query) * self.scaling\n        if static_kv:\n            if key is None:\n                k = v = None\n            else:\n                k = self.k_proj(key)\n                v = self.v_proj(key)\n        else:\n            k = self.k_proj(query)\n            v = self.v_proj(query)\n\n        q = self._shape(q, tgt_len, bsz)\n        if k is not None:\n            k = self._shape(k, -1, bsz)\n        if v is not None:\n            v = self._shape(v, -1, bsz)\n\n        if saved_state is not None:\n            k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)\n\n        # Update cache\n        layer_state[self.cache_key] = {\n            \"prev_key\": k.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_value\": v.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_key_padding_mask\": key_padding_mask if not static_kv else None,\n        }\n\n        assert k is not None\n        src_len = k.size(1)\n        attn_weights = torch.bmm(q, k.transpose(1, 2))\n        assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)\n\n        if attn_mask 
is not None:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_mask\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n\n        # This is part of a workaround to get around fork/join parallelism not supporting Optional types.\n        if key_padding_mask is not None and key_padding_mask.dim() == 0:\n            key_padding_mask = None\n        assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)\n\n        if key_padding_mask is not None:  # don't attend to padding symbols\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n            reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)\n            attn_weights = attn_weights.masked_fill(reshaped, float(\"-inf\"))\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n        attn_weights = F.softmax(attn_weights, dim=-1)\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)\n\n        assert v is not None\n        attn_output = torch.bmm(attn_probs, v)\n        assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)\n        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)\n        attn_output = self.out_proj(attn_output)\n        if need_weights:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n        else:\n            attn_weights = None\n        return attn_output, attn_weights\n\n    def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):\n        # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)\n        if \"prev_key\" in saved_state:\n            _prev_key = saved_state[\"prev_key\"]\n            assert _prev_key is not None\n            prev_key = _prev_key.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                k = prev_key\n            else:\n                assert k is not None\n                k = torch.cat([prev_key, k], dim=1)\n        if \"prev_value\" in saved_state:\n            _prev_value = saved_state[\"prev_value\"]\n            assert _prev_value is not None\n            prev_value = _prev_value.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                v = prev_value\n            else:\n                assert v is not None\n                v = torch.cat([prev_value, v], dim=1)\n        assert k is not None and v is not None\n        prev_key_padding_mask: Optional[Tensor] = saved_state.get(\"prev_key_padding_mask\", None)\n        key_padding_mask = self._cat_prev_key_padding_mask(\n            key_padding_mask, prev_key_padding_mask, bsz, k.size(1), static_kv\n        )\n        return k, v, key_padding_mask\n\n    @staticmethod\n    def _cat_prev_key_padding_mask(\n        key_padding_mask: Optional[Tensor],\n        prev_key_padding_mask: Optional[Tensor],\n        batch_size: int,\n        src_len: int,\n        static_kv: bool,\n    ) -> Optional[Tensor]:\n        # saved key padding masks have shape (bsz, seq_len)\n        if prev_key_padding_mask is not None:\n            if static_kv:\n                new_key_padding_mask = prev_key_padding_mask\n            else:\n                new_key_padding_mask = torch.cat([prev_key_padding_mask, key_padding_mask], dim=1)\n\n        elif key_padding_mask is not None:\n            filler = torch.zeros(\n                batch_size,\n                src_len - 
key_padding_mask.size(1),\n                dtype=key_padding_mask.dtype,\n                device=key_padding_mask.device,\n            )\n            new_key_padding_mask = torch.cat([filler, key_padding_mask], dim=1)\n        else:\n            new_key_padding_mask = prev_key_padding_mask\n        return new_key_padding_mask\n\n\nclass BartClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    # This can trivially be shared with RobertaClassificationHead\n\n    def __init__(\n        self, input_dim, inner_dim, num_classes, pooler_dropout,\n    ):\n        super().__init__()\n        self.dense = nn.Linear(input_dim, inner_dim)\n        self.dropout = nn.Dropout(p=pooler_dropout)\n        self.out_proj = nn.Linear(inner_dim, num_classes)\n\n    def forward(self, x):\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\nclass LearnedPositionalEmbedding(nn.Embedding):\n    \"\"\"\n    This module learns positional embeddings up to a fixed maximum size.\n    Padding ids are ignored by either offsetting based on padding_idx\n    or by setting padding_idx to None and ensuring that the appropriate\n    position ids are passed to the forward function.\n    \"\"\"\n\n    def __init__(\n        self, num_embeddings: int, embedding_dim: int, padding_idx: int,\n    ):\n        # if padding_idx is specified then offset the embedding ids by\n        # this index and adjust num_embeddings appropriately\n        assert padding_idx is not None\n        num_embeddings += padding_idx + 1  # WHY?\n        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)\n\n    def forward(self, input, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        if use_cache:  # the position is our current step in the decoded sequence\n            pos = int(self.padding_idx + input.size(1))\n            positions = input.data.new(1, 1).fill_(pos)\n        else:\n            positions = create_position_ids_from_input_ids(input, self.padding_idx)\n        return super().forward(positions)\n\n\ndef LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True):\n    if torch.cuda.is_available():\n        try:\n            from apex.normalization import FusedLayerNorm\n\n            return FusedLayerNorm(normalized_shape, eps, elementwise_affine)\n        except ImportError:\n            pass\n    return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)\n\n\ndef fill_with_neg_inf(t):\n    \"\"\"FP16-compatible function that fills a input_ids with -inf.\"\"\"\n    return t.float().fill_(float(\"-inf\")).type_as(t)\n\n\ndef _filter_out_falsey_values(tup) -> Tuple:\n    \"\"\"Remove entries that are None or [] from an iterable.\"\"\"\n    return tuple(x for x in tup if isinstance(x, torch.Tensor) or x)\n\n\n# Public API\ndef _get_shape(t):\n    return getattr(t, \"shape\", None)\n\n\n@add_start_docstrings(\n    \"The bare BART Model outputting raw hidden-states without any specific head on top.\", BART_START_DOCSTRING,\n)\nclass BartModel(PretrainedBartModel):\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        padding_idx, vocab_size = config.pad_token_id, config.vocab_size\n        self.shared = nn.Embedding(vocab_size, config.d_model, 
padding_idx)\n\n        self.encoder = BartEncoder(config, self.shared)\n        self.decoder = BartDecoder(config, self.shared)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        decoder_input_ids=None,\n        encoder_outputs: Optional[Tuple] = None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        use_cache=False,\n    ):\n\n        # make masks if user doesn't supply\n        if not use_cache:\n            decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(\n                self.config,\n                input_ids,\n                decoder_input_ids=decoder_input_ids,\n                decoder_padding_mask=decoder_attention_mask,\n                causal_mask_dtype=self.shared.weight.dtype,\n            )\n        else:\n            decoder_padding_mask, causal_mask = None, None\n\n        assert decoder_input_ids is not None\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)\n        assert isinstance(encoder_outputs, tuple)\n        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            encoder_outputs[0],\n            attention_mask,\n            decoder_padding_mask,\n            decoder_causal_mask=causal_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        # Attention and hidden_states will be [] or None if they aren't needed\n        decoder_outputs: Tuple = _filter_out_falsey_values(decoder_outputs)\n        assert isinstance(decoder_outputs[0], torch.Tensor)\n        encoder_outputs: Tuple = _filter_out_falsey_values(encoder_outputs)\n        return decoder_outputs + encoder_outputs\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, value):\n        self.shared = value\n        self.encoder.embed_tokens = self.shared\n        self.decoder.embed_tokens = self.shared\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"The BART Model with a language modeling head. 
Can be used for summarization.\",\n    BART_START_DOCSTRING + BART_GENERATION_EXAMPLE,\n)\nclass BartForConditionalGeneration(PretrainedBartModel):\n    base_model_prefix = \"model\"\n\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        base_model = BartModel(config)\n        self.model = base_model\n        self.register_buffer(\"final_logits_bias\", torch.zeros((1, self.model.shared.num_embeddings)))\n\n    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:\n        old_num_tokens = self.model.shared.num_embeddings\n        new_embeddings = super().resize_token_embeddings(new_num_tokens)\n        self.model.shared = new_embeddings\n        self._resize_final_logits_bias(new_num_tokens, old_num_tokens)\n        return new_embeddings\n\n    def _resize_final_logits_bias(self, new_num_tokens: int, old_num_tokens: int) -> None:\n        if new_num_tokens <= old_num_tokens:\n            new_bias = self.final_logits_bias[:, :new_num_tokens]\n        else:\n            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)\n            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)\n        self.register_buffer(\"final_logits_bias\", new_bias)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        lm_labels=None,\n        use_cache=False,\n        **unused\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring).\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens\n            with labels\n            in ``[0, ..., config.vocab_size]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            
heads.\n\n    Examples::\n\n            # Mask filling only works for bart-large\n            from transformers1 import BartTokenizer, BartForConditionalGeneration\n            tokenizer = BartTokenizer.from_pretrained('bart-large')\n            TXT = \"My friends are <mask> but they eat too many carbs.\"\n            model = BartForConditionalGeneration.from_pretrained('bart-large')\n            input_ids = tokenizer.batch_encode_plus([TXT], return_tensors='pt')['input_ids']\n            logits = model(input_ids)[0]\n            masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()\n            probs = logits[0, masked_index].softmax(dim=0)\n            values, predictions = probs.topk(5)\n            tokenizer.decode(predictions).split()\n            # ['good', 'great', 'all', 'really', 'very']\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            encoder_outputs=encoder_outputs,\n            decoder_attention_mask=decoder_attention_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        lm_logits = F.linear(outputs[0], self.model.shared.weight, bias=self.final_logits_bias)\n        outputs = (lm_logits,) + outputs[1:]  # Add cache, hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # TODO(SS): do we need to ignore pad tokens in lm_labels?\n            masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n    def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step, decoder_cached_states are empty\n        if not past[1]:\n            encoder_outputs, decoder_cached_states = past, None\n        else:\n            encoder_outputs, decoder_cached_states = past\n        return {\n            \"input_ids\": None,  # encoder_outputs is defined. 
input_ids not needed\n            \"encoder_outputs\": encoder_outputs,\n            \"decoder_cached_states\": decoder_cached_states,\n            \"decoder_input_ids\": decoder_input_ids,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,  # change this to avoid caching (presumably for debugging)\n        }\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        if cur_len == 1:\n            self._force_token_ids_generation(logits, self.config.bos_token_id)\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n\n    def _force_token_ids_generation(self, scores, token_ids) -> None:\n        \"\"\"force one of token_ids to be generated by setting prob of all other tokens to 0\"\"\"\n        if isinstance(token_ids, int):\n            token_ids = [token_ids]\n        all_but_token_ids_mask = torch.tensor(\n            [x for x in range(self.config.vocab_size) if x not in token_ids],\n            dtype=torch.long,\n            device=next(self.parameters()).device,\n        )\n        assert len(scores.shape) == 2, \"scores should be of rank 2 with shape: [batch_size, vocab_size]\"\n        scores[:, all_but_token_ids_mask] = -float(\"inf\")\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        ((enc_out, enc_mask), decoder_cached_states) = past\n        reordered_past = []\n        for layer_past in decoder_cached_states:\n            # get the correct batch idx from decoder layer's batch dim for cross and self-attn\n            layer_past_new = {\n                attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items()\n            }\n            reordered_past.append(layer_past_new)\n\n        new_enc_out = enc_out if enc_out is None else enc_out.index_select(0, beam_idx)\n        new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select(0, beam_idx)\n\n        past = ((new_enc_out, new_enc_mask), reordered_past)\n        return past\n\n    def get_encoder(self):\n        return self.model.encoder\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.model.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"\"\"Bart model with a sequence classification/head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BART_START_DOCSTRING,\n)\nclass BartForSequenceClassification(PretrainedBartModel):\n    def __init__(self, config: BartConfig, **kwargs):\n        super().__init__(config, **kwargs)\n        self.model = BartModel(config)\n        self.classification_head = BartClassificationHead(\n            config.d_model, config.d_model, config.num_labels, config.classif_dropout,\n        )\n        self.model._init_weights(self.classification_head.dense)\n        self.model._init_weights(self.classification_head.out_proj)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BartConfig`) and inputs:\n            loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n                Classification loss (cross entropy)\n            logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n                Attentions weights after the attention softmax, used to compute the weighted average in the\n                self-attention\n                heads.\n\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForSequenceClassification\n        import torch\n\n        tokenizer = BartTokenizer.from_pretrained('bart-large')\n        model = BartForSequenceClassification.from_pretrained('bart-large')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\",\n        add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            decoder_attention_mask=decoder_attention_mask,\n            encoder_outputs=encoder_outputs,\n        )\n        x = outputs[0]  # last hidden state\n        eos_mask = input_ids.eq(self.config.eos_token_id)\n        if 
len(torch.unique(eos_mask.sum(1))) > 1:\n            raise ValueError(\"All examples must have the same number of <eos> tokens.\")\n        sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]\n        logits = self.classification_head(sentence_representation)\n        # Prepend logits\n        outputs = (logits,) + outputs[1:]  # Add hidden states and attention if they are here\n        if labels is not None:  # prepend loss to output,\n            loss = F.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\nclass SinusoidalPositionalEmbedding(nn.Embedding):\n    \"\"\"This module produces sinusoidal positional embeddings of any length.\"\"\"\n\n    def __init__(self, num_positions, embedding_dim, padding_idx=None):\n        super().__init__(num_positions, embedding_dim)\n        if embedding_dim % 2 != 0:\n            raise NotImplementedError(f\"odd embedding_dim {embedding_dim} not supported\")\n        self.weight = self._init_weight(self.weight)\n\n    @staticmethod\n    def _init_weight(out: nn.Parameter):\n        \"\"\"Identical to the XLM create_sinusoidal_embeddings except features are not interleaved.\n            The cos features are in the 2nd half of the vector. [dim // 2:]\n        \"\"\"\n        n_pos, dim = out.shape\n        position_enc = np.array(\n            [[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)]\n        )\n        out[:, 0 : dim // 2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # This line breaks for odd n_pos\n        out[:, dim // 2 :] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n        out.detach_()\n        out.requires_grad = False\n        return out\n\n    @torch.no_grad()\n    def forward(self, input_ids, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        bsz, seq_len = input_ids.shape[:2]\n        if use_cache:\n            positions = input_ids.data.new(1, 1).fill_(seq_len - 1)  # called before slicing\n        else:\n            # starts at 0, ends at 1-seq_len\n            positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)\n        return super().forward(positions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_beam_search.py",
    "content": "# coding=utf-8\n# Copyright (c) 2019 Yang Liu\n\n# Permission is hereby granted, free of charge, to any person obtaining a copy\n# of this software and associated documentation files (the \"Software\"), to deal\n# in the Software without restriction, including without limitation the rights\n# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n# copies of the Software, and to permit persons to whom the Software is\n# furnished to do so, subject to the following conditions:\n\n# The above copyright notice and this permission notice shall be included in all\n# copies or substantial portions of the Software.\n\n# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n# SOFTWARE.\n\"\"\"\nA general wrapper around models with LM heads to generate sequences\nusing beam search.\n\"\"\"\nimport torch\nfrom torch import nn\n\n\nclass TransformerBeamSearch(nn.Module):\n    def __init__(\n        self,\n        model,\n        tokenizer,\n        batch_size,\n        beam_size,\n        min_length,\n        max_length,\n        alpha=0,\n        block_repeating_trigram=True,\n    ):\n        \"\"\"\n        Attributes:\n            mask_word_id: token id that corresponds to the mask\n        \"\"\"\n        super(TransformerBeamSearch, self).__init__()\n        self.model = model\n        self.tokenizer = tokenizer\n\n        self.start_token_id = tokenizer.start_token_id\n        self.end_token_id = tokenizer.end_token_id\n        self.pad_token_id = tokenizer.pad_token_id\n\n        self.beam_size = beam_size\n        self.min_length = min_length\n        self.max_length = max_length\n\n        self.block_repeating_trigram = block_repeating_trigram\n        self.apply_length_penalty = False if alpha == 0 else True\n        self.alpha = alpha\n\n        # State of the beam\n        self.hypotheses = [[] for _ in range(batch_size)]\n        self.batch_offset = torch.arange(batch_size, dtype=torch.long)\n        self.beam_offset = torch.arange(\n            0, batch_size * self.beam_size, step=self.beam_size, dtype=torch.long\n        )\n        self.growing_beam = torch.full(\n            (batch_size * self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        self.topk_log_probabilities = torch.tensor(\n            [0.0] + [float(\"-inf\")] * (self.beam_size - 1), dtype=torch.float\n        ).repeat(batch_size)\n        self.results = {\n            \"prediction\": [[] for _ in batch_size],\n            \"scores\": [[] for _ in batch_size],\n        }\n        self._step = 0\n        self.is_done = False\n\n    def step(self, log_probabilities):\n        \"\"\" Grows the beam by one step. 
\"\"\"\n        self._step += 1\n\n        # The batch size changes as some beams finish so we define _B\n        vocab_size = log_probabilities.size(-1)\n        _B = log_probabilities.size(0) // self.beam_size\n\n        # Multiply each beam probability with the probability of the\n        # next token (conditioned on the words in the beam).\n        log_probabilities += self.topk_log_probabilities.view(-1, 1)\n\n        self.enforce_min_length(log_probabilities)\n        if self.block_repeating_trigram:\n            self.remove_repeating_trigrams(log_probabilities, _B)\n\n        # Find the `beam_size` (previous_beam + token) combinations with\n        # the highest score\n        topk_log_probabilities, topk_ids = log_probabilities.topk(\n            log_probabilities.view(_B, self.beam_size * vocab_size),\n            self.beam_size,\n            dim=1,\n        )\n\n        # Apply the length penalty. The +1 accounts for the [EOS] token\n        # that will be added if the beam ends.\n        topk_scores = topk_log_probabilities / self.length_penalty()\n\n        # Retrieve the corresponding respective beam and token id\n        # topk_token_ids[i] will be added to topk_beam_ids[i]\n        topk_beam_ids = topk_ids.div(vocab_size)\n        topk_token_ids = topk_ids.fmod(vocab_size)\n\n        # Retrieve the row index of the surviving beams in the original\n        # view of the log_probabilities tensor\n        surviving_beams_rows = (topk_beam_ids + self.beam_offset[:_B].view(-1, 1)).view(\n            -1\n        )\n\n        # Append the last predictions\n        self.growing_beam = torch.cat(\n            [\n                self.growing_beam.index_select(0, surviving_beams_rows),\n                topk_token_ids.view(-1, 1),\n            ],\n            1,\n        )\n\n        # Check if any of the beam searches has ended during this\n        # growth step. 
Also if top beam (most probable) has ended\n        # for one element of the batch.\n        is_finished = topk_token_ids.eq(self.end_token_id)\n        self.enforce_max_length()\n        is_top_beam_finished = is_finished[:, 0].eq(1)\n\n        # Save the finished searches\n        if is_finished.any():\n            predictions = self.growing_beam.view(\n                -1, self.beam_size, self.growing_beam.size(1)\n            )\n            for i in range(is_finished.size(0)):\n                if is_top_beam_finished[i]:\n                    is_finished[i].fill_(1)\n                finished_hyp = is_finished[i].nonzero().view(-1)\n\n                # Store finished hypotheses for this batch.\n                b = self.batch_offset[i]\n                for j in finished_hyp:\n                    self.hypotheses[b].append((topk_scores[i, j], predictions[i, j, :]))\n\n                # If the batch reached the end, save the best hypotheses\n                # in terms of length-penalized score.\n                if is_top_beam_finished[i]:\n                    best_hyp = sorted(\n                        self.hypotheses[b], key=lambda x: x[0], reverse=True\n                    )\n                    best_score, best_prediction = best_hyp[0]\n                    self.results[\"scores\"][b].append(best_score)\n                    self.results[\"predictions\"][b].append(best_prediction)\n\n            non_finished = is_top_beam_finished.eq(0).nonzero().view(-1)\n            if len(non_finished) == 0:\n                self.is_done = True\n\n            # Remove finished batches for the next step.\n            topk_log_probabilities = topk_log_probabilities.index_select(\n                0, non_finished\n            )\n            self.batch_offset = self.batch_offset.index_select(0, non_finished)\n            self.growing_beam = predictions.index_select(0, non_finished).view(\n                -1, self.growing_beam.size(-1)\n            )\n\n            surviving_beams_rows = surviving_beams_rows.index_select(0, non_finished)\n\n        return surviving_beams_rows\n\n    def forward(self, encoder_input_ids, **kwargs):\n        # keyword arguments come in 3 flavors: encoder-specific (prefixed by\n        # `encoder_`), decoder-specific (prefixed by `decoder_`) and those\n        # that apply to the model as whole.\n        # We let the specific kwargs override the common ones in case of conflict.\n        kwargs_encoder = {\n            argument[len(\"encoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"encoder_\")\n        }\n        kwargs_decoder = {\n            argument[len(\"decoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"decoder_\")\n        }\n        kwargs_common = {\n            argument: value\n            for argument, value in kwargs.items()\n            if not (argument.startswith(\"encoder_\") or argument.startswith(\"decoder_\"))\n        }\n        kwargs_decoder = dict(kwargs_common, **kwargs_decoder)\n        kwargs_encoder = dict(kwargs_common, **kwargs_encoder)\n\n        # forward pass on the encoder\n        encoder_outputs = self.model.encoder.forward(encoder_input_ids, kwargs_encoder)\n        kwargs_decoder[\"encoder_hidden_states\"] = tile(\n            encoder_outputs, self.beam_size, dim=0\n        )\n\n        # grow the beam by generating sequences in an autoregressive way\n        self.growing_beam = torch.full(\n            (self.batch_size * 
self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        for step in range(self.max_length):\n            decoder_input = self.growing_beam[:, -1]\n            outputs = self.model.decoder(decoder_input, **kwargs_decoder)\n            log_probabilities = torch.nn.functional.log_softmax(outputs[1], dim=-1)\n            surviving_beams_rows = self.step(log_probabilities)\n            if self.is_done:\n                break\n\n            kwargs_decoder[\"encoder_hidden_states\"] = kwargs_decoder[\n                \"encoder_hidden_states\"\n            ].index_select(0, surviving_beams_rows)\n\n        return self.results\n\n    def remove_repeating_trigrams(self, log_probabilities, _B):\n        if self._step + 1 > 3:\n            for i in range(_B * self.beam_size):\n                tokens = [t for t in self.growing_beam[i]]\n                trigrams = [\n                    (tokens[j - 1], tokens[j], tokens[j + 1])\n                    for j in range(1, len(tokens) - 1)\n                ]\n                last_trigram = tuple(trigrams[-1])\n                # Kill beams whose last trigram already appeared earlier.\n                if last_trigram in trigrams[:-1]:\n                    log_probabilities[i] = -1e20\n\n    def enforce_min_length(self, log_probabilities):\n        # Forbid [EOS] until the minimum length has been reached.\n        if self._step < self.min_length:\n            log_probabilities[:, self.end_token_id] = -1e20\n\n    def enforce_max_length(self, is_finished):\n        # Force every beam to finish once the maximum length is reached.\n        if self._step + 1 == self.max_length:\n            is_finished.fill_(1)\n\n    def length_penalty(self):\n        return ((5.0 + (self._step + 1)) / 6.0) ** self.alpha\n\n\ndef tile(x, count, dim=0):\n    \"\"\"\n    Tiles `x` along dimension `dim` `count` times.\n\n    Example:\n        >>> ex = torch.tensor([[1, 2], [3, 4]])\n        >>> tile(ex, 2, 0)\n        tensor([[1, 2], [1, 2], [3, 4], [3, 4]])\n    \"\"\"\n    perm = list(range(len(x.size())))\n    if dim != 0:\n        perm[0], perm[dim] = perm[dim], perm[0]\n        x = x.permute(perm).contiguous()\n    out_size = list(x.size())\n    out_size[0] *= count\n    batch = x.size(0)\n    x = (\n        x.view(batch, -1)\n        .transpose(0, 1)\n        .repeat(count, 1)\n        .transpose(0, 1)\n        .contiguous()\n        .view(*out_size)\n    )\n    if dim != 0:\n        x = x.permute(perm).contiguous()\n    return x\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BERT model. \"\"\"\n\n\nimport logging\nimport math\nimport os\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import gelu, gelu_new, swish\nfrom .configuration_bert import BertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"bert-base-german-dbmdz-cased\",\n    \"bert-base-german-dbmdz-uncased\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\"gelu\": gelu, \"relu\": torch.nn.functional.relu, \"swish\": swish, \"gelu_new\": gelu_new, \"mish\": mish}\n\n\nBertLayerNorm = torch.nn.LayerNorm\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any 
TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n\n        seq_length = input_shape[1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = 
self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        
head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass BertEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        
all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        
seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, BertLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass BertModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = BertEncoder(config)\n        self.pooler = BertPooler(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n 
       # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # 
(loss), prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass BertForMaskedLM(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n     
       from transformers1 import BertTokenizer, BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        if lm_labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            lm_labels = lm_labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            ltr_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (ltr_lm_loss,) + outputs\n\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    
\"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\", BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')\n\n        loss, logits = model(**encoding, next_sentence_label=torch.LongTensor([1]))\n        assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            
head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_score = self.cls(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import 
torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        labels = torch.tensor(0) # choice0 is correct (according to Wikipedia ;))\n\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='pt', pad_to_max_length=True)\n        outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1\n\n        # the linear classifier still needs to be trained\n        loss, logits = outputs[:2]\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # 
Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2019 Inria, Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch CamemBERT model. \"\"\"\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"camembert-base\",\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the\n            configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD\n    (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits` \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForQuestionAnswering(RobertaForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size, dtype):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(\n        torch.arange(position, dtype=dtype).unsqueeze(1),\n        torch.arange(d_model_size, dtype=dtype).unsqueeze(0),\n        d_model_size,\n    )\n\n    sines = torch.sin(angle_rads[:, 0::2])\n    cosines = torch.cos(angle_rads[:, 1::2])\n\n    pos_encoding = torch.cat([sines, cosines], dim=-1)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = torch.matmul(q, k.permute(0, 1, 3, 2))\n\n    dk = k.shape[-1]\n    scaled_attention_logits = matmul_qk / np.sqrt(dk)\n\n    if mask is not None:\n        nd, ns = scaled_attention_logits.size(-2), scaled_attention_logits.size(-1)\n        scaled_attention_logits += mask[ns - nd : ns, :ns] * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = torch.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass MultiHeadAttention(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, output_attentions=False):\n        super().__init__()\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wk = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wv = torch.nn.Linear(d_model_size, d_model_size)\n\n        self.dense = torch.nn.Linear(d_model_size, d_model_size)\n\n    def split_into_heads(self, x, batch_size):\n        x = x.reshape(batch_size, -1, self.num_heads, self.depth)\n        return x.permute([0, 2, 1, 3])\n\n    def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        batch_size = q.shape[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0], layer_past[1]\n            k = torch.cat((past_key, k), dim=-2)\n            v = torch.cat((past_value, v), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((k, v))\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = output[0].permute([0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff):\n    return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff), torch.nn.ReLU(), torch.nn.Linear(dff, d_model_size))\n\n\nclass EncoderLayer(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):\n        super().__init__()\n\n        self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff)\n\n        self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n        self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n\n        self.dropout1 = torch.nn.Dropout(rate)\n        self.dropout2 = torch.nn.Dropout(rate)\n\n    def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            normed,\n            normed,\n            normed,\n            mask,\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\nclass CTRLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            
module.weight.data.fill_(1.0)\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should be passed as input_ids.\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The input_ids which have their past given to this model should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)\n\n        self.w = nn.Embedding(config.vocab_size, config.n_embd)\n\n        self.dropout = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList(\n            [\n                EncoderLayer(config.n_embd, config.n_head, config.dff, config.resid_pdrop, config.output_attentions)\n                for _ in range(config.n_layer)\n            ]\n        )\n        self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def set_input_embeddings(self, new_embeddings):\n        self.w = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to 
speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import CTRLTokenizer, CTRLModel\n        import torch\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLModel.from_pretrained('ctrl')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these 
entirely.\n            attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n            token_type_embeds = self.w(token_type_ids)\n            token_type_embeds *= np.sqrt(self.d_model_size)\n        else:\n            token_type_embeds = 0\n        position_ids = position_ids.view(-1, input_shape[-1])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids)\n        # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded\n        seq_len = input_shape[-1]\n        mask = torch.triu(torch.ones(seq_len + past_length, seq_len + past_length), 1).to(inputs_embeds.device)\n\n        inputs_embeds *= np.sqrt(self.d_model_size)\n\n        pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states)\n\n        output_shape = input_shape + (inputs_embeds.size(-1),)\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n            outputs = h(\n                hidden_states,\n                mask,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = hidden_states.view(*output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLLMHeadModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = CTRLModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import CTRLTokenizer, CTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch DistilBERT model\n    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)\n    and in part from HuggingFace PyTorch version of Google AI Bert model (https://github.com/google-research/bert)\n\"\"\"\n\n\nimport copy\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n\nDISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-german-cased\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\nclass Embeddings(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)\n        if config.sinusoidal_pos_embds:\n            create_sinusoidal_embeddings(\n                n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight\n            )\n\n        self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(self, input_ids):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: torch.tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: torch.tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        seq_length = input_ids.size(1)\n        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)  # (max_seq_length)\n        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (bs, max_seq_length)\n\n        word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, 
dim)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings)  # (bs, max_seq_length, dim)\n        return embeddings\n\n\nclass MultiHeadSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = nn.Dropout(p=config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, query, key, value, mask, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        query: torch.tensor(bs, seq_length, dim)\n        key: torch.tensor(bs, seq_length, dim)\n        value: torch.tensor(bs, seq_length, dim)\n        mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: torch.tensor(bs, seq_length, dim)\n            Contextualized layer. 
Optional: only if `output_attentions=True`\n        \"\"\"\n        bs, q_length, dim = query.size()\n        k_length = key.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshp = (bs, 1, 1, k_length)\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)\n        mask = (mask == 0).view(mask_reshp).expand_as(scores)  # (bs, n_heads, q_length, k_length)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, q_length, k_length)\n\n        weights = nn.Softmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)\n        weights = self.dropout(weights)  # (bs, n_heads, q_length, k_length)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, weights)\n        else:\n            return (context,)\n\n\nclass FFN(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)\n        self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = gelu if config.activation == \"gelu\" else nn.ReLU()\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x)\n        return x\n\n\nclass TransformerBlock(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = MultiHeadSelfAttention(config)\n        self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n        self.ffn = FFN(config)\n        self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n        attn_mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: torch.tensor(bs, seq_length, dim)\n            The output of the transformer block 
contextualization.\n        \"\"\"\n        # Self-Attention\n        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask, head_mask=head_mask)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass Transformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        layer = TransformerBlock(config)\n        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: torch.tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: torch.tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module(x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i])\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass DistilBertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained 
models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    load_tf_weights = None\n    base_model_prefix = \"distilbert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, nn.Embedding):\n            if module.weight.requires_grad:\n                module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.DistilBertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputting raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertModel(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = Embeddings(config)  # Embeddings\n        self.transformer = Transformer(config)  # Encoder\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings.word_embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.transformer.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertModel\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertModel.from_pretrained('distilbert-base-cased')\n\n        
input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)  # (bs, seq_length)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)\n        hidden_state = tfmr_output[0]\n        output = (hidden_state,) + tfmr_output[1:]\n\n        return output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForMaskedLM(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.distilbert = DistilBertModel(config)\n        self.vocab_transform = nn.Linear(config.dim, config.dim)\n        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)\n\n        self.init_weights()\n\n        self.mlm_loss_fct = nn.CrossEntropyLoss()\n\n    def get_output_embeddings(self):\n        return self.vocab_projector\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, masked_lm_labels=None):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForMaskedLM\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        dlbrt_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = dlbrt_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = gelu(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)\n\n        outputs = (prediction_logits,) + dlbrt_output[1:]\n        if masked_lm_labels is not None:\n            mlm_loss = self.mlm_loss_fct(\n                prediction_logits.view(-1, prediction_logits.size(-1)), masked_lm_labels.view(-1)\n            )\n            outputs = (mlm_loss,) + outputs\n\n        return outputs  # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForSequenceClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.pre_classifier = nn.Linear(config.dim, config.dim)\n        self.classifier = nn.Linear(config.dim, config.num_labels)\n        self.dropout = nn.Dropout(config.seq_classif_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForSequenceClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = 
nn.ReLU()(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output)  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        if labels is not None:\n            if self.num_labels == 1:\n                loss_fct = nn.MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = nn.CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForQuestionAnswering(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.distilbert = DistilBertModel(config)\n        self.qa_outputs = nn.Linear(config.dim, config.num_labels)\n        assert config.num_labels == 2\n        self.dropout = nn.Dropout(config.qa_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForQuestionAnswering\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss, start_scores, end_scores = outputs[:3]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n\n        hidden_states = self.dropout(hidden_states)  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)  # (bs, max_query_len)\n        end_logits = end_logits.squeeze(-1)  # (bs, max_query_len)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForTokenClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForTokenClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.distilbert(\n            input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = 
torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_electra.py",
    "content": "import logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import get_activation\nfrom .configuration_electra import ElectraConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertEncoder, BertLayerNorm, BertPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\ndef load_tf_weights_in_electra(model, config, tf_checkpoint_path, discriminator_or_generator=\"discriminator\"):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n    for name, array in zip(names, arrays):\n        original_name: str = name\n\n        try:\n            if isinstance(model, ElectraForMaskedLM):\n                name = name.replace(\"electra/embeddings/\", \"generator/embeddings/\")\n\n            if discriminator_or_generator == \"generator\":\n                name = name.replace(\"electra/\", \"discriminator/\")\n                name = name.replace(\"generator/\", \"electra/\")\n\n            name = name.replace(\"dense_1\", \"dense_prediction\")\n            name = name.replace(\"generator_predictions/output_bias\", \"generator_lm_head/bias\")\n\n            name = name.split(\"/\")\n            # print(original_name, name)\n            # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n            # which are not required for using pretrained model\n            if any(n in [\"global_step\", \"temperature\"] for n in name):\n                logger.info(\"Skipping {}\".format(original_name))\n                continue\n            pointer = model\n            for m_name in name:\n                if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                    scope_names = re.split(r\"_(\\d+)\", m_name)\n                else:\n                    scope_names = [m_name]\n                if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                    pointer = getattr(pointer, \"weight\")\n                elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                    pointer = getattr(pointer, \"bias\")\n                elif scope_names[0] == \"output_weights\":\n                    pointer = getattr(pointer, \"weight\")\n                elif 
scope_names[0] == \"squad\":\n                    pointer = getattr(pointer, \"classifier\")\n                else:\n                    pointer = getattr(pointer, scope_names[0])\n                if len(scope_names) >= 2:\n                    num = int(scope_names[1])\n                    pointer = pointer[num]\n            if m_name.endswith(\"_embeddings\"):\n                pointer = getattr(pointer, \"weight\")\n            elif m_name == \"kernel\":\n                array = np.transpose(array)\n            try:\n                assert pointer.shape == array.shape, original_name\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            print(\"Initialize PyTorch weight {}\".format(name), original_name)\n            pointer.data = torch.from_numpy(array)\n        except AttributeError as e:\n            print(\"Skipping {}\".format(original_name), name, e)\n            continue\n    return model\n\n\nclass ElectraEmbeddings(BertEmbeddings):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass ElectraDiscriminatorPredictions(nn.Module):\n    \"\"\"Prediction module for the discriminator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dense_prediction = nn.Linear(config.hidden_size, 1)\n        self.config = config\n\n    def forward(self, discriminator_hidden_states, attention_mask):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = get_activation(self.config.hidden_act)(hidden_states)\n        logits = self.dense_prediction(hidden_states).squeeze()\n\n        return logits\n\n\nclass ElectraGeneratorPredictions(nn.Module):\n    \"\"\"Prediction module for the generator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = BertLayerNorm(config.embedding_size)\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n\n    def forward(self, generator_hidden_states):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = get_activation(\"gelu\")(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass ElectraPreTrainedModel(BertPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ElectraConfig\n    load_tf_weights = load_tf_weights_in_electra\n    base_model_prefix = \"electra\"\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular 
PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. 
Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraModel(ElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.embeddings = ElectraEmbeddings(config)\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = nn.Linear(config.embedding_size, config.hidden_size)\n\n        self.encoder = BertEncoder(config)\n        self.config = config\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import ElectraModel, ElectraTokenizer\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraModel.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        hidden_states = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states)\n\n        hidden_states = self.encoder(hidden_states, attention_mask=extended_attention_mask, head_mask=head_mask)\n\n        return hidden_states\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"ELECTRA Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForSequenceClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.electra = ElectraModel(config)\n        self.classifier = ElectraClassificationHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n\n        sequence_output = discriminator_hidden_states[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + discriminator_hidden_states[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n         
       #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a binary classification head on top as used during pre-training for identifying generated\n    tokens.\n\n    It is recommended to load the discriminator checkpoint into that model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForPreTraining(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.discriminator_predictions = ElectraDiscriminatorPredictions(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates the token is an original token,\n            ``1`` indicates the token was replaced.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss of the ELECTRA objective.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`)\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForPreTraining\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = 
outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.BCEWithLogitsLoss()\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1, discriminator_sequence_output.shape[1]) == 1\n                active_logits = logits.view(-1, discriminator_sequence_output.shape[1])[active_loss]\n                active_labels = labels[active_loss]\n                loss = loss_fct(active_logits, active_labels.float())\n            else:\n                loss = loss_fct(logits.view(-1, discriminator_sequence_output.shape[1]), labels.float())\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a language modeling head on top.\n\n    Even though both the discriminator and generator may be loaded into this model, the generator is\n    the only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForMaskedLM(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.generator_predictions = ElectraGeneratorPredictions(config)\n\n        self.generator_lm_head = nn.Linear(config.embedding_size, config.vocab_size)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape 
:obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import ElectraTokenizer, ElectraForMaskedLM\n            import torch\n\n            tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n            model = ElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        generator_sequence_output = generator_hidden_states[0]\n\n        prediction_scores = self.generator_predictions(generator_sequence_output)\n        prediction_scores = self.generator_lm_head(prediction_scores)\n\n        output = (prediction_scores,)\n\n        # Masked language modeling softmax layer\n        if masked_lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()  # -100 index = padding token\n            loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            output = (loss,) + output\n\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a token classification head on top.\n\n    Both the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForTokenClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape 
:obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForTokenClassification\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.config.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\nimport logging\nfrom typing import Optional\n\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderModel(PreTrainedModel):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoder` is a generic model class that will be\n        instantiated as a transformer architecture with one of the base model\n        classes of the library as encoder and another one as\n        decoder when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method for the encoder and `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` class method for the decoder.\n    \"\"\"\n    config_class = EncoderDecoderConfig\n    base_model_prefix = \"encoder_decoder\"\n\n    def __init__(\n        self,\n        config: Optional[PretrainedConfig] = None,\n        encoder: Optional[PreTrainedModel] = None,\n        decoder: Optional[PreTrainedModel] = None,\n    ):\n        assert config is not None or (\n            encoder is not None and decoder is not None\n        ), \"Either a configuration or an Encoder and a decoder has to be provided\"\n        if config is None:\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config)\n        else:\n            assert isinstance(config, self.config_class), \"config: {} has to be of type {}\".format(\n                config, self.config_class\n            )\n        # initialize with config\n        super().__init__(config)\n\n        if encoder is None:\n            from transformers import AutoModel\n\n            encoder = AutoModel.from_config(config.encoder)\n\n        if decoder is None:\n            from transformers import AutoModelWithLMHead\n\n            decoder = AutoModelWithLMHead.from_config(config.decoder)\n\n        self.encoder = encoder\n        self.decoder = decoder\n        assert (\n            self.encoder.get_output_embeddings() is None\n        ), \"The encoder {} should not have a LM Head. 
Please use a model without LM Head\"\n\n    def tie_weights(self):\n        # for now no weights tying in encoder-decoder\n        pass\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def get_input_embeddings(self):\n        return self.encoder.get_input_embeddings()\n\n    def get_output_embeddings(self):\n        return self.decoder.get_output_embeddings()\n\n    @classmethod\n    def from_encoder_decoder_pretrained(\n        cls,\n        encoder_pretrained_model_name_or_path: str = None,\n        decoder_pretrained_model_name_or_path: str = None,\n        *model_args,\n        **kwargs\n    ) -> PreTrainedModel:\n        r\"\"\" Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints.\n\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated).\n        To train the model, you need to first set it back in training mode with `model.train()`.\n\n        Params:\n            encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the encoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/encoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the decoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/decoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments.\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n        Examples::\n\n            from transformers1 import EncoderDecoder\n\n            model = EncoderDecoder.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n        \"\"\"\n\n        kwargs_encoder = {\n            argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")\n        }\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        # Load and initialize the encoder and decoder\n        # The distinction between encoder and decoder at the model level is made\n        # by the value of the flag `is_decoder` that we need to set correctly.\n        encoder = kwargs_encoder.pop(\"model\", None)\n        if encoder is None:\n            assert (\n                encoder_pretrained_model_name_or_path is not None\n            ), \"If `model` is not defined as an argument, a `encoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModel\n\n            encoder = AutoModel.from_pretrained(encoder_pretrained_model_name_or_path, *model_args, **kwargs_encoder)\n        encoder.config.is_decoder = False\n\n        decoder = kwargs_decoder.pop(\"model\", None)\n        if decoder is None:\n            assert (\n                decoder_pretrained_model_name_or_path is not None\n            ), \"If `decoder_model` is not defined as an argument, a `decoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModelWithLMHead\n\n            if \"config\" not in kwargs_decoder:\n                from transformers import AutoConfig\n\n                decoder_config = AutoConfig.from_pretrained(decoder_pretrained_model_name_or_path)\n                if decoder_config.is_decoder is False:\n                    logger.info(\n                        f\"Initializing {decoder_pretrained_model_name_or_path} as a decoder model. Cross attention layers are added to {decoder_pretrained_model_name_or_path} and randomly initialized if {decoder_pretrained_model_name_or_path}'s architecture allows for cross attention layers.\"\n                    )\n                    decoder_config.is_decoder = True\n\n                kwargs_decoder[\"config\"] = decoder_config\n\n            if kwargs_decoder[\"config\"].is_decoder is False:\n                logger.warning(\n                    f\"Decoder model {decoder_pretrained_model_name_or_path} is not initialized as a decoder. 
In order to initialize {decoder_pretrained_model_name_or_path} as a decoder, make sure that the attribute `is_decoder` of `decoder_config` passed to `.from_encoder_decoder_pretrained(...)` is set to `True` or do not pass a `decoder_config` to `.from_encoder_decoder_pretrained(...)`\"\n                )\n\n            decoder = AutoModelWithLMHead.from_pretrained(decoder_pretrained_model_name_or_path, **kwargs_decoder)\n\n        return cls(encoder=encoder, decoder=decoder)\n\n    def forward(\n        self,\n        input_ids=None,\n        inputs_embeds=None,\n        attention_mask=None,\n        head_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_head_mask=None,\n        decoder_inputs_embeds=None,\n        masked_lm_labels=None,\n        lm_labels=None,\n        **kwargs,\n    ):\n\n        \"\"\"\n        Args:\n            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n                Indices of input sequence tokens in the vocabulary for the encoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Mask to avoid performing attention on padding token indices for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n                Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n                `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n                Used in the cross-attention of the decoder.\n            decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n                Provide for sequence to sequence training to the decoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for 
details.\n            decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n                Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            decoder_head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the decoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the masked language modeling loss for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the left-to-right language modeling loss (next word prediction) for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            kwargs: (`optional`) Remaining dictionary of keyword arguments. 
Keyword arguments come in two flavors:\n                - Without a prefix which will be input as `**encoder_kwargs` for the encoder forward function.\n                - With a `decoder_` prefix which will be input as `**decoder_kwargs` for the decoder forward function.\n\n        Examples::\n\n            from transformers1 import EncoderDecoderModel, BertTokenizer\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n\n            # forward\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n\n            # training\n            loss, outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)[:2]\n\n            # generation\n            generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)\n\n        \"\"\"\n\n        kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith(\"decoder_\")}\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids,\n                attention_mask=attention_mask,\n                inputs_embeds=inputs_embeds,\n                head_mask=head_mask,\n                **kwargs_encoder,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            inputs_embeds=decoder_inputs_embeds,\n            attention_mask=decoder_attention_mask,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=decoder_head_mask,\n            lm_labels=lm_labels,\n            masked_lm_labels=masked_lm_labels,\n            **kwargs_decoder,\n        )\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if type(past) is tuple:\n            encoder_outputs = past\n        else:\n            encoder_outputs = (past,)\n\n        decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids)\n\n        return {\n            \"attention_mask\": attention_mask,\n            \"decoder_attention_mask\": decoder_inputs[\"attention_mask\"],\n            \"decoder_input_ids\": decoder_inputs[\"input_ids\"],\n            \"encoder_outputs\": encoder_outputs,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # as a default encoder-decoder models do not re-order the past.\n        # TODO(PVP): might have to be updated, e.g. if GPT2 is to be used as a decoder\n        return past\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Flaubert model, based on XLM. \"\"\"\n\n\nimport logging\nimport random\n\nimport torch\nfrom torch.nn import functional as F\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_xlm import (\n    XLMForQuestionAnswering,\n    XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n    get_masks,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"flaubert/flaubert_small_cased\",\n    \"flaubert/flaubert_base_uncased\",\n    \"flaubert/flaubert_base_cased\",\n    \"flaubert/flaubert_large_cased\",\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertModel(XLMModel):\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    @add_start_docstrings_to_callable(FLAUBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            
of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import FlaubertTokenizer, FlaubertModel\n        import torch\n\n        tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')\n        model = FlaubertModel.from_pretrained('flaubert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Le chat mange une pomme.\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # removed: src_enc=None, src_len=None\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + 
self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.config.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](tensor_normalized, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertWithLMHeadModel(XLMWithLMHeadModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMWithLMHeadModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForSequenceClassification(XLMForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnsweringSimple`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnswering(XLMForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT-2 model.\"\"\"\n\n\nimport logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import ACT2FN\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import re\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(gpt2_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array.squeeze())\n\n    for name, array in zip(names, arrays):\n        name = name[6:]  # skip \"model/\"\n        name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"w\" or scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"wpe\" or scope_names[0] == \"wte\":\n                pointer = getattr(pointer, scope_names[0])\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        
pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\n            \"bias\", torch.tril(torch.ones((n_ctx, n_ctx), dtype=torch.uint8)).view(1, 1, n_ctx, n_ctx)\n        )\n        self.register_buffer(\"masked_bias\", torch.tensor(-1e4))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / (float(v.size(-1)) ** 0.5)\n        nd, ns = w.size(-2), w.size(-1)\n        mask = self.bias[:, :, ns - nd : ns, :ns]\n        w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)  # (batch, head, head_features, seq_length)\n        else:\n            return x.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)\n\n    def forward(self, x, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0].transpose(-2, -1), layer_past[1]  # transpose back cf below\n            key = torch.cat((past_key, key), dim=-1)\n            value = torch.cat((past_value, value), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((key.transpose(-2, -1), value))  # transpose to have same shapes for stacking\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT2FN[config.activation_function]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n\n    def forward(self, x, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        output_attn = self.attn(\n            self.ln_1(x),\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n\n        x = x + a\n        m = self.mlp(self.ln_2(x))\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\nclass GPT2PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    load_tf_weights = load_tf_weights_in_gpt2\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nGPT2_START_DOCSTRING = 
r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The `input_ids` which have their past given to this model should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`, defaults to :obj:`None`):\n            `input_ids_length` = `sequence_length if `past` is None else 1\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2Model(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.wte = nn.Embedding(config.vocab_size, config.n_embd)\n        self.wpe = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n        self.ln_f = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def set_input_embeddings(self, new_embeddings):\n        self.wte = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            If `past` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when 
``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import GPT2Tokenizer, GPT2Model\n        import torch\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2Model.from_pretrained('gpt2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n        if position_ids is not None:\n            position_ids = position_ids.view(-1, input_shape[-1])\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the 
raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids)\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_embeds = self.wte(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n            outputs = block(\n                hidden_states,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = hidden_states.view(*output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (presents), (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2LMHeadModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2DoubleHeadsModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        config.num_labels = 1\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. 
you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = 
[tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2\n        mc_token_ids = torch.tensor([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch Longformer model. \"\"\"\n\nimport logging\nimport math\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .configuration_longformer import LongformerConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertPreTrainedModel\nfrom .modeling_roberta import RobertaLMHead, RobertaModel\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n    # See all Longformer models at https://huggingface.co/models?filter=longformer\n]\n\n\ndef _get_question_end_index(input_ids, sep_token_id):\n    \"\"\"\n        Computes the index of the first occurance of `sep_token_id`.\n    \"\"\"\n\n    sep_token_indices = (input_ids == sep_token_id).nonzero()\n    batch_size = input_ids.shape[0]\n\n    assert sep_token_indices.shape[1] == 2, \"`input_ids` should have two dimensions\"\n    assert (\n        sep_token_indices.shape[0] == 3 * batch_size\n    ), f\"There should be exactly three separator tokens: {sep_token_id} in every sample for questions answering. 
You might also consider to set `global_attention_mask` manually in the forward function to avoid this error.\"\n\n    return sep_token_indices.view(batch_size, 3, 2)[:, 0, 1]\n\n\ndef _compute_global_attention_mask(input_ids, sep_token_id, before_sep_token=True):\n    \"\"\"\n        Computes global attention mask by putting attention on all tokens\n        before `sep_token_id` if `before_sep_token is True` else after\n        `sep_token_id`.\n    \"\"\"\n\n    question_end_index = _get_question_end_index(input_ids, sep_token_id)\n    question_end_index = question_end_index.unsqueeze(dim=1)  # size: batch_size x 1\n    # bool attention mask with True in locations of global attention\n    attention_mask = torch.arange(input_ids.shape[1], device=input_ids.device)\n    if before_sep_token is True:\n        attention_mask = (attention_mask.expand_as(input_ids) < question_end_index).to(torch.uint8)\n    else:\n        # last token is separation token and should not be counted and in the middle are two separation tokens\n        attention_mask = (attention_mask.expand_as(input_ids) > (question_end_index + 1)).to(torch.uint8) * (\n            attention_mask.expand_as(input_ids) < input_ids.shape[-1]\n        ).to(torch.uint8)\n\n    return attention_mask\n\n\nclass LongformerSelfAttention(nn.Module):\n    def __init__(self, config, layer_id):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n        self.num_heads = config.num_attention_heads\n        self.head_dim = int(config.hidden_size / config.num_attention_heads)\n        self.embed_dim = config.hidden_size\n\n        self.query = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value = nn.Linear(config.hidden_size, self.embed_dim)\n\n        # separate projection layers for tokens with global attention\n        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)\n\n        self.dropout = config.attention_probs_dropout_prob\n\n        self.layer_id = layer_id\n        attention_window = config.attention_window[self.layer_id]\n        assert (\n            attention_window % 2 == 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}\"\n        assert (\n            attention_window > 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be positive. 
Given {attention_window}\"\n\n        self.one_sided_attention_window_size = attention_window // 2\n\n    @staticmethod\n    def _skew(x, direction):\n        \"\"\"Convert diagonals into columns (or columns into diagonals depending on `direction`)\"\"\"\n        x_padded = F.pad(x, direction)  # padding value is not important because it will be overwritten\n        x_padded = x_padded.view(*x_padded.size()[:-2], x_padded.size(-1), x_padded.size(-2))\n        return x_padded\n\n    @staticmethod\n    def _skew2(x):\n        \"\"\"shift every row 1 step to the right, converting columns into diagonals\"\"\"\n        # X = B x C x M x L\n        B, C, M, L = x.size()\n        x = F.pad(x, (0, M + 1))  # B x C x M x (L+M+1). Padding value is not important because it'll be overwritten\n        x = x.view(B, C, -1)  # B x C x ML+MM+M\n        x = x[:, :, :-M]  # B x C x ML+MM\n        x = x.view(B, C, M, M + L)  # B x C x M x (L+M)\n        x = x[:, :, :, :-1]\n        return x\n\n    @staticmethod\n    def _chunk(x, w):\n        \"\"\"convert into overlapping chunks. Chunk size = 2w, overlap size = w\"\"\"\n\n        # non-overlapping chunks of size = 2w\n        x = x.view(x.size(0), x.size(1) // (w * 2), w * 2, x.size(2))\n\n        # use `as_strided` to make the chunks overlap with an overlap size = w\n        chunk_size = list(x.size())\n        chunk_size[1] = chunk_size[1] * 2 - 1\n\n        chunk_stride = list(x.stride())\n        chunk_stride[1] = chunk_stride[1] // 2\n        return x.as_strided(size=chunk_size, stride=chunk_stride)\n\n    def _mask_invalid_locations(self, input_tensor, w) -> torch.Tensor:\n        affected_seqlen = w\n        beginning_mask_2d = input_tensor.new_ones(w, w + 1).tril().flip(dims=[0])\n        beginning_mask = beginning_mask_2d[None, :, None, :]\n        ending_mask = beginning_mask.flip(dims=(1, 3))\n        seqlen = input_tensor.size(1)\n        beginning_input = input_tensor[:, :affected_seqlen, :, : w + 1]\n        beginning_mask = beginning_mask[:, :seqlen].expand(beginning_input.size())\n        beginning_input.masked_fill_(beginning_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n        ending_input = input_tensor[:, -affected_seqlen:, :, -(w + 1) :]\n        ending_mask = ending_mask[:, -seqlen:].expand(ending_input.size())\n        ending_input.masked_fill_(ending_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n\n    def _sliding_chunks_matmul_qk(self, q: torch.Tensor, k: torch.Tensor, w: int):\n        \"\"\"Matrix multiplication of query x key tensors with a sliding window attention pattern.\n        This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)\n        with an overlap of size w\"\"\"\n        batch_size, seqlen, num_heads, head_dim = q.size()\n        assert seqlen % (w * 2) == 0, f\"Sequence length should be multiple of {w * 2}. 
Given {seqlen}\"\n        assert q.size() == k.size()\n\n        chunks_count = seqlen // w - 1\n\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2\n        q = q.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n        k = k.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        chunk_q = self._chunk(q, w)\n        chunk_k = self._chunk(k, w)\n\n        # matrix multiplication\n        # bcxd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcyd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcxy: batch_size * num_heads x chunks x 2w x 2w\n        chunk_attn = torch.einsum(\"bcxd,bcyd->bcxy\", (chunk_q, chunk_k))  # multiply\n\n        # convert diagonals into columns\n        diagonal_chunk_attn = self._skew(chunk_attn, direction=(0, 0, 0, 1))\n\n        # allocate space for the overall attention matrix where the chunks are combined. The last dimension\n        # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to\n        # w previous words). The following column is the attention score from each word to itself, then\n        # followed by w columns for the upper triangle.\n\n        diagonal_attn = diagonal_chunk_attn.new_empty((batch_size * num_heads, chunks_count + 1, w, w * 2 + 1))\n\n        # copy parts from diagonal_chunk_attn into the combined matrix of attentions\n        # - copying the main diagonal and the upper triangle\n        diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, : w + 1]\n        diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, : w + 1]\n        # - copying the lower triangle\n        diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, -(w + 1) : -1, w + 1 :]\n        diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, : w - 1, 1 - w :]\n\n        # separate batch_size and num_heads dimensions again\n        diagonal_attn = diagonal_attn.view(batch_size, num_heads, seqlen, 2 * w + 1).transpose(2, 1)\n\n        self._mask_invalid_locations(diagonal_attn, w)\n        return diagonal_attn\n\n    def _sliding_chunks_matmul_pv(self, prob: torch.Tensor, v: torch.Tensor, w: int):\n        \"\"\"Same as _sliding_chunks_matmul_qk but for prob and value tensors. 
It is expecting the same output\n        format from _sliding_chunks_matmul_qk\"\"\"\n        batch_size, seqlen, num_heads, head_dim = v.size()\n        assert seqlen % (w * 2) == 0\n        assert prob.size()[:3] == v.size()[:3]\n        assert prob.size(3) == 2 * w + 1\n        chunks_count = seqlen // w - 1\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size 2w\n        chunk_prob = prob.transpose(1, 2).reshape(batch_size * num_heads, seqlen // w, w, 2 * w + 1)\n\n        # group batch_size and num_heads dimensions into one\n        v = v.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        # pad seqlen with w at the beginning of the sequence and another w at the end\n        padded_v = F.pad(v, (0, 0, w, w), value=-1)\n\n        # chunk padded_v into chunks of size 3w and an overlap of size w\n        chunk_v_size = (batch_size * num_heads, chunks_count + 1, 3 * w, head_dim)\n        chunk_v_stride = padded_v.stride()\n        chunk_v_stride = chunk_v_stride[0], w * chunk_v_stride[1], chunk_v_stride[1], chunk_v_stride[2]\n        chunk_v = padded_v.as_strided(size=chunk_v_size, stride=chunk_v_stride)\n\n        skewed_prob = self._skew2(chunk_prob)\n\n        context = torch.einsum(\"bcwd,bcdh->bcwh\", (skewed_prob, chunk_v))\n        return context.view(batch_size, num_heads, seqlen, head_dim).transpose(1, 2)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        \"\"\"\n        LongformerSelfAttention expects `len(hidden_states)` to be multiple of `attention_window`.\n        Padding to `attention_window` happens in LongformerModel.forward to avoid redoing the padding on each layer.\n\n        The `attention_mask` is changed in `BertModel.forward` from 0, 1, 2 to\n            -ve: no attention\n              0: local attention\n            +ve: global attention\n\n        `encoder_hidden_states` and `encoder_attention_mask` are not supported and should be None\n        \"\"\"\n        # TODO: add support for `encoder_hidden_states` and `encoder_attention_mask`\n        assert encoder_hidden_states is None, \"`encoder_hidden_states` is not supported and should be None\"\n        assert encoder_attention_mask is None, \"`encoder_attention_mask` is not supported and shiould be None\"\n\n        if attention_mask is not None:\n            attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)\n            key_padding_mask = attention_mask < 0\n            extra_attention_mask = attention_mask > 0\n            remove_from_windowed_attention_mask = attention_mask != 0\n\n            num_extra_indices_per_batch = extra_attention_mask.long().sum(dim=1)\n            max_num_extra_indices_per_batch = num_extra_indices_per_batch.max()\n            if max_num_extra_indices_per_batch <= 0:\n                extra_attention_mask = None\n            else:\n                # To support the case of variable number of global attention in the rows of a batch,\n                # we use the following three selection masks to select global attention embeddings\n                # in a 3d tensor and pad it to `max_num_extra_indices_per_batch`\n                # 1) selecting embeddings that correspond to global attention\n                extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)\n                zero_to_max_range = torch.arange(\n                   
 0, max_num_extra_indices_per_batch, device=num_extra_indices_per_batch.device\n                )\n                # mask indicating which values are actually going to be padding\n                selection_padding_mask = zero_to_max_range < num_extra_indices_per_batch.unsqueeze(dim=-1)\n                # 2) location of the non-padding values in the selected global attention\n                selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)\n                # 3) location of the padding values in the selected global attention\n                selection_padding_mask_zeros = (selection_padding_mask == 0).nonzero(as_tuple=True)\n        else:\n            remove_from_windowed_attention_mask = None\n            extra_attention_mask = None\n            key_padding_mask = None\n\n        hidden_states = hidden_states.transpose(0, 1)\n        seqlen, batch_size, embed_dim = hidden_states.size()\n        assert embed_dim == self.embed_dim\n        q = self.query(hidden_states)\n        k = self.key(hidden_states)\n        v = self.value(hidden_states)\n        q /= math.sqrt(self.head_dim)\n\n        q = q.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        k = k.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        # attn_weights = (batch_size, seqlen, num_heads, window*2+1)\n        attn_weights = self._sliding_chunks_matmul_qk(q, k, self.one_sided_attention_window_size)\n        self._mask_invalid_locations(attn_weights, self.one_sided_attention_window_size)\n        if remove_from_windowed_attention_mask is not None:\n            # This implementation is fast and takes very little memory because num_heads x hidden_size = 1\n            # from (batch_size x seqlen) to (batch_size x seqlen x num_heads x hidden_size)\n            remove_from_windowed_attention_mask = remove_from_windowed_attention_mask.unsqueeze(dim=-1).unsqueeze(\n                dim=-1\n            )\n            # cast to fp32/fp16 then replace 1's with -inf\n            float_mask = remove_from_windowed_attention_mask.type_as(q).masked_fill(\n                remove_from_windowed_attention_mask, -10000.0\n            )\n            ones = float_mask.new_ones(size=float_mask.size())  # tensor of ones\n            # diagonal mask with zeros everywhere and -inf inplace of padding\n            d_mask = self._sliding_chunks_matmul_qk(ones, float_mask, self.one_sided_attention_window_size)\n            attn_weights += d_mask\n        assert list(attn_weights.size()) == [\n            batch_size,\n            seqlen,\n            self.num_heads,\n            self.one_sided_attention_window_size * 2 + 1,\n        ]\n\n        # the extra attention\n        if extra_attention_mask is not None:\n            selected_k = k.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_k[selection_padding_mask_nonzeros] = k[extra_attention_mask_nonzeros]\n            # (batch_size, seqlen, num_heads, max_num_extra_indices_per_batch)\n            selected_attn_weights = torch.einsum(\"blhd,bshd->blhs\", (q, selected_k))\n            selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000\n            # concat to attn_weights\n            # (batch_size, seqlen, num_heads, extra attention count + 2*window+1)\n            attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)\n\n        attn_weights_fp32 = F.softmax(attn_weights, dim=-1, 
dtype=torch.float32)  # use fp32 for numerical stability\n        attn_weights = attn_weights_fp32.type_as(attn_weights)\n\n        if key_padding_mask is not None:\n            # softmax sometimes inserts NaN if all positions are masked, replace them with 0\n            attn_weights = torch.masked_fill(attn_weights, key_padding_mask.unsqueeze(-1).unsqueeze(-1), 0.0)\n\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)\n        v = v.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        attn = None\n        if extra_attention_mask is not None:\n            selected_attn_probs = attn_probs.narrow(-1, 0, max_num_extra_indices_per_batch)\n            selected_v = v.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_v[selection_padding_mask_nonzeros] = v[extra_attention_mask_nonzeros]\n            # use `matmul` because `einsum` crashes sometimes with fp16\n            # attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v))\n            attn = torch.matmul(selected_attn_probs.transpose(1, 2), selected_v.transpose(1, 2)).transpose(1, 2)\n            attn_probs = attn_probs.narrow(\n                -1, max_num_extra_indices_per_batch, attn_probs.size(-1) - max_num_extra_indices_per_batch\n            ).contiguous()\n        if attn is None:\n            attn = self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n        else:\n            attn += self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n\n        assert attn.size() == (batch_size, seqlen, self.num_heads, self.head_dim), \"Unexpected size\"\n        attn = attn.transpose(0, 1).reshape(seqlen, batch_size, embed_dim).contiguous()\n\n        # For this case, we'll just recompute the attention for these indices\n        # and overwrite the attn tensor.\n        # TODO: remove the redundant computation\n        if extra_attention_mask is not None:\n            selected_hidden_states = hidden_states.new_zeros(max_num_extra_indices_per_batch, batch_size, embed_dim)\n            selected_hidden_states[selection_padding_mask_nonzeros[::-1]] = hidden_states[\n                extra_attention_mask_nonzeros[::-1]\n            ]\n\n            q = self.query_global(selected_hidden_states)\n            k = self.key_global(hidden_states)\n            v = self.value_global(hidden_states)\n            q /= math.sqrt(self.head_dim)\n\n            q = (\n                q.contiguous()\n                .view(max_num_extra_indices_per_batch, batch_size * self.num_heads, self.head_dim)\n                .transpose(0, 1)\n            )  # (batch_size * self.num_heads, max_num_extra_indices_per_batch, head_dim)\n            k = (\n                k.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            v = (\n                v.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            attn_weights = torch.bmm(q, k.transpose(1, 2))\n            assert list(attn_weights.size()) == [batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen]\n\n            attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights[selection_padding_mask_zeros[0], :, selection_padding_mask_zeros[1], :] 
= -10000.0\n            if key_padding_mask is not None:\n                attn_weights = attn_weights.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -10000.0,)\n            attn_weights = attn_weights.view(batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights_float = F.softmax(\n                attn_weights, dim=-1, dtype=torch.float32\n            )  # use fp32 for numerical stability\n            attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)\n            selected_attn = torch.bmm(attn_probs, v)\n            assert list(selected_attn.size()) == [\n                batch_size * self.num_heads,\n                max_num_extra_indices_per_batch,\n                self.head_dim,\n            ]\n\n            selected_attn_4d = selected_attn.view(\n                batch_size, self.num_heads, max_num_extra_indices_per_batch, self.head_dim\n            )\n            nonzero_selected_attn = selected_attn_4d[\n                selection_padding_mask_nonzeros[0], :, selection_padding_mask_nonzeros[1]\n            ]\n            attn[extra_attention_mask_nonzeros[::-1]] = nonzero_selected_attn.view(\n                len(selection_padding_mask_nonzeros[0]), -1\n            )\n\n        context_layer = attn.transpose(0, 1)\n        if self.output_attentions:\n            if extra_attention_mask is not None:\n                # With global attention, return global attention probabilities only\n                # batch_size x num_heads x max_num_global_attention_tokens x sequence_length\n                # which is the attention weights from tokens with global attention to all tokens\n                # It doesn't not return local attention\n                # In case of variable number of global attantion in the rows of a batch,\n                # attn_weights are padded with -10000.0 attention scores\n                attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            else:\n                # without global attention, return local attention probabilities\n                # batch_size x num_heads x sequence_length x window_size\n                # which is the attention weights of every token attending to its neighbours\n                attn_weights = attn_weights.permute(0, 2, 1, 3)\n        outputs = (context_layer, attn_weights) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nLONGFORMER_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.LongformerConfig`): Model configuration class with all the parameters of the\n            model. 
Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nLONGFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.LongformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n\n        global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to decide the attention given on each token, local attention or global attention.\n            Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for\n            task-specific finetuning because it makes the model more flexible at representing the task. For example,\n            for classification, the <s> token should be given global attention. For QA, all question tokens should also have\n            global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details.\n            Mask values selected in ``[0, 1]``:\n            ``0`` for local attention (a sliding window attention),\n            ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).\n\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence token in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Longformer Model outputting raw hidden-states without any specific head on top.\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel` to provide the ability to process\n    long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer`_by\n    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention combines a local (sliding window)\n    and global attention to extend to long documents without the O(n^2) increase in memory and compute.\n\n    The selfattention module `LongformerSelfAttention` implemented here supports the combination of local and\n    global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive\n    and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream\n    tasks. Future release will add support for autoregressive attention, but the support for dilated attention\n    requires a custom CUDA kernel to be memory and compute efficient.\n\n    .. _`Longformer: the Long-Document Transformer`:\n        https://arxiv.org/abs/2004.05150\n\n    \"\"\"\n\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if isinstance(config.attention_window, int):\n            assert config.attention_window % 2 == 0, \"`config.attention_window` has to be an even value\"\n            assert config.attention_window > 0, \"`config.attention_window` has to be positive\"\n            config.attention_window = [config.attention_window] * config.num_hidden_layers  # one value per layer\n        else:\n            assert len(config.attention_window) == config.num_hidden_layers, (\n                \"`len(config.attention_window)` should equal `config.num_hidden_layers`. \"\n                f\"Expected {config.num_hidden_layers}, given {len(config.attention_window)}\"\n            )\n\n        for i, layer in enumerate(self.encoder.layer):\n            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`\n            layer.attention.self = LongformerSelfAttention(config, layer_id=i)\n\n        self.init_weights()\n\n    def _pad_to_window_size(\n        self,\n        input_ids: torch.Tensor,\n        attention_mask: torch.Tensor,\n        token_type_ids: torch.Tensor,\n        position_ids: torch.Tensor,\n        inputs_embeds: torch.Tensor,\n        attention_window: int,\n        pad_token_id: int,\n    ):\n        \"\"\"A helper function to pad tokens and mask to work with implementation of Longformer selfattention.\"\"\"\n\n        assert attention_window % 2 == 0, f\"`attention_window` should be an even value. 
Given {attention_window}\"\n        input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape\n        batch_size, seqlen = input_shape[:2]\n\n        padding_len = (attention_window - seqlen % attention_window) % attention_window\n        if padding_len > 0:\n            logger.info(\n                \"Input ids are automatically padded from {} to {} to be a multiple of `config.attention_window`: {}\".format(\n                    seqlen, seqlen + padding_len, attention_window\n                )\n            )\n            if input_ids is not None:\n                input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)\n            if attention_mask is not None:\n                attention_mask = F.pad(\n                    attention_mask, (0, padding_len), value=False\n                )  # no attention on the padding tokens\n            if token_type_ids is not None:\n                token_type_ids = F.pad(token_type_ids, (0, padding_len), value=0)  # pad with token_type_id = 0\n            if position_ids is not None:\n                # pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings\n                position_ids = F.pad(position_ids, (0, padding_len), value=pad_token_id)\n            if inputs_embeds is not None:\n                input_ids_padding = inputs_embeds.new_full(\n                    (batch_size, padding_len), self.config.pad_token_id, dtype=torch.long,\n                )\n                inputs_embeds_padding = self.embeddings(input_ids_padding)\n                inputs_embeds = torch.cat([inputs_embeds, inputs_embeds_padding], dim=-2)\n\n        return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        
import torch\n        from transformers1 import LongformerModel, LongformerTokenizer\n\n        model = LongformerModel.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention\n        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention\n        attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,\n                                            # classification: the <s> token\n                                            # QA: question tokens\n                                            # LM: potentially on the beginning of sentences and paragraphs\n        sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)\n        \"\"\"\n\n        # padding\n        attention_window = (\n            self.config.attention_window\n            if isinstance(self.config.attention_window, int)\n            else max(self.config.attention_window)\n        )\n\n        # merge `global_attention_mask` and `attention_mask`\n        if global_attention_mask is not None:\n            # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)\n            # (global_attention_mask + 1) => 1 for local attention, 2 for global attention\n            # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention\n            if attention_mask is not None:\n                attention_mask = attention_mask * (global_attention_mask + 1)\n            else:\n                # simply use `global_attention_mask` as `attention_mask`\n                # if no `attention_mask` is given\n                attention_mask = global_attention_mask + 1\n\n        padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            attention_window=attention_window,\n            pad_token_id=self.config.pad_token_id,\n        )\n\n        # embed\n        output = super().forward(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=None,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n        )\n\n        # undo padding\n        if padding_len > 0:\n            # `output` has the following tensors: sequence_output, pooled_output, (hidden_states), (attentions)\n            # `sequence_output`: unpad because the calling function is expecting a length == input_ids.size(1)\n            # `pooled_output`: independent of the sequence length\n            # `hidden_states`: mainly used for debugging and analysis, so keep the padding\n            # `attentions`: mainly used for debugging and analysis, so keep the padding\n            output = output[0][:, :-padding_len], *output[1:]\n\n        return 
output\n\n\n@add_start_docstrings(\"\"\"Longformer Model with a `language modeling` head on top. \"\"\", LONGFORMER_START_DOCSTRING)\nclass LongformerForMaskedLM(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import LongformerForMaskedLM, LongformerTokenizer\n\n        model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! 
'] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        attention_mask = None  # default is local attention everywhere, which is a good choice for MaskedLM\n                               # check ``LongformerModel.forward`` for more details how to set `attention_mask`\n        loss, prediction_scores = model(input_ids, attention_mask=attention_mask, masked_lm_labels=input_ids)\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForSequenceClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.classifier = LongformerClassificationHead(config)\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of 
each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForSequenceClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on CLS token...\")\n            global_attention_mask = torch.zeros_like(input_ids)\n            # global attention on cls token\n            global_attention_mask[:, 0] = 1\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\nclass LongformerClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, hidden_states, **kwargs):\n        hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        hidden_states = torch.tanh(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        output = self.out_proj(hidden_states)\n        return output\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForQuestionAnswering(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForQuestionAnswering\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n        model = 
LongformerForQuestionAnswering.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text, return_tensors=\"pt\")\n        input_ids = encoding[\"input_ids\"]\n\n        # default is local attention everywhere\n        # the forward method will automatically set global attention on question tokens\n        attention_mask = encoding[\"attention_mask\"]\n\n        start_scores, end_scores = model(input_ids, attention_mask=attention_mask)\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())\n\n        answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]\n        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens)) # remove space prepending space token\n\n        \"\"\"\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on question tokens...\")\n            # put global attention on all tokens until `config.sep_token_id` is reached\n            global_attention_mask = _compute_global_attention_mask(input_ids, self.config.sep_token_id)\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForTokenClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForTokenClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForTokenClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = 
self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForMultipleChoice(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        labels=None,\n        position_ids=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForMultipleChoice\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForMultipleChoice.from_pretrained('allenai/longformer-base-4096')\n        # context = \"The dog is cute\" | choice = \"the dog\" / \"the cat\"\n        choices = [(\"The dog is cute\", \"the dog\"), (\"The dog is cute\", \"the cat\")]\n        input_ids = torch.tensor([tokenizer.encode(s[0], s[1], add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        # global attention is automatically put on \"the dog\" and \"the cat\"\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on multiple choice...\")\n            # put global attention on all tokens after `config.sep_token_id`\n            global_attention_mask = torch.stack(\n                [\n                    _compute_global_attention_mask(input_ids[:, i], self.config.sep_token_id, before_sep_token=False)\n                    for i in range(num_choices)\n                ],\n                dim=1,\n            )\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_global_attention_mask = (\n            global_attention_mask.view(-1, global_attention_mask.size(-1))\n            if global_attention_mask is not None\n            else None\n        )\n\n        outputs = self.longformer(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            global_attention_mask=flat_global_attention_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + 
outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 Marian Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MarianMTModel model, ported from the Marian C++ repo.\"\"\"\n\n\nfrom .modeling_bart import BartForConditionalGeneration\n\n\nMARIAN_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Marian models at https://huggingface.co/models?search=Helsinki-NLP\n]\n\n\nclass MarianMTModel(BartForConditionalGeneration):\n    r\"\"\"\n    Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.\n    Model API is identical to BartForConditionalGeneration.\n    Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__\n\n    Examples::\n\n        from transformers1 import MarianTokenizer, MarianMTModel\n        from typing import List\n        src = 'fr'  # source language\n        trg = 'en'  # target language\n        sample_text = \"où est l'arrêt de bus ?\"\n        mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'\n\n        model = MarianMTModel.from_pretrained(mname)\n        tok = MarianTokenizer.from_pretrained(mname)\n        batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference\n        gen = model.generate(**batch)  # for forward pass: model(**batch)\n        words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns \"Where is the the bus stop ?\"\n\n    \"\"\"\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        logits[:, self.config.pad_token_id] = float(\"-inf\")\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MMBT model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .file_utils import add_start_docstrings\nfrom .modeling_utils import ModuleUtilsMixin\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModalEmbeddings(nn.Module):\n    \"\"\"Generic Modal Embeddings which takes in an encoder, and a transformer embedding.\n    \"\"\"\n\n    def __init__(self, config, encoder, embeddings):\n        super().__init__()\n        self.config = config\n        self.encoder = encoder\n        self.proj_embeddings = nn.Linear(config.modal_hidden_size, config.hidden_size)\n        self.position_embeddings = embeddings.position_embeddings\n        self.token_type_embeddings = embeddings.token_type_embeddings\n        self.word_embeddings = embeddings.word_embeddings\n        self.LayerNorm = embeddings.LayerNorm\n        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)\n\n    def forward(self, input_modal, start_token=None, end_token=None, position_ids=None, token_type_ids=None):\n        token_embeddings = self.proj_embeddings(self.encoder(input_modal))\n        seq_length = token_embeddings.size(1)\n\n        if start_token is not None:\n            start_token_embeds = self.word_embeddings(start_token)\n            seq_length += 1\n            token_embeddings = torch.cat([start_token_embeds.unsqueeze(1), token_embeddings], dim=1)\n\n        if end_token is not None:\n            end_token_embeds = self.word_embeddings(end_token)\n            seq_length += 1\n            token_embeddings = torch.cat([token_embeddings, end_token_embeds.unsqueeze(1)], dim=1)\n\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_modal.device)\n            position_ids = position_ids.unsqueeze(0).expand(input_modal.size(0), seq_length)\n\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(\n                (input_modal.size(0), seq_length), dtype=torch.long, device=input_modal.device\n            )\n\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = token_embeddings + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nMMBT_START_DOCSTRING = r\"\"\"    MMBT model was proposed in\n    `Supervised Multimodal Bitransformers for Classifying Images and Text`_\n    by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.\n    It's a supervised multimodal bitransformer model that fuses information from text and other image encoders,\n    and obtain state-of-the-art performance on various multimodal classification benchmark 
tasks.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Supervised Multimodal Bitransformers for Classifying Images and Text`:\n        https://github.com/facebookresearch/mmbt\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.MMBTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n        transformer (:class: `~nn.Module`): A text transformer that is used by MMBT.\n            It should have embeddings, encoder, and pooler attributes.\n        encoder (:class: `~nn.Module`): Encoder for the second modality.\n            It should take in a batch of modal inputs and return k, n dimension embeddings.\n\"\"\"\n\nMMBT_INPUTS_DOCSTRING = r\"\"\"    Inputs:\n        **input_modal**: ``torch.FloatTensor`` of shape ``(batch_size, ***)``:\n            The other modality data. It will be the shape that the encoder for that type expects.\n            e.g. With an Image Encoder, the shape would be (batch_size, channels, height, width)\n        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of input sequence tokens in the vocabulary.\n            It does not expect [CLS] token to be added as it's appended to the end of other modality embeddings.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        **modal_start_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional start token to be added to Other Modality Embedding. [CLS] Most commonly used for Classification tasks.\n        **modal_end_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional end token to be added to Other Modality Embedding. 
[SEP] Most commonly used.\n        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Segment token indices to indicate different portions of the inputs.\n        **modal_token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Segment token indices to indicate different portions of the non-text modality.\n            The embeddings from these tokens will be summed with the respective token embeddings for the non-text modality.\n        **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings.\n        **modal_position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings for the non-text modality.\n        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        **inputs_embeds**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, embedding_dim)``:\n            Optionally, instead of passing ``input_ids`` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        **encoder_hidden_states**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``:\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model\n            is configured as a decoder.\n        **encoder_attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on the padding token indices of the encoder input. 
This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare MMBT Model outputting raw hidden-states without any specific head on top.\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\"\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``\n                Sequence of hidden-states at the output of the last layer of the model.\n            **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Bert pretraining. This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            mmbt = MMBTModel(config, transformer, encoder)\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.config = config\n        self.transformer = transformer\n        self.modal_encoder = ModalEmbeddings(config, encoder, transformer.embeddings)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_txt_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_txt_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        modal_embeddings = self.modal_encoder(\n            input_modal,\n            start_token=modal_start_tokens,\n            end_token=modal_end_tokens,\n            position_ids=modal_position_ids,\n            token_type_ids=modal_token_type_ids,\n        )\n\n        input_modal_shape = modal_embeddings.size()[:-1]\n\n        if token_type_ids is None:\n            token_type_ids = torch.ones(input_txt_shape, dtype=torch.long, device=device)\n\n        txt_embeddings = self.transformer.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        embedding_output = torch.cat([modal_embeddings, txt_embeddings], 1)\n\n        input_shape = embedding_output.size()[:-1]\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        else:\n            attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device, dtype=torch.long), attention_mask], dim=1\n            )\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(input_shape, device=device)\n        else:\n            encoder_attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device), encoder_attention_mask], dim=1\n            )\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, self.device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        encoder_outputs = self.transformer.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.transformer.pooler(sequence_output)\n\n        outputs = (sequence_output, 
pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\n    \"\"\"MMBT Model with a sequence classification/regression head on top (a linear layer on top of\n                      the pooled output)\"\"\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTForClassification(nn.Module):\n    r\"\"\"\n            **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in ``[0, ..., config.num_labels - 1]``.\n                If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n                If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n                Classification (or regression if config.num_labels==1) loss.\n            **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            model = MMBTForClassification(config, transformer, encoder)\n            outputs = model(input_modal, input_ids, labels=labels)\n            loss, logits = outputs[:2]\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.num_labels = config.num_labels\n\n        self.mmbt = MMBTModel(config, transformer, encoder)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n\n        outputs = self.mmbt(\n            input_modal=input_modal,\n            input_ids=input_ids,\n            modal_start_tokens=modal_start_tokens,\n            modal_end_tokens=modal_end_tokens,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            modal_token_type_ids=modal_token_type_ids,\n            position_ids=position_ids,\n            modal_position_ids=modal_position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT model.\"\"\"\n\n\nimport json\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu_new, swish\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path):\n    \"\"\" Load tf pre-trained weights in a pytorch model (from NumPy arrays here)\n    \"\"\"\n    import re\n    import numpy as np\n\n    if \".ckpt\" in openai_checkpoint_folder_path:\n        openai_checkpoint_folder_path = os.path.dirname(openai_checkpoint_folder_path)\n\n    logger.info(\"Loading weights from {}\".format(openai_checkpoint_folder_path))\n\n    with open(openai_checkpoint_folder_path + \"/parameters_names.json\", \"r\", encoding=\"utf-8\") as names_handle:\n        names = json.load(names_handle)\n    with open(openai_checkpoint_folder_path + \"/params_shapes.json\", \"r\", encoding=\"utf-8\") as shapes_handle:\n        shapes = json.load(shapes_handle)\n    offsets = np.cumsum([np.prod(shape) for shape in shapes])\n    init_params = [np.load(openai_checkpoint_folder_path + \"/params_{}.npy\".format(n)) for n in range(10)]\n    init_params = np.split(np.concatenate(init_params, 0), offsets)[:-1]\n    init_params = [param.reshape(shape) for param, shape in zip(init_params, shapes)]\n\n    # This was used when we had a single embedding matrix for positions and tokens\n    # init_params[0] = np.concatenate([init_params[1], init_params[0]], 0)\n    # model init_params[1]\n    init_params = [arr.squeeze() for arr in init_params]\n\n    try:\n        assert model.tokens_embed.weight.shape == init_params[1].shape\n        assert model.positions_embed.weight.shape == init_params[0].shape\n    except AssertionError as e:\n        e.args += (model.tokens_embed.weight.shape, init_params[1].shape)\n        e.args += (model.positions_embed.weight.shape, init_params[0].shape)\n        raise\n\n    model.tokens_embed.weight.data = torch.from_numpy(init_params[1])\n    model.positions_embed.weight.data = torch.from_numpy(init_params[0])\n    names.pop(0)\n    # Pop position and token embedding arrays\n    init_params.pop(0)\n    init_params.pop(0)\n\n    for name, array in zip(names, init_params):  # names[1:n_transfer], init_params[1:n_transfer]):\n        name = name[6:]  # skip \"model/\"\n        assert name[-2:] == \":0\"\n        name = name[:-2]\n        
name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"w\":\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nACT_FNS = {\"relu\": nn.ReLU, \"swish\": swish, \"gelu\": gelu_new}\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\"bias\", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.output_attentions = config.output_attentions\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / math.sqrt(v.size(-1))\n        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implem method: mask_attn_weights\n        # XD: self.b may be larger than w, so we need to crop it\n        b = self.bias[:, :, : w.size(-2), : w.size(-1)]\n        w = w * b + -1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the 
attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)\n        else:\n            return x.permute(0, 2, 1, 3)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT_FNS[config.afn]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        attn_outputs = self.attn(x, attention_mask=attention_mask, head_mask=head_mask)\n        a = attn_outputs[0]\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n)\n        h = self.ln_2(n + m)\n\n        outputs = [h] + attn_outputs[1:]\n        return outputs\n\n\nclass OpenAIGPTPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    load_tf_weights = load_tf_weights_in_openai_gpt\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif 
isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.OpenAIGPTTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputting raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd)\n        self.positions_embed = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def set_input_embeddings(self, new_embeddings):\n        self.tokens_embed = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the 
self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            # Code is different from when we had a single embedding matrice from position and token embeddings\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(input_shape[-1], dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids)\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))\n            token_type_embeds = self.tokens_embed(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        all_attentions = ()\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(hidden_states.view(*output_shape),)\n\n            outputs = block(hidden_states, attention_mask, head_mask[i])\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n        outputs = (hidden_states.view(*output_shape),)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = transformer_outputs[0]\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        config.num_labels = 1\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. 
(see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTDoubleHeadsModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})  # Add a [CLS] to the vocabulary (we should train it also!)\n        model.resize_token_embeddings(len(tokenizer))\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        mc_token_ids = torch.tensor([input_ids.size(-1)-1, input_ids.size(-1)-1]).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = 
transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch REFORMER model. \"\"\"\n\nimport logging\nimport sys\nfrom collections import namedtuple\nfrom functools import reduce\nfrom operator import mul\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.autograd.function import Function\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu, gelu_fast, gelu_new, swish\nfrom .configuration_reformer import ReformerConfig\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, apply_chunking_to_forward\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/reformer-crime-and-punishment\",\n    \"google/reformer-enwik8\",\n    # See all Reformer models at https://huggingface.co/models?filter=reformer\n]\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\n    \"gelu\": gelu,\n    \"relu\": torch.nn.functional.relu,\n    \"swish\": swish,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n    \"mish\": mish,\n}\n\n\n# Define named tuples for nn.Modules here\nLSHSelfAttentionOutput = namedtuple(\"LSHSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nLocalSelfAttentionOutput = namedtuple(\"LocalSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\"])\nAttentionOutput = namedtuple(\"AttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nReformerOutput = namedtuple(\"ReformerOutput\", [\"hidden_states\", \"attn_output\", \"attention_probs\", \"buckets\"])\nReformerBackwardOutput = namedtuple(\n    \"ReformerBackwardOutput\", [\"attn_output\", \"hidden_states\", \"grad_attn_output\", \"grad_hidden_states\"]\n)\nReformerEncoderOutput = namedtuple(\"ReformerEncoderOutput\", [\"hidden_states\", \"all_hidden_states\", \"all_attentions\"])\n\n\ndef _get_least_common_mult_chunk_len(config):\n    attn_types = config.attn_layers\n    attn_types_set = set(attn_types)\n    if len(attn_types_set) == 1 and attn_types[0] == \"lsh\":\n        return config.lsh_attn_chunk_length\n    elif len(attn_types_set) == 1 and attn_types[0] == \"local\":\n        return config.local_attn_chunk_length\n    elif len(attn_types_set) == 2 and attn_types_set == set([\"lsh\", \"local\"]):\n        return np.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length)\n    else:\n        raise NotImplementedError(\n            \"Only attn layer types 'lsh' and 'local' exist, but `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                config.attn_layers\n            )\n        )\n\n\nclass AxialPositionEmbeddings(nn.Module):\n    \"\"\"Constructs axial position embeddings. 
Useful for very long input\n    sequences to save memory and time.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.axial_pos_shape = config.axial_pos_shape\n        self.axial_pos_embds_dim = config.axial_pos_embds_dim\n        self.dropout = config.hidden_dropout_prob\n\n        self.least_common_mult_chunk_length = _get_least_common_mult_chunk_len(config)\n        self.weights = nn.ParameterList()\n\n        assert (\n            sum(self.axial_pos_embds_dim) == config.hidden_size\n        ), \"Make sure that config.axial_pos_embds factors: {} sum to config.hidden_size: {}\".format(\n            self.axial_pos_embds_dim, config.hidden_size\n        )\n\n        # create weights\n        for axis, axial_pos_embd_dim in enumerate(self.axial_pos_embds_dim):\n            # create expanded shapes\n            ax_shape = [1] * len(self.axial_pos_shape)\n            ax_shape[axis] = self.axial_pos_shape[axis]\n            ax_shape = tuple(ax_shape) + (axial_pos_embd_dim,)\n\n            # create tensor and init\n            self.weights.append(nn.Parameter(torch.ones(ax_shape, dtype=torch.float32)))\n\n    def forward(self, position_ids):\n        # broadcast weights to correct shape\n        batch_size = position_ids.shape[0]\n        sequence_length = position_ids.shape[1]\n\n        broadcasted_weights = [\n            weight.expand((batch_size,) + self.axial_pos_shape + weight.shape[-1:]) for weight in self.weights\n        ]\n\n        if self.training is True:\n            assert (\n                reduce(mul, self.axial_pos_shape) == sequence_length\n            ), \"If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.\".format(\n                self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)\n            )\n            if self.dropout > 0:\n                weights = torch.cat(broadcasted_weights, dim=-1)\n                # permute weights so that 2D correctly drops dims 1 and 2\n                transposed_weights = weights.transpose(2, 1)\n                # drop entire matrix of last two dims (prev dims 1 and 2)\n                dropped_transposed_weights = nn.functional.dropout2d(\n                    transposed_weights, p=self.dropout, training=self.training\n                )\n                dropped_weights = dropped_transposed_weights.transpose(2, 1)\n\n                position_encodings = torch.reshape(dropped_weights, (batch_size, sequence_length, -1))\n\n            else:\n                position_encodings = torch.cat(\n                    [torch.reshape(weight, (batch_size, sequence_length, -1)) for weight in broadcasted_weights],\n                    dim=-1,\n                )\n\n        else:\n            assert (\n                reduce(mul, self.axial_pos_shape) >= sequence_length\n            ), \"Make sure that config.axial_pos_shape factors: {} multiply at least to max(sequence_length, least_common_mult_chunk_length): max({}, {})\".format(\n                self.axial_pos_shape, sequence_length, self.least_common_mult_chunk_length,\n            )\n\n            # reshape axial encodings and use only until sequence_length\n            position_encodings = torch.cat(broadcasted_weights, dim=-1)\n            position_encodings = position_encodings.view(batch_size, -1, position_encodings.shape[-1])[\n                
:, :sequence_length\n            ]\n\n        return position_encodings\n\n\nclass PositionEmbeddings(nn.Module):\n    \"\"\"Constructs conventional position embeddings of shape `[max_pos_embeddings, hidden_size]`.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n        self.embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n\n    def forward(self, position_ids):\n        position_embeddings = self.embedding(position_ids)\n        position_embeddings = nn.functional.dropout(position_embeddings, p=self.dropout, training=self.training)\n        return position_embeddings\n\n\nclass ReformerEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.max_position_embeddings = config.max_position_embeddings\n        self.dropout = config.hidden_dropout_prob\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)\n        self.position_embeddings = (\n            AxialPositionEmbeddings(config) if config.axial_pos_embds else PositionEmbeddings(config)\n        )\n\n    def forward(self, input_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n            device = input_ids.device\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n            device = inputs_embeds.device\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n\n        assert (\n            position_ids.shape[-1] <= self.max_position_embeddings\n        ), \"Sequence Length: {} has to be larger equal than config.max_position_embeddings: {}\".format(\n            position_ids.shape[-1], self.max_position_embeddings\n        )\n\n        # dropout\n        embeddings = nn.functional.dropout(inputs_embeds, p=self.dropout, training=self.training)\n\n        # add positional embeddings\n        position_embeddings = self.position_embeddings(position_ids)\n        embeddings = embeddings + position_embeddings\n        return embeddings\n\n\nclass EfficientAttentionMixin:\n    \"\"\"\n    A few utilities for nn.Modules in Reformer, to be used as a mixin.\n    \"\"\"\n\n    def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after):\n        \"\"\" Used to implement attention between consecutive chunks.\n\n            Args:\n                vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...]\n                num_chunks_before: chunks before current chunk to include in attention\n                num_chunks_after: chunks after current chunk to include in attention\n\n            Returns:\n                tensor of shape [num_chunks, N * chunk_length, ...], where\n                N = (1 + num_chunks_before + num_chunks_after).\n        \"\"\"\n        if num_chunks_before == 0 and num_chunks_after == 0:\n            return vectors\n\n        slices = []\n        for i in range(-num_chunks_before, num_chunks_after + 1):\n            if i == 0:\n                slices.append(vectors)\n            else:\n                slices.append(torch.cat([vectors[:, :, i:, ...], 
vectors[:, :, :i, ...]], dim=2))\n        return torch.cat(slices, dim=3)\n\n    def _split_hidden_size_dim(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            splits hidden_size dim into attn_head_size and num_attn_heads\n        \"\"\"\n        new_x_shape = x.size()[:-1] + (num_attn_heads, attn_head_size)\n        x = x.view(*new_x_shape)\n        return x.transpose(2, 1)\n\n    def _merge_hidden_size_dims(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            merges attn_head_size dim and num_attn_heads dim into hidden_size\n        \"\"\"\n        x = x.permute(0, 2, 1, 3)\n        return torch.reshape(x, (x.size()[0], -1, num_attn_heads * attn_head_size))\n\n    def _split_seq_length_dim_to(self, vectors, dim_factor_1, dim_factor_2, num_attn_heads, attn_head_size=None):\n        \"\"\"\n            splits sequence length dim of vectors into `dim_factor_1` and `dim_factor_2` dims\n        \"\"\"\n        batch_size = vectors.shape[0]\n        split_dim_shape = (batch_size, num_attn_heads, dim_factor_1, dim_factor_2)\n\n        if len(vectors.shape) == 4:\n            return torch.reshape(vectors, split_dim_shape + (attn_head_size,))\n        elif len(vectors.shape) == 3:\n            return torch.reshape(vectors, split_dim_shape)\n        else:\n            raise ValueError(\"Input vector rank should be one of [3, 4], but is: {}\".format(len(vectors.shape)))\n\n\nclass LSHSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n\n        self.chunk_length = config.lsh_attn_chunk_length\n        self.num_hashes = config.num_hashes\n        self.num_buckets = config.num_buckets\n        self.num_chunks_before = config.lsh_num_chunks_before\n        self.num_chunks_after = config.lsh_num_chunks_after\n        self.hash_seed = config.hash_seed\n        self.is_decoder = config.is_decoder\n        self.max_position_embeddings = config.max_position_embeddings\n\n        self.dropout = config.lsh_attention_probs_dropout_prob\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        # projection matrices\n        self.query_key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        # save mask value here. 
Need fp32 and fp16 mask values\n        self.register_buffer(\"self_mask_value_float16\", torch.tensor(-1e3))\n        self.register_buffer(\"self_mask_value_float32\", torch.tensor(-1e5))\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n        **kwargs\n    ):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # num hashes can optionally be overwritten by user\n        num_hashes = num_hashes if num_hashes is not None else self.num_hashes\n\n        # project hidden_states to query_key and value\n        query_key_vectors = self.query_key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # free memory\n        del hidden_states\n\n        query_key_vectors = self._split_hidden_size_dim(\n            query_key_vectors, self.num_attention_heads, self.attention_head_size\n        )\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of value_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        # set `num_buckets` on the fly, recommended way to do it\n        if self.num_buckets is None:\n            self._set_num_buckets(sequence_length)\n\n        # use cached buckets for backprop only\n        if buckets is None:\n            # hash query key vectors into buckets\n            buckets = self._hash_vectors(query_key_vectors, num_hashes)\n\n        assert (\n            int(buckets.shape[-1]) == num_hashes * sequence_length\n        ), \"last dim of buckets is {}, but should be {}\".format(buckets.shape[-1], num_hashes * sequence_length)\n\n        sorted_bucket_idx, undo_sorted_bucket_idx = self._get_sorted_bucket_idx_and_undo_sorted_bucket_idx(\n            sequence_length, buckets, num_hashes\n        )\n\n        # make sure bucket idx is not longer then sequence length\n        sorted_bucket_idx = sorted_bucket_idx % sequence_length\n\n        # cluster query key value vectors according to hashed buckets\n        query_key_vectors = self._gather_by_expansion(query_key_vectors, sorted_bucket_idx, num_hashes)\n        value_vectors = self._gather_by_expansion(value_vectors, sorted_bucket_idx, num_hashes)\n\n        query_key_vectors = self._split_seq_length_dim_to(\n            query_key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # 
scale key vectors\n        key_vectors = self._len_and_dim_norm(query_key_vectors)\n\n        # get attention probs\n        out_vectors, logits, attention_probs = self._attend(\n            query_vectors=query_key_vectors,\n            key_vectors=key_vectors,\n            value_vectors=value_vectors,\n            sorted_bucket_idx=sorted_bucket_idx,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n        )\n        # free memory\n        del query_key_vectors, key_vectors, value_vectors\n\n        # sort clusters back to correct ordering\n        out_vectors, logits = ReverseSort.apply(\n            out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, self.num_hashes\n        )\n\n        # sum up all hash rounds\n        if num_hashes > 1:\n            out_vectors = self._split_seq_length_dim_to(\n                out_vectors, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            )\n            logits = self._split_seq_length_dim_to(\n                logits, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            ).unsqueeze(-1)\n\n            probs_vectors = torch.exp(logits - torch.logsumexp(logits, dim=2, keepdim=True))\n            out_vectors = torch.sum(out_vectors * probs_vectors, dim=2)\n            # free memory\n            del probs_vectors\n\n        # free memory\n        del logits\n\n        assert out_vectors.shape == (\n            batch_size,\n            self.num_attention_heads,\n            sequence_length,\n            self.attention_head_size,\n        ), \"out_vectors have be of shape `[batch_size, config.num_attention_heads, sequence_length, config.attention_head_size]`.\"\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LSHSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs, buckets=buckets)\n\n    def _hash_vectors(self, vectors, num_hashes):\n        batch_size = vectors.shape[0]\n\n        # See https://arxiv.org/pdf/1509.02897.pdf\n        # We sample a different random rotation for each round of hashing to\n        # decrease the probability of hash misses.\n        if isinstance(self.num_buckets, int):\n            assert (\n                self.num_buckets % 2 == 0\n            ), \"There should be an even number of bucktes, but `self.num_bucktes`: {}\".format(self.num_buckets)\n            rotation_size = self.num_buckets\n            num_buckets = self.num_buckets\n        else:\n            # Factorize the hash if self.num_buckets is a list or tuple\n            rotation_size, num_buckets = 0, 1\n            for bucket_factor in self.num_buckets:\n                assert bucket_factor % 2 == 0, \"The number of buckets should be even, but `num_bucket`: {}\".format(\n                    bucket_factor\n                )\n                rotation_size = rotation_size + bucket_factor\n                num_buckets = num_buckets * bucket_factor\n\n        # remove gradient\n        vectors = vectors.detach()\n\n        if self.hash_seed is not None:\n            # for determinism\n            torch.manual_seed(self.hash_seed)\n\n        rotations_shape = (self.num_attention_heads, vectors.shape[-1], num_hashes, rotation_size // 2)\n        # create a random self.attention_head_size x num_hashes x num_buckets/2\n        random_rotations = 
torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)\n\n        # Output dim: Batch_Size x Num_Attn_Heads x Num_Hashes x Seq_Len x Num_Buckets/2\n        rotated_vectors = torch.einsum(\"bmtd,mdhr->bmhtr\", vectors, random_rotations)\n\n        if isinstance(self.num_buckets, int) or len(self.num_buckets) == 1:\n            rotated_vectors = torch.cat([rotated_vectors, -rotated_vectors], dim=-1)\n            buckets = torch.argmax(rotated_vectors, dim=-1)\n        else:\n            # Get the buckets for them and combine.\n            buckets, cur_sum, cur_product = None, 0, 1\n            for bucket_factor in self.num_buckets:\n                rotated_vectors_factor = rotated_vectors[..., cur_sum : cur_sum + (bucket_factor // 2)]\n                cur_sum = cur_sum + bucket_factor // 2\n                rotated_vectors_factor = torch.cat([rotated_vectors_factor, -rotated_vectors_factor], dim=-1)\n\n                if buckets is None:\n                    buckets = torch.argmax(rotated_vectors_factor, dim=-1)\n                else:\n                    buckets = buckets + (cur_product * torch.argmax(rotated_vectors_factor, dim=-1))\n\n                cur_product = cur_product * bucket_factor\n\n        # buckets is now (Batch_size x Num_Attn_Heads x Num_Hashes x Seq_Len).\n        # Next we add offsets so that bucket numbers from different hashing rounds don't overlap.\n        offsets = torch.arange(num_hashes, device=vectors.device)\n        offsets = (offsets * num_buckets).view((1, 1, -1, 1))\n\n        # expand to batch size and num attention heads\n        offsets = offsets.expand((batch_size, self.num_attention_heads) + offsets.shape[-2:])\n        offset_buckets = (buckets + offsets).flatten(start_dim=2, end_dim=3)\n\n        return offset_buckets\n\n    def _get_sorted_bucket_idx_and_undo_sorted_bucket_idx(self, sequence_length, buckets, num_hashes):\n        # no gradients are needed\n        with torch.no_grad():\n            batch_size = buckets.shape[0]\n\n            # arange and expand\n            orig_indices = torch.arange(num_hashes * sequence_length, device=buckets.device).view(1, 1, -1)\n            orig_indices = orig_indices.expand(batch_size, self.num_attention_heads, orig_indices.shape[-1])\n\n            # scale buckets\n            scaled_buckets = sequence_length * buckets + (orig_indices % sequence_length)\n\n            # remove gradient\n            scaled_buckets = scaled_buckets.detach()\n\n            # Hash-based sort\n            sorted_bucket_idx = torch.argsort(scaled_buckets, dim=-1)\n\n            # create simple indices to scatter to, to have undo sort\n            indices = (\n                torch.arange(sorted_bucket_idx.shape[-1], device=buckets.device)\n                .view(1, 1, -1)\n                .expand(sorted_bucket_idx.shape)\n            )\n\n            # get undo sort\n            undo_sorted_bucket_idx = sorted_bucket_idx.new(*sorted_bucket_idx.size())\n            undo_sorted_bucket_idx.scatter_(-1, sorted_bucket_idx, indices)\n\n        return sorted_bucket_idx, undo_sorted_bucket_idx\n\n    def _set_num_buckets(self, sequence_length):\n        # `num_buckets` should be set to 2 * sequence_length // chunk_length as recommended in paper\n        num_buckets_pow_2 = (2 * (sequence_length // self.chunk_length)).bit_length() - 1\n        # make sure buckets are power of 2\n        num_buckets = 2 ** num_buckets_pow_2\n\n        # factorize `num_buckets` if `num_buckets` becomes too large\n        num_buckets_limit = 
2 * max(\n            int((self.max_position_embeddings // self.chunk_length) ** (0.5)), self.chunk_length,\n        )\n        if num_buckets > num_buckets_limit:\n            num_buckets = [2 ** (num_buckets_pow_2 // 2), 2 ** (num_buckets_pow_2 - num_buckets_pow_2 // 2)]\n\n        logger.warning(\"config.num_buckets is not set. Setting config.num_buckets to {}...\".format(num_buckets))\n\n        # set num buckets in config to be properly saved\n        self.config.num_buckets = num_buckets\n        self.num_buckets = num_buckets\n\n    def _attend(\n        self, query_vectors, key_vectors, value_vectors, sorted_bucket_idx, attention_mask, head_mask,\n    ):\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n\n        # get logits and dots\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        query_bucket_idx = self._split_seq_length_dim_to(\n            sorted_bucket_idx, -1, self.chunk_length, self.num_attention_heads\n        )\n        key_value_bucket_idx = self._look_adjacent(query_bucket_idx, self.num_chunks_before, self.num_chunks_after)\n\n        # get correct mask values depending on precision\n        if query_key_dots.dtype == torch.float16:\n            self_mask_value = self.self_mask_value_float16.half()\n            mask_value = self.mask_value_float16.half()\n        else:\n            self_mask_value = self.self_mask_value_float32\n            mask_value = self.mask_value_float32\n\n        mask = self._compute_attn_mask(query_bucket_idx, key_value_bucket_idx, attention_mask)\n\n        if mask is not None:\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # Self mask is ALWAYS applied.\n        # From the reformer paper (https://arxiv.org/pdf/2001.04451.pdf):\n        # \" While attention to the future is not allowed, typical implementations of the\n        # Transformer do allow a position to attend to itself.\n        # Such behavior is undesirable in a shared-QK formulation because the dot-product\n        # of a query vector with itself will almost always be greater than the dot product of a\n        # query vector with a vector at another position. We therefore modify the masking\n        # to forbid a token from attending to itself, except in situations\n        # where a token has no other valid attention targets (e.g. 
the first token in a sequence) \"\n\n        self_mask = torch.ne(query_bucket_idx.unsqueeze(-1), key_value_bucket_idx.unsqueeze(-2)).to(\n            query_bucket_idx.device\n        )\n\n        # apply self_mask\n        query_key_dots = torch.where(self_mask, query_key_dots, self_mask_value)\n\n        # free memory\n        del self_mask\n\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        # dots shape is `[batch_size, num_attn_heads, num_hashes * seq_len // chunk_length, chunk_length, chunk_length * (1 + num_chunks_before + num_chunks_after)]`\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del query_key_dots\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        logits = logits.flatten(start_dim=2, end_dim=3).squeeze(-1)\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        return out_vectors, logits, attention_probs\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask):\n        mask = None\n\n        # Causal mask\n        if self.is_decoder:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask: chunk, look up correct mask value from key_value_bucket_idx\n        # IMPORTANT: official trax code does not use a mask for LSH Atttention. Not sure why.\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, None, :]\n            # expand attn_mask to fit with key_value_bucket_idx shape\n            attention_mask = attention_mask.expand(query_indices.shape[:-1] + (-1,))\n            key_attn_mask = torch.gather(attention_mask, -1, key_indices)\n            query_attn_mask = torch.gather(attention_mask, -1, query_indices)\n            # expand to query_key_dots shape: duplicate along query axis since key sorting is the same for each query position in chunk\n            attn_mask = query_attn_mask.unsqueeze(-1) * key_attn_mask.unsqueeze(-2)\n            # free memory\n            del query_attn_mask, key_attn_mask, attention_mask\n\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n\n        return mask\n\n    def _len_and_dim_norm(self, vectors):\n        \"\"\"\n            length and attention head size dim normalization\n        \"\"\"\n        vectors = self._len_norm(vectors)\n        vectors = vectors * torch.rsqrt(\n            torch.tensor(self.attention_head_size, device=vectors.device, dtype=vectors.dtype)\n        )\n        return vectors\n\n    def _len_norm(self, x, epsilon=1e-6):\n        \"\"\"\n            length normalization\n        \"\"\"\n        variance = torch.mean(x ** 2, -1, keepdim=True)\n        norm_x = x * torch.rsqrt(variance + epsilon)\n        return norm_x\n\n    def _gather_by_expansion(self, vectors, idxs, num_hashes):\n        \"\"\"\n            expand dims of idxs and vectors for all hashes and gather\n        \"\"\"\n        expanded_idxs = idxs.unsqueeze(-1).expand(-1, 
-1, -1, self.attention_head_size)\n        vectors = vectors.repeat(1, 1, num_hashes, 1)\n        return torch.gather(vectors, 2, expanded_idxs)\n\n\nclass ReverseSort(Function):\n    \"\"\"\n        After chunked attention is applied which sorted clusters,\n        original ordering has to be restored.\n        Since customized backward function is used for Reformer,\n        the gradients of the output vectors have to be explicitely\n        sorted here.\n    \"\"\"\n\n    @staticmethod\n    def forward(ctx, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, num_hashes):\n        # save sorted_bucket_idx for backprop\n        with torch.no_grad():\n            ctx.sorted_bucket_idx = sorted_bucket_idx\n            ctx.num_hashes = num_hashes\n\n            # undo sort to have correct order for next layer\n            expanded_undo_sort_indices = undo_sorted_bucket_idx.unsqueeze(-1).expand(out_vectors.shape)\n            out_vectors = torch.gather(out_vectors, 2, expanded_undo_sort_indices)\n            logits = torch.gather(logits, 2, undo_sorted_bucket_idx)\n        return out_vectors, logits\n\n    @staticmethod\n    def backward(ctx, grad_out_vectors, grad_logits):\n        # get parameters saved in ctx\n        sorted_bucket_idx = ctx.sorted_bucket_idx\n        num_hashes = ctx.num_hashes\n\n        # get real gradient shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes\n        grad_logits_shape = grad_logits.shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes x ChunkLen\n        grad_out_vectors_shape = grad_out_vectors.shape\n\n        # split gradient vectors and sorted bucket idxs by concatenated chunk dimension to gather correct indices\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen\n        grad_logits = grad_logits.view((grad_logits_shape[:2] + (num_hashes, -1)))\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen x ChunkLen\n        grad_out_vectors = grad_out_vectors.view(\n            (grad_out_vectors_shape[:2] + (num_hashes, -1) + grad_out_vectors_shape[-1:])\n        )\n\n        # reshape and expand\n        sorted_bucket_idx = torch.reshape(sorted_bucket_idx, (sorted_bucket_idx.shape[:2] + (num_hashes, -1)))\n        expanded_sort_indices = sorted_bucket_idx.unsqueeze(-1).expand(grad_out_vectors.shape)\n        # reverse sort of forward\n        grad_out_vectors = torch.gather(grad_out_vectors, 3, expanded_sort_indices)\n        grad_logits = torch.gather(grad_logits, 3, sorted_bucket_idx)\n\n        # reshape into correct shape\n        grad_logits = torch.reshape(grad_logits, grad_logits_shape)\n        grad_out_vectors = torch.reshape(grad_out_vectors, grad_out_vectors_shape)\n\n        # return grad and `None` fillers for last 3 forward args\n        return grad_out_vectors, grad_logits, None, None, None\n\n\nclass LocalSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n\n        self.num_attention_heads = config.num_attention_heads\n        self.chunk_length = config.local_attn_chunk_length\n        self.num_chunks_before = config.local_num_chunks_before\n        self.num_chunks_after = config.local_num_chunks_after\n        self.is_decoder = config.is_decoder\n        self.pad_token_id = config.pad_token_id\n\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        
# projection matrices\n        self.query = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        self.dropout = config.local_attention_probs_dropout_prob\n\n        # save mask value here\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None, do_output_attentions=False, **kwargs):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # project hidden_states to query, key and value\n        query_vectors = self.query(hidden_states)\n        key_vectors = self.key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # split last dim into `config.num_attention_heads` and `config.attention_head_size`\n        query_vectors = self._split_hidden_size_dim(query_vectors, self.num_attention_heads, self.attention_head_size)\n        key_vectors = self._split_hidden_size_dim(key_vectors, self.num_attention_heads, self.attention_head_size)\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # normalize key vectors\n        key_vectors = key_vectors / torch.sqrt(\n            torch.tensor(self.attention_head_size, device=key_vectors.device, dtype=key_vectors.dtype)\n        )\n\n        # chunk vectors\n        # B x Num_Attn_Head x Seq_Len // chunk_len x chunk_len  x  attn_head_size\n        query_vectors = self._split_seq_length_dim_to(\n            query_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        key_vectors = self._split_seq_length_dim_to(\n            key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        # chunk indices\n        indices = torch.arange(sequence_length, device=query_vectors.device).repeat(\n            batch_size, self.num_attention_heads, 1\n        )\n        query_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, self.num_attention_heads)\n        key_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, 
self.num_attention_heads)\n\n        # append chunks before and after\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n        key_indices = self._look_adjacent(key_indices, self.num_chunks_before, self.num_chunks_after)\n\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        mask = self._compute_attn_mask(query_indices, key_indices, attention_mask, query_key_dots.shape)\n\n        if mask is not None:\n            # get mask tensor depending on half precision or not\n            if query_key_dots.dtype == torch.float16:\n                mask_value = self.mask_value_float16.half()\n            else:\n                mask_value = self.mask_value_float32\n\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # softmax\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del logits\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        assert out_vectors.shape == (batch_size, self.num_attention_heads, sequence_length, self.attention_head_size,)\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LocalSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs)\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask, query_key_dots_shape):\n        mask = None\n\n        # chunk attention mask and look before and after\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, :]\n            attention_mask = self._split_seq_length_dim_to(attention_mask, -1, self.chunk_length, 1)\n            attention_mask_key = self._look_adjacent(attention_mask, self.num_chunks_before, self.num_chunks_after)\n\n        # Causal mask\n        if self.is_decoder is True:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask\n        if attention_mask is not None:\n            # create attn_mask\n            attn_mask = (attention_mask.unsqueeze(-1) * attention_mask_key.unsqueeze(-2)).expand(query_key_dots_shape)\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n        return mask\n\n\nclass ReformerSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        all_head_size = config.num_attention_heads * config.attention_head_size\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = 
nn.Linear(all_head_size, config.hidden_size, bias=False)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ReformerAttention(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.layer_id = layer_id\n        self.attn_layers = config.attn_layers\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        if len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"lsh\":\n            self.self_attention = LSHSelfAttention(config)\n        elif len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"local\":\n            self.self_attention = LocalSelfAttention(config)\n        elif len(set(self.attn_layers)) == 2 and set(self.attn_layers) == set([\"lsh\", \"local\"]):\n            # get correct attn layers\n            if self.attn_layers[self.layer_id] == \"lsh\":\n                self.self_attention = LSHSelfAttention(config)\n            else:\n                self.self_attention = LocalSelfAttention(config)\n        else:\n            raise NotImplementedError(\n                \"Only attn layer types 'lsh' and 'local' exist, but got `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                    self.attn_layers\n                )\n            )\n        self.output = ReformerSelfOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n    ):\n        hidden_states = self.layer_norm(hidden_states)\n\n        # use cached buckets for backprob if buckets not None for LSHSelfAttention\n        self_attention_outputs = self.self_attention(\n            hidden_states=hidden_states,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_attentions=do_output_attentions,\n            buckets=buckets,\n        )\n        attention_output = self.output(self_attention_outputs.hidden_states)\n\n        # add buckets if necessary\n        if hasattr(self_attention_outputs, \"buckets\"):\n            buckets = self_attention_outputs.buckets\n        else:\n            buckets = None\n\n        return AttentionOutput(\n            hidden_states=attention_output, attention_probs=self_attention_outputs.attention_probs, buckets=buckets,\n        )\n\n\nclass ReformerFeedForwardDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        if isinstance(config.hidden_act, str):\n            self.act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.act_fn = config.hidden_act\n\n        self.dense = nn.Linear(config.hidden_size, config.feed_forward_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        hidden_states = self.act_fn(hidden_states)\n        return hidden_states\n\n\nclass ReformerFeedForwardOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = nn.Linear(config.feed_forward_size, 
config.hidden_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ChunkReformerFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense = ReformerFeedForwardDense(config)\n        self.output = ReformerFeedForwardOutput(config)\n\n    def forward(self, attention_output):\n        return apply_chunking_to_forward(\n            self.chunk_size_feed_forward, self.seq_len_dim, self.forward_chunk, attention_output,\n        )\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.layer_norm(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        return self.output(hidden_states)\n\n\nclass ReformerLayer(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.attention = ReformerAttention(config, layer_id)\n        # dropout requires to have the same\n        # seed for forward and backward pass\n        self.attention_seed = None\n        self.feed_forward_seed = None\n\n        self.feed_forward = ChunkReformerFeedForward(config)\n\n    def _init_attention_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            attention layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.attention_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.attention_seed)\n        else:\n            # CPU\n            self.attention_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.attention_seed)\n\n    def _init_feed_forward_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            feed forward layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.feed_forward_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.feed_forward_seed)\n        else:\n            # CPU\n            self.feed_forward_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.feed_forward_seed)\n\n    def forward(\n        self,\n        prev_attn_output,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n    ):\n        with torch.no_grad():\n            # every forward pass we sample a different seed\n            # for dropout and save for forward fn in backward pass\n            # to have correct dropout\n            self._init_attention_seed()\n            attn_outputs = self.attention(\n                
hidden_states=hidden_states,\n                head_mask=head_mask,\n                attention_mask=attention_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = attn_outputs.hidden_states\n\n            # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n            # Y_1 = X_1 + f(X_2)\n            attn_output = prev_attn_output + attn_output\n\n            # free memory\n            del prev_attn_output\n\n            # every forward pass we sample a different seed\n            # for dropout and save seed for forward fn in backward\n            # to have correct dropout\n            self._init_feed_forward_seed()\n            # Y_2 = X_2 + g(Y_1)\n            hidden_states = hidden_states + self.feed_forward(attn_output)\n\n        return ReformerOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            attention_probs=attn_outputs.attention_probs,\n            buckets=attn_outputs.buckets,\n        )\n\n    def backward_pass(\n        self,\n        next_attn_output,\n        hidden_states,\n        grad_attn_output,\n        grad_hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        buckets=None,\n    ):\n        # Implements the backward pass for reversible ResNets.\n        # A good blog post on how this works can be found here:\n        # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n        # This code is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n\n        with torch.enable_grad():\n            next_attn_output.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.feed_forward_seed)\n            # g(Y_1)\n            res_hidden_states = self.feed_forward(next_attn_output)\n            res_hidden_states.backward(grad_hidden_states, retain_graph=True)\n\n        with torch.no_grad():\n            # X_2 = Y_2 - g(Y_1)\n            hidden_states = hidden_states - res_hidden_states\n            del res_hidden_states\n\n            grad_attn_output = grad_attn_output + next_attn_output.grad\n            next_attn_output.grad = None\n\n        with torch.enable_grad():\n            hidden_states.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.attention_seed)\n            # f(X_2)\n            # use cached buckets for backprob if buckets not None for LSHSelfAttention\n            output = self.attention(\n                hidden_states=hidden_states, head_mask=head_mask, attention_mask=attention_mask, buckets=buckets,\n            ).hidden_states\n            output.backward(grad_attn_output, retain_graph=True)\n\n        with torch.no_grad():\n            # X_1 = Y_1 - f(X_2)\n            attn_output = next_attn_output - output\n            del output, next_attn_output\n\n            grad_hidden_states = grad_hidden_states + hidden_states.grad\n            hidden_states.grad = None\n            hidden_states = hidden_states.detach()\n\n        return ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n\nclass _ReversibleFunction(Function):\n    \"\"\"\n    To 
prevent PyTorch from performing the usual backpropagation,\n    a customized backward function is implemented here. This way\n    it is made sure that no memory expensive activations are\n    saved during the forward pass.\n    This function is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n    \"\"\"\n\n    @staticmethod\n    def forward(\n        ctx,\n        hidden_states,\n        layers,\n        attention_mask,\n        head_mask,\n        num_hashes,\n        all_hidden_states,\n        all_attentions,\n        do_output_hidden_states,\n        do_output_attentions,\n    ):\n        all_buckets = ()\n\n        # split duplicated tensor\n        hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)\n\n        for layer, layer_head_mask in zip(layers, head_mask):\n            if do_output_hidden_states is True:\n                all_hidden_states.append(hidden_states)\n\n            layer_outputs = layer(\n                prev_attn_output=attn_output,\n                hidden_states=hidden_states,\n                attention_mask=attention_mask,\n                head_mask=layer_head_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = layer_outputs.attn_output\n            hidden_states = layer_outputs.hidden_states\n            all_buckets = all_buckets + (layer_outputs.buckets,)\n\n            if do_output_attentions:\n                all_attentions.append(layer_outputs.attention_probs)\n\n        # Add last layer\n        if do_output_hidden_states is True:\n            all_hidden_states.append(hidden_states)\n\n        # attach params to ctx for backward\n        ctx.save_for_backward(attn_output.detach(), hidden_states.detach())\n        ctx.layers = layers\n        ctx.all_buckets = all_buckets\n        ctx.head_mask = head_mask\n        ctx.attention_mask = attention_mask\n\n        # Concatenate 2 RevNet outputs\n        return torch.cat([attn_output, hidden_states], dim=-1)\n\n    @staticmethod\n    def backward(ctx, grad_hidden_states):\n        grad_attn_output, grad_hidden_states = torch.chunk(grad_hidden_states, 2, dim=-1)\n\n        # retrieve params from ctx for backward\n        attn_output, hidden_states = ctx.saved_tensors\n\n        # create tuple\n        output = ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n        # free memory\n        del grad_attn_output, grad_hidden_states, attn_output, hidden_states\n\n        layers = ctx.layers\n        all_buckets = ctx.all_buckets\n        head_mask = ctx.head_mask\n        attention_mask = ctx.attention_mask\n\n        for idx, layer in enumerate(layers[::-1]):\n            # pop last buckets from stack\n            buckets = all_buckets[-1]\n            all_buckets = all_buckets[:-1]\n\n            # backprop\n            output = layer.backward_pass(\n                next_attn_output=output.attn_output,\n                hidden_states=output.hidden_states,\n                grad_attn_output=output.grad_attn_output,\n                grad_hidden_states=output.grad_hidden_states,\n                head_mask=head_mask[len(layers) - idx - 1],\n                attention_mask=attention_mask,\n                buckets=buckets,\n            )\n\n        assert all_buckets == (), \"buckets have 
to be empty after backpropagation\"\n        grad_hidden_states = torch.cat([output.grad_attn_output, output.grad_hidden_states], dim=-1)\n\n        # num of return vars has to match num of forward() args\n        # return gradient for hidden_states arg and None for other args\n        return grad_hidden_states, None, None, None, None, None, None, None, None\n\n\nclass ReformerEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.layers = nn.ModuleList([ReformerLayer(config, i) for i in range(config.num_hidden_layers)])\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.layer_norm = nn.LayerNorm(2 * config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        # hidden_states and attention lists to be filled if wished\n        all_hidden_states = []\n        all_attentions = []\n\n        # concat same tensor for reversible ResNet\n        hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)\n        hidden_states = _ReversibleFunction.apply(\n            hidden_states,\n            self.layers,\n            attention_mask,\n            head_mask,\n            num_hashes,\n            all_hidden_states,\n            all_attentions,\n            do_output_hidden_states,\n            do_output_attentions,\n        )\n\n        # Apply layer norm to concatenated hidden states\n        hidden_states = self.layer_norm(hidden_states)\n\n        # Apply dropout\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n\n        return ReformerEncoderOutput(\n            hidden_states=hidden_states, all_hidden_states=all_hidden_states, all_attentions=all_attentions\n        )\n\n\nclass ReformerOnlyLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.seq_len_dim = 1\n        self.chunk_size_lm_head = config.chunk_size_lm_head\n        self.decoder = nn.Linear(2 * config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass ReformerPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ReformerConfig\n    base_model_prefix = \"reformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"input_ids\": input_ids,\n            \"attention_mask\": input_mask,\n        }\n        
return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, AxialPositionEmbeddings):\n            for weight in module.weights:\n                torch.nn.init.normal_(weight, std=self.config.axial_norm_std)\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nREFORMER_START_DOCSTRING = r\"\"\"\n    Reformer was proposed in\n    `Reformer: The Efficient Transformer`_\n    by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.\n\n    .. _`Reformer: The Efficient Transformer`:\n        https://arxiv.org/abs/2001.04451\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ReformerConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nREFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            During training the input_ids sequence_length has to be a multiple of the relevant model's\n            chunk lengths (lsh's, local's or both). During evaluation, the indices are automatically\n            padded to be a multiple of the chunk length.\n\n            Indices can be obtained using :class:`transformers1.ReformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        num_hashes (:obj:`int`, `optional`, defaults to :obj:`None`):\n            `num_hashes` is the number of hashing rounds that should be performed during\n            bucketing. Setting `num_hashes` overwrites the default `num_hashes` defined\n            in `config.num_hashes`.\n            For more information, see `num_hashes` in :class:`transformers1.ReformerConfig`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Reformer Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    REFORMER_START_DOCSTRING,\n)\nclass ReformerModel(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        assert (\n            self.config.num_hidden_layers > 0\n        ), \"`config.attn_layers` is empty. Select at least one attn layer form ['lsh', 'local']\"\n\n        self.embeddings = ReformerEmbeddings(config)\n        self.encoder = ReformerEncoder(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding 
outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModel, ReformerTokenizer\n        import torch\n\n        tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModel.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n\n        # TODO(PVP): delete when PR to change output_attentions is made\n        do_output_attentions = self.config.output_attentions\n        do_output_hidden_states = self.config.output_hidden_states\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()  # noqa: F841\n            device = input_ids.device\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]  # noqa: F841\n            device = inputs_embeds.device\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        assert (\n            len(input_shape) == 2\n        ), \"`input_ids` have be of shape `[batch_size, sequence_length]`, but got shape: {}\".format(input_shape)\n\n        # prepare head mask\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers, is_attention_chunked=True)\n\n        # original sequence length for padding\n        orig_sequence_length = input_shape[-1]\n\n        # if needs padding\n        least_common_mult_chunk_length = _get_least_common_mult_chunk_len(self.config)\n        must_pad_to_match_chunk_length = input_shape[-1] % least_common_mult_chunk_length != 0\n\n        if must_pad_to_match_chunk_length:\n            padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length\n\n            if self.training is True:\n                raise ValueError(\n                    \"If training, sequence Length {} has to be a multiple of least common multiple chunk_length {}. 
Please consider padding the input to a length of {}.\".format(\n                        input_shape[-1], least_common_mult_chunk_length, input_shape[-1] + padding_length\n                    )\n                )\n\n            # pad input\n            input_ids, inputs_embeds, attention_mask, position_ids, input_shape = self._pad_to_mult_of_chunk_length(\n                input_ids,\n                inputs_embeds=inputs_embeds,\n                attention_mask=attention_mask,\n                position_ids=position_ids,\n                input_shape=input_shape,\n                padding_length=padding_length,\n                padded_seq_length=least_common_mult_chunk_length,\n                device=device,\n            )\n\n        embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, inputs_embeds=inputs_embeds)\n\n        encoder_outputs = self.encoder(\n            hidden_states=embedding_output,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n        sequence_output = encoder_outputs.hidden_states\n\n        # if padding was applied\n        if must_pad_to_match_chunk_length:\n            sequence_output = sequence_output[:, :orig_sequence_length]\n\n        outputs = (sequence_output,)\n        # TODO(PVP): Replace by named tuple after namedtuples are introduced in the library.\n        if do_output_hidden_states is True:\n            outputs = outputs + (encoder_outputs.all_hidden_states,)\n        if do_output_attentions is True:\n            outputs = outputs + (encoder_outputs.all_attentions,)\n        return outputs\n\n    def _pad_to_mult_of_chunk_length(\n        self,\n        input_ids,\n        inputs_embeds=None,\n        attention_mask=None,\n        position_ids=None,\n        input_shape=None,\n        padding_length=None,\n        padded_seq_length=None,\n        device=None,\n    ):\n        logger.info(\n            \"Input ids are automatically padded from {} to {} to be a multiple of `config.chunk_length`: {}\".format(\n                input_shape[-1], input_shape[-1] + padding_length, padded_seq_length\n            )\n        )\n\n        padded_input_ids = torch.full(\n            (input_shape[0], padding_length), self.config.pad_token_id, device=device, dtype=torch.long,\n        )\n\n        # Extend `attention_mask`\n        if attention_mask is not None:\n            attention_mask = torch.cat(\n                [\n                    attention_mask,\n                    torch.zeros(input_shape[0], padding_length, device=device, dtype=attention_mask.dtype,),\n                ],\n                dim=-1,\n            )\n        else:\n            attention_mask = torch.cat(\n                [\n                    torch.ones(input_shape, device=device, dtype=torch.uint8),\n                    torch.zeros((input_shape[0], padding_length), device=device, dtype=torch.uint8),\n                ],\n                dim=-1,\n            )\n\n        # Extend `input_ids` with padding to match least common multiple chunk_length\n        if input_ids is not None:\n            input_ids = torch.cat([input_ids, padded_input_ids], dim=-1)\n            input_shape = input_ids.size()\n\n            # Pad position ids if given\n            if position_ids is not None:\n                padded_position_ids = torch.arange(input_shape[-1], padded_seq_length, 
dtype=torch.long, device=device)\n                padded_position_ids = padded_position_ids.unsqueeze(0).expand(input_shape[0], padding_length)\n                position_ids = torch.cat([position_ids, padded_position_ids], dim=-1)\n\n        # Extend `inputs_embeds` with padding to match least common multiple chunk_length\n        if inputs_embeds is not None:\n            padded_inputs_embeds = self.embeddings(padded_input_ids, position_ids)\n            inputs_embeds = torch.cat([inputs_embeds, padded_inputs_embeds], dim=-2)\n            input_shape = inputs_embeds.size()\n        return input_ids, inputs_embeds, attention_mask, position_ids, input_shape\n\n\n@add_start_docstrings(\"\"\"Reformer Model with a `language modeling` head on top. \"\"\", REFORMER_START_DOCSTRING)\nclass ReformerModelWithLMHead(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.reformer = ReformerModel(config)\n        self.lm_head = ReformerOnlyLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def tie_weights(self):\n        # word embeddings are not tied in Reformer\n        pass\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        position_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        labels=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModelWithLMHead, ReformerTokenizer\n        import torch\n\n   
     tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModelWithLMHead.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n\n        loss, prediction_scores = outputs[:2]\n        \"\"\"\n\n        reformer_outputs = self.reformer(\n            input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n\n        sequence_output = reformer_outputs[0]\n        logits = self.lm_head(sequence_output)\n        outputs = (logits,) + reformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n        return outputs  # (lm_loss), lm_logits, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # TODO(PVP): Add smart caching\n        inputs_dict = {\"input_ids\": input_ids}\n\n        if \"num_hashes\" in kwargs:\n            inputs_dict[\"num_hashes\"] = kwargs[\"num_hashes\"]\n\n        return inputs_dict\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertLayerNorm, BertModel, BertPreTrainedModel, gelu\nfrom .modeling_utils import create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    \"roberta-base-openai-detector\",\n    \"roberta-large-openai-detector\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass RobertaEmbeddings(BertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.padding_idx = config.pad_token_id\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=self.padding_idx)\n        self.position_embeddings = nn.Embedding(\n            config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx\n        )\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. Any padded tokens remain padded.\n                position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx).to(input_ids.device)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super().forward(\n            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds\n        )\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. 
We cannot infer which are padded so just generate\n        sequential position ids.\n\n        :param torch.Tensor inputs_embeds:\n        :return torch.Tensor:\n        \"\"\"\n        input_shape = inputs_embeds.size()[:-1]\n        sequence_length = input_shape[1]\n\n        position_ids = torch.arange(\n            self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device\n        )\n        return position_ids.unsqueeze(0).expand(input_shape)\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaModel(BertModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.BertModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = RobertaEmbeddings(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. \"\"\", ROBERTA_START_DOCSTRING)\nclass RobertaForMaskedLM(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary 
token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMaskedLM\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\nclass RobertaLMHead(nn.Module):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, features, **kwargs):\n        x = self.dense(features)\n        x = gelu(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x)\n\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForSequenceClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.classifier = RobertaClassificationHead(config)\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForSequenceClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = 
(logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForMultipleChoice(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        labels=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMultipleChoice\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMultipleChoice.from_pretrained('roberta-base')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        outputs = self.roberta(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            head_mask=head_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForTokenClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForTokenClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are 
here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\nclass RobertaClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForQuestionAnswering(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape 
:obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint roberta-large is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import RobertaTokenizer, RobertaForQuestionAnswering\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForQuestionAnswering.from_pretrained('roberta-base')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_ids = tokenizer.encode(question, text)\n        start_scores, end_scores = model(torch.tensor([input_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, 
end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 Mesh TensorFlow authors, T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch T5 model. \"\"\"\n\n\nimport copy\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n####################################################\n# This dict contrains shortcut names and associated url\n# for the pretrained weights provided with the models\n####################################################\nT5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n\n####################################################\n# This is a conversion method from TF 1.0 to PyTorch\n# More details: https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28\n####################################################\ndef load_tf_weights_in_t5(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        tf_weights[name] = array\n\n    for txt_name in names:\n        name = txt_name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        if \"_slot_\" in name[-1]:\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        pointer = model\n        array = tf_weights[txt_name]\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] in [\"kernel\", \"scale\", \"embedding\"]:\n                pointer = getattr(pointer, \"weight\")\n            # elif scope_names[0] == 'scale':\n            #     pointer = getattr(pointer, 'weight')\n            # elif scope_names[0] == 'output_bias' or scope_names[0] == 'beta':\n            #     pointer = getattr(pointer, 'bias')\n            # elif scope_names[0] == 'squad':\n            #     pointer = getattr(pointer, 'classifier')\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if scope_names[0] not in [\"kernel\", \"scale\", \"embedding\"]:\n            pointer = getattr(pointer, \"weight\")\n        if scope_names[0] != \"embedding\":\n            logger.info(\"Transposing numpy weight of shape {} for {}\".format(array.shape, name))\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array.astype(np.float32))\n        tf_weights.pop(txt_name, None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    # logger.info(\"Weights not copied to PyTorch model: {}\".format(', '.join(tf_weights.keys())))\n    return model\n\n\n####################################################\n# PyTorch Models are constructed by sub-classing\n# - torch.nn.Module for the layers and\n# - PreTrainedModel for the models (it-self a sub-class of 
torch.nn.Module)\n####################################################\n\n\nclass T5LayerNorm(nn.Module):\n    def __init__(self, hidden_size, eps=1e-6):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__()\n        self.weight = nn.Parameter(torch.ones(hidden_size))\n        self.variance_epsilon = eps\n\n    def forward(self, x):\n        # layer norm should always be calculated in float32\n        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)\n        x = x / torch.sqrt(variance + self.variance_epsilon)\n\n        if self.weight.dtype == torch.float16:\n            x = x.to(torch.float16)\n        return self.weight * x\n\n\nclass T5DenseReluDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)\n        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        h = self.wi(hidden_states)\n        h = F.relu(h)\n        h = self.dropout(h)\n        h = self.wo(h)\n        return h\n\n\nclass T5LayerFF(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.DenseReluDense = T5DenseReluDense(config)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x)\n        layer_output = hidden_states + self.dropout(y)\n        return layer_output\n\n\nclass T5Attention(nn.Module):\n    def __init__(self, config: T5Config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.dropout = config.dropout_rate\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, self.d_kv)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q = prune_linear_layer(self.q, index)\n        self.k = prune_linear_layer(self.k, index)\n        self.v = prune_linear_layer(self.v, index)\n        self.o = prune_linear_layer(self.o, index, dim=1)\n      
  # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.inner_dim = self.d_kv * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += (n < 0).to(torch.long) * num_buckets  # mtf.to_int32(mtf.less(n, 0)) * num_buckets\n            n = torch.abs(n)\n        else:\n            n = torch.max(n, torch.zeros_like(n))\n        # now n is in the range [0, inf)\n\n        # half of the buckets are for exact increments in positions\n        max_exact = num_buckets // 2\n        is_small = n < max_exact\n\n        # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance\n        val_if_large = max_exact + (\n            torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)\n        ).to(torch.long)\n        val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))\n\n        ret += torch.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = torch.arange(qlen, dtype=torch.long)[:, None]\n        memory_position = torch.arange(klen, dtype=torch.long)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position,  # shape (qlen, klen)\n            bidirectional=not self.is_decoder,\n            num_buckets=self.relative_attention_num_buckets,\n        )\n        rp_bucket = rp_bucket.to(self.relative_attention_bias.weight.device)\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def forward(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        
position_bias=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = input.size()\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + past_key_value_state[0].shape[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = kv.size(1)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, self.d_kv).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.inner_dim)\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = torch.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  
# (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass T5LayerSelfAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.SelfAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5LayerCrossAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.EncDecAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        query_length=None,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            query_length=query_length,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5Block(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.layer = nn.ModuleList()\n        self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n        if self.is_decoder:\n            self.layer.append(T5LayerCrossAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n\n        self.layer.append(T5LayerFF(config))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n\n  
      if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. Need to inject it here\n            if present_key_value_state is not None:\n                query_length = present_key_value_state[0].shape[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass T5PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    load_tf_weights = load_tf_weights_in_t5\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = 
torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"decoder_input_ids\": input_ids,\n            \"input_ids\": input_ids,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        factor = self.config.initializer_factor  # Used for testing weights initialization\n        if isinstance(module, T5LayerNorm):\n            module.weight.data.fill_(factor * 1.0)\n        elif isinstance(module, (T5Model, T5ForConditionalGeneration)):\n            # Mesh TensorFlow embeddings initialization\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624\n            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)\n        elif isinstance(module, T5DenseReluDense):\n            # Mesh TensorFlow FF initialization\n            # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56\n            # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89\n            module.wi.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))\n            if hasattr(module.wi, \"bias\") and module.wi.bias is not None:\n                module.wi.bias.data.zero_()\n            module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))\n            if hasattr(module.wo, \"bias\") and module.wo.bias is not None:\n                module.wo.bias.data.zero_()\n        elif isinstance(module, T5Attention):\n            # Mesh TensorFlow attention initialization to avoid scaling before softmax\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136\n            d_model = self.config.d_model\n            d_kv = self.config.d_kv\n            n_heads = self.config.num_heads\n            module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * d_kv) ** -0.5))\n            module.k.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.v.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * d_kv) ** -0.5))\n            if module.has_relative_attention_bias:\n                module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))\n\n    def _shift_right(self, input_ids):\n        decoder_start_token_id = self.config.decoder_start_token_id\n        pad_token_id = self.config.pad_token_id\n\n        assert (\n            decoder_start_token_id is not None\n        ), \"self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. 
See T5 docs for more information\"\n\n        # shift inputs to the right\n        shifted_input_ids = input_ids.new_zeros(input_ids.shape)\n        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()\n        shifted_input_ids[..., 0] = decoder_start_token_id\n\n        assert pad_token_id is not None, \"self.model.config.pad_token_id has to be defined.\"\n        # replace possible -100 values in lm_labels by `pad_token_id`\n        shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)\n\n        assert torch.all(shifted_input_ids >= 0).item(), \"Verify that `lm_labels` has only positive values and -100\"\n\n        return shifted_input_ids\n\n\nclass T5Stack(T5PreTrainedModel):\n    def __init__(self, config, embed_tokens=None):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.block = nn.ModuleList(\n            [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]\n        )\n        self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embed_tokens = new_embeddings\n\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            if self.is_decoder:\n                raise ValueError(\"You have to specify either decoder_input_ids or decoder_inputs_embeds\")\n            else:\n                raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to intialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(input_ids)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_sates\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = past_key_value_states[0][0].shape[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is 
not None:\n            encoder_seq_length = encoder_hidden_states.shape[1]\n            encoder_attention_mask = torch.ones(\n                batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long\n            )\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_layers)\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)  # We keep only self-attention weights for now\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as 
a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (presents,) (all hidden states), (all attentions)\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            To know more on how to prepare :obj:`input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all `decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `decoder_past_key_value_states`).\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass T5Model(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = 
copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n            from transformers1 import T5Tokenizer, T5Model\n\n            tokenizer = T5Tokenizer.from_pretrained('t5-small')\n            model = T5Model.from_pretrained('t5-small')\n        
    input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass T5ForConditionalGeneration(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.model_dim = config.d_model\n\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        lm_labels=None,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            If `past_key_value_states` is used only the last prediction_scores of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output 
of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, T5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model.generate(input_ids)\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:\n            # get decoder inputs from shifting lm labels to the right\n            decoder_input_ids = self._shift_right(lm_labels)\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            assert lm_labels is None, \"Decoder should not use cached key value states when training.\"\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0]\n        # Rescale output before projecting on vocab\n        # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586\n        sequence_output = sequence_output * (self.model_dim ** 
-0.5)\n        lm_logits = self.lm_head(sequence_output)\n\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]  # Add hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = CrossEntropyLoss(ignore_index=-100)\n            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))\n            # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666\n            decoder_outputs = (loss,) + decoder_outputs\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"decoder_input_ids\": input_ids,\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (\n                    layer_past_state.index_select(0, beam_idx),\n                )\n\n            assert reordered_layer_past_states[0].shape == layer_past_states[0].shape\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 ALBERT model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertSelfAttention\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\nclass TFAlbertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.config.vocab_size, self.config.embedding_size],\n                initializer=get_initializer(self.config.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, embedding_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n        x = tf.reshape(inputs, [-1, self.config.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n        return tf.reshape(logits, [batch_size, length, self.config.vocab_size])\n\n\nclass TFAlbertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n           
     \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFAlbertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFAlbertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, 
kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFAlbertAttention(TFBertSelfAttention):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.hidden_size = config.hidden_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(input_tensor)[0]\n        mixed_query_layer = self.query(input_tensor)\n        mixed_key_layer = self.key(input_tensor)\n        mixed_value_layer = self.value(input_tensor)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        self_outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n\n        hidden_states = self_outputs[0]\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        attention_output = self.LayerNorm(hidden_states + input_tensor)\n\n        # add attentions if we output them\n        outputs = (attention_output,) 
+ self_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFAlbertAttention(config, name=\"attention\")\n\n        self.ffn = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn\"\n        )\n\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.ffn_output = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn_output\"\n        )\n        self.full_layer_layer_norm = tf.keras.layers.LayerNormalization(\n            epsilon=config.layer_norm_eps, name=\"full_layer_layer_norm\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        ffn_output = self.ffn(attention_outputs[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_outputs[0])\n\n        # add attentions if we output them\n        outputs = (hidden_states,) + attention_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayerGroup(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = [\n            TFAlbertLayer(config, name=\"albert_layers_._{}\".format(i)) for i in range(config.inner_group_num)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer([hidden_states, attention_mask, head_mask[layer_index]], training=training)\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        # last-layer hidden state, (layer hidden states), (layer attentions)\n        return outputs\n\n\nclass TFAlbertTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            
name=\"embedding_hidden_mapping_in\",\n        )\n        self.albert_layer_groups = [\n            TFAlbertLayerGroup(config, name=\"albert_layer_groups_._{}\".format(i))\n            for i in range(config.num_hidden_groups)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                [\n                    hidden_states,\n                    attention_mask,\n                    head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n                ],\n                training=training,\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n\n        # last-layer hidden state, (all hidden states), (all attentions)\n        return outputs\n\n\nclass TFAlbertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n\nclass TFAlbertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        self.dense = tf.keras.layers.Dense(\n            config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        self.decoder_bias = self.add_weight(\n            shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"decoder/bias\"\n        )\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states, mode=\"linear\") + 
self.decoder_bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFAlbertMainLayer(tf.keras.layers.Layer):\n    config_class = AlbertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFAlbertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFAlbertTransformer(config, name=\"encoder\")\n        self.pooler = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"pooler\",\n        )\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # 
Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output[:, 0])\n\n        # add hidden_states and attentions if they are here\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]\n        # sequence_output, pooled_output, (hidden_states), (attentions)\n        return outputs\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:\n        https://arxiv.org/abs/1909.11942\n\n    .. _`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Albert Model transformer outputing raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertModel(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n        Returns:\n            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n            last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n                Sequence of hidden-states at the output of the last layer of the model.\n            pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Albert pretraining. 
This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n                tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            import tensorflow as tf\n            from transformers1 import AlbertTokenizer, TFAlbertModel\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = TFAlbertModel.from_pretrained('albert-base-v2')\n            input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n            outputs = model(input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top for pre-training:\n    a `masked language modeling` head and a `sentence order prediction` (classification) head. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForPreTraining(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n        self.sop_classifier = TFAlbertSOPHead(config, name=\"sop_classifier\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n    Examples::\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForPreTraining\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, sop_scores = outputs[:2]\n        \"\"\"\n\n        outputs = self.albert(inputs, **kwargs)\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output, training=kwargs.get(\"training\", False))\n        outputs = (prediction_scores, sop_scores) + outputs[2:]\n        return outputs\n\n\nclass TFAlbertSOPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\",\n        )\n\n    def call(self, pooled_output, training: bool):\n        dropout_pooled_output = self.dropout(pooled_output, training=training)\n        logits = 
self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\"\"\"Albert Model with a `language modeling` head on top. \"\"\", ALBERT_START_DOCSTRING)\nclass TFAlbertForMaskedLM(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMaskedLM\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.predictions(sequence_output, training=kwargs.get(\"training\", False))\n\n        # Add hidden states and attention if they are here\n        outputs = (prediction_scores,) + outputs[2:]\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForSequenceClassification(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`)\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForSequenceClassification\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForQuestionAnswering\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForMultipleChoice(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMultipleChoice\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMultipleChoice.from_pretrained('albert-base-v2')\n\n        example1 = [\"This is a context\", \"Is it a context? Yes\"]\n        example2 = [\"This is a context\", \"Is it a context? 
No\"]\n        encoding = tokenizer.batch_encode_plus([example1, example2], return_tensors='tf', truncation_strategy=\"only_first\", pad_to_max_length=True, max_length=128)\n        outputs = model(encoding[\"input_ids\"][None, :])\n        logits = outputs[0]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            print(\"isdict(1)\")\n            input_ids = inputs.get(\"input_ids\")\n            print(input_ids)\n\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.albert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLNetConfig,\n)\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_tf_albert import (\n    TFAlbertForMaskedLM,\n    TFAlbertForMultipleChoice,\n    TFAlbertForPreTraining,\n    TFAlbertForQuestionAnswering,\n    TFAlbertForSequenceClassification,\n    TFAlbertModel,\n)\nfrom .modeling_tf_bert import (\n    TFBertForMaskedLM,\n    TFBertForMultipleChoice,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFBertForTokenClassification,\n    TFBertModel,\n)\nfrom .modeling_tf_ctrl import TFCTRLLMHeadModel, TFCTRLModel\nfrom .modeling_tf_distilbert import (\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFDistilBertForSequenceClassification,\n    TFDistilBertForTokenClassification,\n    TFDistilBertModel,\n)\nfrom .modeling_tf_gpt2 import TFGPT2LMHeadModel, TFGPT2Model\nfrom .modeling_tf_openai import TFOpenAIGPTLMHeadModel, TFOpenAIGPTModel\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForQuestionAnswering,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\nfrom .modeling_tf_t5 import TFT5ForConditionalGeneration, TFT5Model\nfrom .modeling_tf_transfo_xl import TFTransfoXLLMHeadModel, TFTransfoXLModel\nfrom .modeling_tf_xlm import (\n    TFXLMForQuestionAnsweringSimple,\n    TFXLMForSequenceClassification,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n)\nfrom .modeling_tf_xlnet import (\n    TFXLNetForQuestionAnsweringSimple,\n    TFXLNetForSequenceClassification,\n    TFXLNetForTokenClassification,\n    TFXLNetLMHeadModel,\n    TFXLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_MODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5Model),\n        (DistilBertConfig, TFDistilBertModel),\n        (AlbertConfig, TFAlbertModel),\n        (RobertaConfig, TFRobertaModel),\n        (BertConfig, TFBertModel),\n        (OpenAIGPTConfig, TFOpenAIGPTModel),\n        (GPT2Config, TFGPT2Model),\n        (TransfoXLConfig, TFTransfoXLModel),\n        (XLNetConfig, TFXLNetModel),\n        (XLMConfig, TFXLMModel),\n        (CTRLConfig, TFCTRLModel),\n    ]\n)\n\nTF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForPreTraining),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForPreTraining),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n     
   (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForMaskedLM),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForMaskedLM),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n        (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForSequenceClassification),\n        (AlbertConfig, TFAlbertForSequenceClassification),\n        (RobertaConfig, TFRobertaForSequenceClassification),\n        (BertConfig, TFBertForSequenceClassification),\n        (XLNetConfig, TFXLNetForSequenceClassification),\n        (XLMConfig, TFXLMForSequenceClassification),\n    ]\n)\n\nTF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [(BertConfig, TFBertForMultipleChoice), (AlbertConfig, TFAlbertForMultipleChoice)]\n)\n\nTF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForQuestionAnswering),\n        (AlbertConfig, TFAlbertForQuestionAnswering),\n        (RobertaConfig, TFRobertaForQuestionAnswering),\n        (BertConfig, TFBertForQuestionAnswering),\n        (XLNetConfig, TFXLNetForQuestionAnsweringSimple),\n        (XLMConfig, TFXLMForQuestionAnsweringSimple),\n    ]\n)\n\nTF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForTokenClassification),\n        (RobertaConfig, TFRobertaForTokenClassification),\n        (BertConfig, TFBertForTokenClassification),\n        (XLNetConfig, TFXLNetForTokenClassification),\n    ]\n)\n\n\nclass TFAutoModel(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModel is designed to be instantiated \"\n            \"using the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: TFDistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: TFRobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: TFBertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: TFOpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: TFGPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: TFCTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TFTransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: TFXLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: TFXLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. 
Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModel.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForPreTraining(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForPreTraining is designed to be instantiated \"\n            \"using the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers1.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.TFDistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.TFRobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.TFGPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.TFCTRLLMHeadModel` (Salesforce CTRL model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model- from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.TFT5ForConditionalGeneration` (T5 model)\n            - `distilbert`: :class:`~transformers1.TFDistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.TFAlbertForPreTraining` (ALBERT model)\n            - `roberta`: :class:`~transformers1.TFRobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.TFGPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet 
model)\n            - `xlm`: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.TFCTRLLMHeadModel` (Salesforce CTRL model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model.\n                (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or\n                automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the\n                  underlying model's ``__init__`` method (we assume all relevant updates to the configuration have\n                  already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class\n                  initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of\n                  ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute\n                  with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration\n                  attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelWithLMHead(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelWithLMHead is designed to be instantiated \"\n            \"using the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: OpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: GPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: CTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights 
saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelWithLMHead.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFAlbertForMultipleChoice (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `albert` configuration class: AlbertModel (Albert model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForMultipleChoice.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the multiple choice model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFAlbertForMultipleChoice (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). 
In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForMultipleChoice.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForMultipleChoice.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForSequenceClassification(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the 
base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a 
`PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForSequenceClassification.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForQuestionAnswering(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def 
from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `albert` configuration class: AlbertModel (ALBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path 
to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from the saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received files. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it has been loaded) and instantiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForTokenClassification:\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBert model)\n                    - isInstance of `roberta` configuration class: RobertaModel (Roberta model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the token classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `bert`: BertForTokenClassification (Bert model)\n            - `xlnet`: XLNetForTokenClassification (XLNet model)\n            - `distilbert`: DistilBertForTokenClassification (DistilBert model)\n            - `roberta`: RobertaForTokenClassification (Roberta model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint into a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from the saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it has been loaded) and instantiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. 
Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 BERT model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_bert import BertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n    \"gelu_new\": tf.keras.layers.Activation(gelu_new),\n}\n\n\nclass TFBertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.hidden_size = config.hidden_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.hidden_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = 
self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFBertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = tf.matmul(\n            query_layer, key_layer, transpose_b=True\n        )  # (batch size, num_heads, seq_len_q, seq_len_k)\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)  # scale attention_scores\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() 
function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFBertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.self_attention = TFBertSelfAttention(config, name=\"self\")\n        self.dense_output = TFBertSelfOutput(config, name=\"output\")\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        self_outputs = self.self_attention([input_tensor, attention_mask, head_mask], training=training)\n        attention_output = self.dense_output([self_outputs[0], input_tensor], training=training)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertIntermediate(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass TFBertOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n      
  self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFBertAttention(config, name=\"attention\")\n        self.intermediate = TFBertIntermediate(config, name=\"intermediate\")\n        self.bert_output = TFBertOutput(config, name=\"output\")\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        attention_output = attention_outputs[0]\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.bert_output([intermediate_output, attention_output], training=training)\n        outputs = (layer_output,) + attention_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertEncoder(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = [TFBertLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.num_hidden_layers)]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # outputs, (hidden states), (attentions)\n\n\nclass TFBertPooler(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n\n    def call(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        return pooled_output\n\n\nclass TFBertPredictionHeadTransform(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass TFBertLMPredictionHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.transform = TFBertPredictionHeadTransform(config, name=\"transform\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\nclass TFBertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.predictions = TFBertLMPredictionHead(config, input_embeddings, name=\"predictions\")\n\n    def call(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass TFBertNSPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.seq_relationship = tf.keras.layers.Dense(\n            2, kernel_initializer=get_initializer(config.initializer_range), name=\"seq_relationship\"\n        )\n\n    def call(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\n@keras_serializable\nclass TFBertMainLayer(tf.keras.layers.Layer):\n    config_class = BertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFBertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.pooler = TFBertPooler(config, name=\"pooler\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        
training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, 
head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\nclass TFBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    base_model_prefix = \"bert\"\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertModel(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. 
This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertModel\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertModel.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training:\n    a `masked language modeling` head and a `next sentence prediction (classification)` head. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForPreTraining(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForPreTraining\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForPreTraining.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. 
\"\"\", BERT_START_DOCSTRING)\nclass TFBertForMaskedLM(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMaskedLM\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMaskedLM.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass TFBertForNextSentencePrediction(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        seq_relationship_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`)\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForNextSentencePrediction\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='tf')\n\n        logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]\n        assert logits[0][0] < logits[0][1] # the next sentence was random\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForSequenceClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForSequenceClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForMultipleChoice(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMultipleChoice\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='tf', pad_to_max_length=True)\n\n        # linear classifier on the output is not yet trained\n        outputs = model(encoding['input_ids'][None, :])\n        logits = outputs[0]\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = 
inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.bert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForTokenClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForTokenClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForTokenClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForQuestionAnswering(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForQuestionAnswering\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :], token_type_ids=tf.constant(token_type_ids)[None, :])\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(tf.squeeze(start_scores)) : tf.math.argmax(tf.squeeze(end_scores))+1])\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CamemBERT model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model_size))\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(np.arange(position)[:, np.newaxis], np.arange(d_model_size)[np.newaxis, :], d_model_size)\n\n    sines = np.sin(angle_rads[:, 0::2])\n    cosines = np.cos(angle_rads[:, 1::2])\n\n    # pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)\n    pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = tf.matmul(q, k, transpose_b=True)\n\n    dk = tf.cast(shape_list(k)[-1], tf.float32)\n    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n\n    if mask is not None:\n        scaled_attention_logits += mask * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = tf.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n    def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = tf.keras.layers.Dense(d_model_size, name=\"Wq\")\n        self.Wk = tf.keras.layers.Dense(d_model_size, name=\"Wk\")\n        self.Wv = tf.keras.layers.Dense(d_model_size, name=\"Wv\")\n\n        self.dense = tf.keras.layers.Dense(d_model_size, name=\"dense\")\n\n    def split_into_heads(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n        return 
tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        v, k, q, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        batch_size = shape_list(q)[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            k = tf.concat((past_key, k), axis=-2)\n            v = tf.concat((past_value, v), axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack((k, v), axis=0)\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model_size))\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff, name=\"\"):\n    return tf.keras.Sequential(\n        [tf.keras.layers.Dense(dff, activation=\"relu\", name=\"0\"), tf.keras.layers.Dense(d_model_size, name=\"2\")],\n        name=\"ffn\",\n    )\n\n\nclass TFEncoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.multi_head_attention = TFMultiHeadAttention(\n            d_model_size, num_heads, output_attentions, name=\"multi_head_attention\"\n        )\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff, name=\"ffn\")\n\n        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm1\")\n        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm2\")\n\n        self.dropout1 = tf.keras.layers.Dropout(rate)\n        self.dropout2 = tf.keras.layers.Dropout(rate)\n\n    def call(self, inputs, training=False):\n        x, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            [normed, normed, normed, mask, layer_past, attention_mask, head_mask, use_cache], training=training\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output, training=training)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output, training=training)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\n@keras_serializable\nclass TFCTRLMainLayer(tf.keras.layers.Layer):\n    config_class = CTRLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        
self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)\n\n        self.w = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"w\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [\n            TFEncoderLayer(\n                config.n_embd,\n                config.n_head,\n                config.dff,\n                config.resid_pdrop,\n                config.layer_norm_epsilon,\n                config.output_attentions,\n                name=\"h_._{}\".format(i),\n            )\n            for i in range(config.n_layer)\n        ]\n        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"layernorm\")\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # If using past key value states, only the last tokens\n        # should be given as an input\n        if past is not None:\n            if input_ids is not None:\n                input_ids = input_ids[:, -1:]\n            if inputs_embeds is not None:\n                inputs_embeds = inputs_embeds[:, -1:]\n            if token_type_ids is not None:\n                token_type_ids = token_type_ids[:, -1:]\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids 
and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n            position_ids = tf.tile(position_ids, [input_shape[0], 1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_layers\n\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.w(token_type_ids, mode=\"embedding\")\n            token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n        else:\n            token_type_embeds = 0\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids, mode=\"embedding\")\n        seq_len = input_shape[-1]\n        mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)\n\n        inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n\n        pos_embeds = tf.gather(self.pos_encoding, position_ids)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(tf.reshape(hidden_states, output_shape),)\n            outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n            hidden_states, present = outputs[:2]\n\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\nclass TFCTRLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should 
be passed as input_ids (see `past`).\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). 
Defaults to `True`.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLModel.from_pretrained('ctrl')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFCTRLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def 
call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLLMHeadModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n        self.lm_head = TFCTRLLMHead(config, self.transformer.w, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = tf.constant([tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)])\n        outputs = model(input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 DistilBERT model\n\"\"\"\n\n\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFEmbeddings(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dim = config.dim\n        self.initializer_range = config.initializer_range\n        self.word_embeddings = TFSharedEmbeddings(\n            config.vocab_size, config.dim, initializer_range=config.initializer_range, name=\"word_embeddings\"\n        )  # padding_idx=0)\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.dim,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"position_embeddings\",\n        )\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create 
and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\", shape=[self.vocab_size, self.dim], initializer=get_initializer(self.initializer_range)\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, inputs_embeds=None, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, inputs_embeds=inputs_embeds, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, inputs_embeds=None, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: tf.Tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: tf.Tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        if not isinstance(inputs, (tuple, list)):\n            input_ids = inputs\n            position_ids = None\n        else:\n            input_ids, position_ids = inputs\n\n        if input_ids is not None:\n            seq_length = shape_list(input_ids)[1]\n        else:\n            seq_length = shape_list(inputs_embeds)[1]\n\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = inputs_embeds + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings, training=training)  # (bs, max_seq_length, dim)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.dim])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFMultiHeadSelfAttention(tf.keras.layers.Layer):\n    
def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"q_lin\"\n        )\n        self.k_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"k_lin\"\n        )\n        self.v_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"v_lin\"\n        )\n        self.out_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"out_lin\"\n        )\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        query: tf.Tensor(bs, seq_length, dim)\n        key: tf.Tensor(bs, seq_length, dim)\n        value: tf.Tensor(bs, seq_length, dim)\n        mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: tf.Tensor(bs, seq_length, dim)\n            Contextualized layer. Optional: only if `output_attentions=True`\n        \"\"\"\n        query, key, value, mask, head_mask = inputs\n        bs, q_length, dim = shape_list(query)\n        k_length = shape_list(key)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshape = [bs, 1, 1, k_length]\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, q_length, k_length)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))            # (bs, n_heads, q_length, k_length)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, 
weights)\n        else:\n            return (context,)\n\n\nclass TFFFN(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.lin1 = tf.keras.layers.Dense(\n            config.hidden_dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin1\"\n        )\n        self.lin2 = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin2\"\n        )\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = (\n            tf.keras.layers.Activation(gelu) if config.activation == \"gelu\" else tf.keras.activations.relu\n        )\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFTransformerBlock(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.hidden_dim = config.hidden_dim\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.activation = config.activation\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = TFMultiHeadSelfAttention(config, name=\"attention\")\n        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"sa_layer_norm\")\n\n        self.ffn = TFFFN(config, name=\"ffn\")\n        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"output_layer_norm\")\n\n    def call(self, inputs, training=False):  # removed: src_enc=None, src_len=None\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n        attn_mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: tf.Tensor(bs, seq_length, dim)\n            The output of the transformer block contextualization.\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        # Self-Attention\n        sa_output = self.attention([x, x, x, attn_mask, head_mask], training=training)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            # assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output, training=training)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass TFTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        
self.output_hidden_states = config.output_hidden_states\n\n        self.layer = [TFTransformerBlock(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layers)]\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: tf.Tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: tf.Tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[tf.Tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[tf.Tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module([hidden_state, attn_mask, head_mask[i]], training=training)\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass TFDistilBertMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFEmbeddings(config, name=\"embeddings\")  # Embeddings\n        self.transformer = TFTransformer(config, name=\"transformer\")  # Encoder\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def call(self, inputs, attention_mask=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = 
inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.ones(input_shape)  # (bs, seq_length)\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n\n        embedding_output = self.embeddings(input_ids, inputs_embeds=inputs_embeds)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer([embedding_output, attention_mask, head_mask], training=training)\n\n        return tfmr_output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass TFDistilBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    base_model_prefix = \"distilbert\"\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputting raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertModel(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")  # Embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertModel\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertModel.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n        return outputs\n\n\nclass TFDistilBertLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, 
input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.vocab_size = config.vocab_size\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.vocab_transform = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"vocab_transform\"\n        )\n        self.act = tf.keras.layers.Activation(gelu)\n        self.vocab_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"vocab_layer_norm\")\n        self.vocab_projector = TFDistilBertLMHead(config, self.distilbert.embeddings, name=\"vocab_projector\")\n\n    def get_output_embeddings(self):\n        return self.vocab_projector.input_embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForMaskedLM\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n    
    prediction_scores = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)\n\n        outputs = (prediction_logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.pre_classifier = tf.keras.layers.Dense(\n            config.dim,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"relu\",\n            name=\"pre_classifier\",\n        )\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForSequenceClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        
hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForTokenClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive 
question-answering tasks like SQuAD (a linear layer on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n        assert config.num_labels == 2\n        self.dropout = tf.keras.layers.Dropout(config.qa_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForQuestionAnswering\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n        hidden_states = self.dropout(hidden_states, training=kwargs.get(\"training\", False))  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_electra.py",
    "content": "import logging\n\nimport tensorflow as tf\n\nfrom transformers import ElectraConfig\n\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel\nfrom .modeling_tf_utils import get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\nclass TFElectraEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.embedding_size = config.embedding_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.embedding_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFElectraDiscriminatorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dense = tf.keras.layers.Dense(config.hidden_size, name=\"dense\")\n        self.dense_prediction = tf.keras.layers.Dense(1, name=\"dense_prediction\")\n        self.config = config\n\n    def call(self, 
discriminator_hidden_states, training=False):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = ACT2FN[self.config.hidden_act](hidden_states)\n        logits = tf.squeeze(self.dense_prediction(hidden_states))\n\n        return logits\n\n\nclass TFElectraGeneratorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dense = tf.keras.layers.Dense(config.embedding_size, name=\"dense\")\n\n    def call(self, generator_hidden_states, training=False):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = ACT2FN[\"gelu\"](hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass TFElectraPreTrainedModel(TFBertPreTrainedModel):\n\n    config_class = ElectraConfig\n    base_model_prefix = \"electra\"\n\n    def get_extended_attention_mask(self, attention_mask, input_shape):\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask):\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.config.num_hidden_layers\n\n        return head_mask\n\n\nclass TFElectraMainLayer(TFElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFElectraEmbeddings(config, name=\"embeddings\")\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name=\"embeddings_project\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.config = config\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        
position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)\n        head_mask = self.get_head_mask(head_mask)\n\n        hidden_states = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states, training=training)\n\n        hidden_states = self.encoder([hidden_states, extended_attention_mask, head_mask], training=training)\n\n        return hidden_states\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraModel(TFElectraPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraModel\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraModel.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, 
:]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.electra(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a binary classification head on top as used during pre-training for identifying generated\ntokens.\n\nEven though both the discriminator and generator may be loaded into this model, the discriminator is\nthe only model of the two to have the correct classification head to be used for this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForPreTraining(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name=\"discriminator_predictions\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForPreTraining\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        logits = self.discriminator_predictions(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\nclass 
TFElectraMaskedLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states, training=False):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a language modeling head on top.\n\nEven though both the discriminator and generator may be loaded into this model, the generator is\nthe only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForMaskedLM(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.vocab_size = config.vocab_size\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.generator_predictions = TFElectraGeneratorPredictions(config, name=\"generator_predictions\")\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n        self.generator_lm_head = TFElectraMaskedLMHead(config, self.electra.embeddings, name=\"generator_lm_head\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForMaskedLM\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n        model = 
TFElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        generator_sequence_output = generator_hidden_states[0]\n        prediction_scores = self.generator_predictions(generator_sequence_output, training=training)\n        prediction_scores = self.generator_lm_head(prediction_scores, training=training)\n        output = (prediction_scores,)\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a token classification head on top.\n\nBoth the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForTokenClassification(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(config.num_labels, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForTokenClassification\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, 
position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Flaubert model.\n\"\"\"\n\nimport logging\nimport random\n\nimport tensorflow as tf\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_xlm import (\n    TFXLMForSequenceClassification,\n    TFXLMMainLayer,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n    get_masks,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertModel(TFXLMModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\nclass TFFlaubertMainLayer(TFXLMMainLayer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and 
inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n     
   attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](\n                    [tensor_normalized, attn_mask, None, cache, head_mask[i]], training=training\n                )\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertWithLMHeadModel(TFXLMWithLMHeadModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertForSequenceClassification(TFXLMForSequenceClassification):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT-2 model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n 
           w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            key = tf.concat([past_key, key], axis=-2)\n            value = tf.concat([past_value, value], axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack([key, value], axis=0)\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.attn = 
TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        a = self.ln_1(x)\n        output_attn = self.attn([a, layer_past, attention_mask, head_mask, use_cache], training=training)\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n        x = x + a\n\n        m = self.ln_2(x)\n        m = self.mlp(m, training=training)\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\n@keras_serializable\nclass TFGPT2MainLayer(tf.keras.layers.Layer):\n    config_class = GPT2Config\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.wte = TFSharedEmbeddings(\n            config.vocab_size, config.hidden_size, initializer_range=config.initializer_range, name=\"wte\"\n        )\n        self.wpe = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"wpe\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n        self.ln_f = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_f\")\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", 
position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids, mode=\"embedding\")\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.wte(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + 
[shape_list(hidden_states)[-1]]\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n\n            hidden_states, present = outputs[:2]\n            presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, presents, (all hidden_states), (attentions)\n\n\nclass TFGPT2PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    base_model_prefix = \"transformer\"\n\n\nGPT2_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputing raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2Model(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2Model\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2Model.from_pretrained('gpt2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n    \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2LMHeadModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2DoubleHeadsModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        use_cache=True,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = tf.constant(encoded_choices)[None, :]  # Batch size: 1, number of choices: 2\n        mc_token_ids = tf.constant([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            mc_token_ids = inputs[7] if len(inputs) > 7 else mc_token_ids\n            use_cache = inputs[8] if len(inputs) > 8 else use_cache\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            past,\n            flat_attention_mask,\n            flat_token_type_ids,\n            
flat_position_ids,\n            head_mask,\n            inputs_embeds,\n            use_cache,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.math.sigmoid(x)\n\n\nACT_FNS = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = 
tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n            w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.attn = TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        output_attn = self.attn([x, attention_mask, head_mask], training=training)\n        a = output_attn[0]  # output_attn: a, (attentions)\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n, training=training)\n        h = self.ln_2(n + m)\n\n        outputs = 
[h] + output_attn[1:]\n        return outputs  # x, (attentions)\n\n\nclass TFOpenAIGPTMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.tokens_embed = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"tokens_embed\"\n        )\n        self.positions_embed = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"positions_embed\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            position_ids = tf.range(input_shape[-1], dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor 
mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids, mode=\"embedding\")\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.tokens_embed(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n\n        all_attentions = []\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions.append(outputs[1])\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last hidden state, (all hidden_states), (attentions)\n\n\nclass TFOpenAIGPTPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    base_model_prefix = \"transformer\"\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputing raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", 
add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTLMHeadModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTLMHeadModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTDoubleHeadsModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTDoubleHeadsModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = tf.constant([tokenizer.encode(s) for s in choices])[None, :]  # Batch size 1, 2 choices\n        mc_token_ids = tf.constant([input_ids.size(-1), input_ids.size(-1)])[None, :]  # Batch size 1\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            mc_token_ids = inputs[6] if len(inputs) > 6 else mc_token_ids\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = 
self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_pytorch_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch - TF 2.0 general utilities.\"\"\"\n\n\nimport logging\nimport os\nimport re\n\nimport numpy\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=\"\"):\n    \"\"\" Convert a TF 2.0 model variable name in a pytorch model weight name.\n\n        Conventions for TF2.0 scopes -> PyTorch attribute names conversions:\n            - '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n            - '_._' is replaced by a new level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n\n        return tuple with:\n            - pytorch model weight name\n            - transpose: boolean indicating weither TF2.0 and PyTorch weights matrices are transposed with regards to each other\n    \"\"\"\n    tf_name = tf_name.replace(\":0\", \"\")  # device ids\n    tf_name = re.sub(\n        r\"/[^/]*___([^/]*)/\", r\"/\\1/\", tf_name\n    )  # '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n    tf_name = tf_name.replace(\n        \"_._\", \"/\"\n    )  # '_._' is replaced by a level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n    tf_name = re.sub(r\"//+\", \"/\", tf_name)  # Remove empty levels at the end\n    tf_name = tf_name.split(\"/\")  # Convert from TF2.0 '/' separators to PyTorch '.' separators\n    tf_name = tf_name[1:]  # Remove level zero\n\n    # When should we transpose the weights\n    transpose = bool(tf_name[-1] == \"kernel\" or \"emb_projs\" in tf_name or \"out_projs\" in tf_name)\n\n    # Convert standard TF2.0 names in PyTorch names\n    if tf_name[-1] == \"kernel\" or tf_name[-1] == \"embeddings\" or tf_name[-1] == \"gamma\":\n        tf_name[-1] = \"weight\"\n    if tf_name[-1] == \"beta\":\n        tf_name[-1] = \"bias\"\n\n    # Remove prefix if needed\n    tf_name = \".\".join(tf_name)\n    if start_prefix_to_remove:\n        tf_name = tf_name.replace(start_prefix_to_remove, \"\", 1)\n\n    return tf_name, transpose\n\n\n#####################\n# PyTorch => TF 2.0 #\n#####################\n\n\ndef load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    pt_path = os.path.abspath(pytorch_checkpoint_path)\n    logger.info(\"Loading PyTorch weights from {}\".format(pt_path))\n\n    pt_state_dict = torch.load(pt_path, map_location=\"cpu\")\n    logger.info(\"PyTorch checkpoint contains {:,} parameters\".format(sum(t.numel() for t in pt_state_dict.values())))\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    pt_state_dict = pt_model.state_dict()\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch state_dict in a TF 2.0 model.\n    \"\"\"\n    try:\n        import torch  # noqa: F401\n        import tensorflow as tf  # noqa: F401\n        from tensorflow.python.keras import backend as K\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    # Adapt state dict - TODO remove this and update the AWS weights files instead\n    # Convert old format to new format if needed from a PyTorch state_dict\n    old_keys = []\n    new_keys = []\n    for key in pt_state_dict.keys():\n        new_key = None\n        if \"gamma\" in key:\n            new_key = key.replace(\"gamma\", \"weight\")\n        if \"beta\" in key:\n            new_key = key.replace(\"beta\", \"bias\")\n        if new_key:\n            old_keys.append(key)\n            new_keys.append(new_key)\n    for old_key, new_key in zip(old_keys, new_keys):\n        pt_state_dict[new_key] = pt_state_dict.pop(old_key)\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(tf_model.base_model_prefix) for s in pt_state_dict.keys()):\n        start_prefix_to_remove = tf_model.base_model_prefix + \".\"\n\n    symbolic_weights = tf_model.trainable_weights + tf_model.non_trainable_weights\n    tf_loaded_numel = 0\n    weight_value_tuples = []\n    all_pytorch_weights = set(list(pt_state_dict.keys()))\n    for symbolic_weight in symbolic_weights:\n        sw_name = symbolic_weight.name\n        name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            sw_name, start_prefix_to_remove=start_prefix_to_remove\n        )\n\n        # Find associated numpy array in pytorch model state dict\n        if name not in pt_state_dict:\n            if allow_missing_keys:\n                continue\n\n            raise AttributeError(\"{} not found in PyTorch model\".format(name))\n\n        array = pt_state_dict[name].numpy()\n\n        if 
transpose:\n            array = numpy.transpose(array)\n\n        if len(symbolic_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(symbolic_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(symbolic_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (symbolic_weight.shape, array.shape)\n            raise e\n\n        tf_loaded_numel += array.size\n        # logger.warning(\"Initialize TF weight {}\".format(symbolic_weight.name))\n\n        weight_value_tuples.append((symbolic_weight, array))\n        all_pytorch_weights.discard(name)\n\n    K.batch_set_value(weight_value_tuples)\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure restore ops are run\n\n    logger.info(\"Loaded {:,} parameters in the TF 2.0 model.\".format(tf_loaded_numel))\n\n    logger.info(\"Weights or buffers not loaded from PyTorch model: {}\".format(all_pytorch_weights))\n\n    return tf_model\n\n\n#####################\n# TF 2.0 => PyTorch #\n#####################\n\n\ndef load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 HDF5 checkpoint in a PyTorch model\n        We use HDF5 to easily do transfer learning\n        (see https://github.com/tensorflow/tensorflow/blob/ee16fcac960ae660e0e4496658a366e2f745e1f0/tensorflow/python/keras/engine/network.py#L1352-L1357).\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    import transformers\n\n    logger.info(\"Loading TensorFlow weights from {}\".format(tf_checkpoint_path))\n\n    # Instantiate and load the associated TF 2.0 model\n    tf_model_class_name = \"TF\" + pt_model.__class__.__name__  # Add \"TF\" at the beggining\n    tf_model_class = getattr(transformers, tf_model_class_name)\n    tf_model = tf_model_class(pt_model.config)\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    tf_model.load_weights(tf_checkpoint_path, by_name=True)\n\n    return load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 model in a pytorch model\n    \"\"\"\n    weights = tf_model.weights\n\n    return load_tf2_weights_in_pytorch_model(pt_model, weights, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_weights_in_pytorch_model(pt_model, tf_weights, allow_missing_keys=False):\n    \"\"\" Load TF2.0 symbolic weights in a PyTorch model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    new_pt_params_dict = {}\n    current_pt_params_dict = dict(pt_model.named_parameters())\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(pt_model.base_model_prefix) for s in current_pt_params_dict.keys()):\n        start_prefix_to_remove = pt_model.base_model_prefix + \".\"\n\n    # Build a map from potential PyTorch weight names to TF 2.0 Variables\n    tf_weights_map = {}\n    for tf_weight in tf_weights:\n        pt_name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            tf_weight.name, start_prefix_to_remove=start_prefix_to_remove\n        )\n        tf_weights_map[pt_name] = (tf_weight.numpy(), transpose)\n\n    all_tf_weights = set(list(tf_weights_map.keys()))\n    loaded_pt_weights_data_ptr = {}\n    missing_keys_pt = []\n    for pt_weight_name, pt_weight in current_pt_params_dict.items():\n        # Handle PyTorch shared weight ()not duplicated in TF 2.0\n        if pt_weight.data_ptr() in loaded_pt_weights_data_ptr:\n            new_pt_params_dict[pt_weight_name] = loaded_pt_weights_data_ptr[pt_weight.data_ptr()]\n            continue\n\n        # Find associated numpy array in pytorch model state dict\n        if pt_weight_name not in tf_weights_map:\n            if allow_missing_keys:\n                missing_keys_pt.append(pt_weight_name)\n                continue\n\n            raise AttributeError(\"{} not found in TF 2.0 model\".format(pt_weight_name))\n\n        array, transpose = tf_weights_map[pt_weight_name]\n\n        if transpose:\n            array = numpy.transpose(array)\n\n        if len(pt_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(pt_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(pt_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (pt_weight.shape, array.shape)\n            raise e\n\n        # logger.warning(\"Initialize PyTorch weight {}\".format(pt_weight_name))\n\n        new_pt_params_dict[pt_weight_name] = torch.from_numpy(array)\n        loaded_pt_weights_data_ptr[pt_weight.data_ptr()] = torch.from_numpy(array)\n        all_tf_weights.discard(pt_weight_name)\n\n    missing_keys, unexpected_keys = pt_model.load_state_dict(new_pt_params_dict, strict=False)\n    missing_keys += missing_keys_pt\n\n    if len(missing_keys) > 0:\n        logger.info(\n            \"Weights of {} not initialized from TF 2.0 model: {}\".format(pt_model.__class__.__name__, missing_keys)\n        )\n    if len(unexpected_keys) > 0:\n        logger.info(\n            \"Weights from TF 2.0 model not used in {}: {}\".format(pt_model.__class__.__name__, unexpected_keys)\n        )\n\n    logger.info(\"Weights or buffers not loaded from TF 2.0 model: {}\".format(all_tf_weights))\n\n    return pt_model\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass TFRobertaEmbeddings(TFBertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.padding_idx = 1\n\n    def create_position_ids_from_input_ids(self, x):\n        \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n        padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n        `utils.make_positions`.\n        :param tf.Tensor x:\n        :return tf.Tensor:\n        \"\"\"\n        mask = tf.cast(tf.math.not_equal(x, self.padding_idx), dtype=tf.int32)\n        incremental_indicies = tf.math.cumsum(mask, axis=1) * mask\n        return incremental_indicies + self.padding_idx\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. We cannot infer which are padded so just generate\n        sequential position ids.\n        :param tf.Tensor inputs_embeds:\n        :return tf.Tensor:\n        \"\"\"\n        seq_length = shape_list(inputs_embeds)[1]\n\n        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]\n        return position_ids\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. 
Any padded tokens remain padded.\n                position_ids = self.create_position_ids_from_input_ids(input_ids)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super()._embedding([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n\nclass TFRobertaMainLayer(TFBertMainLayer):\n    \"\"\"\n    Same as TFBertMainLayer but uses TFRobertaEmbeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFRobertaEmbeddings(config, name=\"embeddings\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n\nclass TFRobertaPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputing raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaModel(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaModel\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaModel.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n        return outputs\n\n\nclass TFRobertaLMHead(tf.keras.layers.Layer):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.act = tf.keras.layers.Activation(gelu)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, features):\n        x = self.dense(features)\n        x = self.act(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x, mode=\"linear\") + self.bias\n\n        return x\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. 
\"\"\", ROBERTA_START_DOCSTRING)\nclass TFRobertaForMaskedLM(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.lm_head = TFRobertaLMHead(config, self.roberta.embeddings, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForMaskedLM\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\nclass TFRobertaClassificationHead(tf.keras.layers.Layer):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.out_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"out_proj\"\n        )\n\n    def call(self, features, training=False):\n        x = features[:, 0, :]  # take <s> token (equiv. 
to [CLS])\n        x = self.dropout(x, training=training)\n        x = self.dense(x)\n        x = self.dropout(x, training=training)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForSequenceClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.classifier = TFRobertaClassificationHead(config, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForSequenceClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (logits,) + outputs[2:]\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForTokenClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForTokenClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint roberta-base is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForQuestionAnswering\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForQuestionAnswering.from_pretrained('roberta-base')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
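The `TFRobertaModel` docstring above notes that the pooled `[CLS]` output is usually *not* a good summary of the input's semantic content and that averaging or pooling the per-token hidden states often works better. A minimal sketch of that alternative, mirroring the docstring example and assuming the vendored `transformers1` package and the public `roberta-base` checkpoint are available (both are assumptions for illustration):

```
import tensorflow as tf
from transformers1 import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')

# Encode a single sentence (batch size 1, no padding, as in the docstring example).
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
outputs = model(input_ids)

last_hidden_states = outputs[0]                                   # (1, seq_len, hidden_size)
sentence_embedding = tf.reduce_mean(last_hidden_states, axis=1)   # mean-pool over tokens -> (1, hidden_size)
```

With padded batches one would instead weight the mean by the attention mask so that padding positions do not dilute the average.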
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 T5 model. \"\"\"\n\n\nimport copy\nimport itertools\nimport logging\nimport math\n\nimport tensorflow as tf\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_T5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n####################################################\n# TF 2.0 Models are constructed using Keras imperative API by sub-classing\n# - tf.keras.layers.Layer for the layers and\n# - TFPreTrainedModel for the models (it-self a sub-class of tf.keras.Model)\n####################################################\n\n\nclass TFT5LayerNorm(tf.keras.layers.Layer):\n    def __init__(self, epsilon=1e-6, **kwargs):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__(**kwargs)\n        self.variance_epsilon = epsilon\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        self.weight = self.add_weight(\"weight\", shape=(input_shape[-1],), initializer=\"ones\")\n        super().build(input_shape)\n\n    def call(self, x):\n        variance = tf.math.reduce_mean(tf.math.square(x), axis=-1, keepdims=True)\n        x = x * tf.math.rsqrt(variance + self.variance_epsilon)\n        return self.weight * x\n\n\nclass TFT5DenseReluDense(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.wi = tf.keras.layers.Dense(config.d_ff, use_bias=False, name=\"wi\")\n        self.wo = tf.keras.layers.Dense(config.d_model, use_bias=False, name=\"wo\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n        self.act = tf.keras.activations.relu\n\n    def call(self, hidden_states, training=False):\n        h = self.wi(hidden_states)\n        h = self.act(h)\n        h = self.dropout(h, training=training)\n        h = self.wo(h)\n        return h\n\n\nclass TFT5LayerFF(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.DenseReluDense = TFT5DenseReluDense(config, name=\"DenseReluDense\")\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(self, hidden_states, training=False):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x, training=training)\n        layer_output = 
hidden_states + self.dropout(y, training=training)\n        return layer_output\n\n\nclass TFT5Attention(tf.keras.layers.Layer):\n    NEW_ID = itertools.count()\n\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_id = next(TFT5Attention.NEW_ID)\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"q\")\n        self.k = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"k\")\n        self.v = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"v\")\n        self.o = tf.keras.layers.Dense(self.d_model, use_bias=False, name=\"o\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = tf.keras.layers.Embedding(\n                self.relative_attention_num_buckets, self.n_heads, name=\"relative_attention_bias\",\n            )\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  
This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += tf.dtypes.cast(tf.math.less(n, 0), tf.int32) * num_buckets\n            n = tf.math.abs(n)\n        else:\n            n = tf.math.maximum(n, 0)\n        # now n is in the range [0, inf)\n        max_exact = num_buckets // 2\n        is_small = tf.math.less(n, max_exact)\n        val_if_large = max_exact + tf.dtypes.cast(\n            tf.math.log(tf.dtypes.cast(n, tf.float32) / max_exact)\n            / math.log(max_distance / max_exact)\n            * (num_buckets - max_exact),\n            tf.int32,\n        )\n        val_if_large = tf.math.minimum(val_if_large, num_buckets - 1)\n        ret += tf.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = tf.range(qlen)[:, None]\n        memory_position = tf.range(klen)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position, bidirectional=not self.is_decoder, num_buckets=self.relative_attention_num_buckets,\n        )\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = tf.expand_dims(tf.transpose(values, [2, 0, 1]), axis=0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def call(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        position_bias=None,\n        cache=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = shape_list(input)\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. 
Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + shape_list(past_key_value_state[0])[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = shape_list(kv)[1]\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, self.d_kv)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.inner_dim))\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = tf.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass TFT5LayerSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, 
**kwargs):\n        super().__init__(**kwargs)\n        self.SelfAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"SelfAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5LayerCrossAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.EncDecAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"EncDecAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            query_length=query_length,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5Block(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.is_decoder = config.is_decoder\n        self.layer = []\n        self.layer.append(\n            TFT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._0\",)\n        )\n        if self.is_decoder:\n            self.layer.append(\n                TFT5LayerCrossAttention(\n                    config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._1\",\n                )\n            )\n\n        self.layer.append(TFT5LayerFF(config, name=\"layer_._{}\".format(len(self.layer))))\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        
encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. 
Need to inject it here\n            if present_key_value_state is not None:\n                query_length = shape_list(present_key_value_state[0])[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n                training=training,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states, training=training)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass _NoLayerEmbedTokens(object):\n    \"\"\"\n     this class wraps a the TFSharedEmbeddingTokens layer into a python 'no-keras-layer'\n     class to avoid problem with weight restoring. Also it makes sure that the layer is\n     called from the correct scope to avoid problem with saving/storing the correct weights\n    \"\"\"\n\n    def __init__(self, layer, abs_scope_name=None):\n        self._layer = layer\n        self._abs_scope_name = abs_scope_name\n\n    def call(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer.call(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer.call(inputs, mode)\n\n    def __call__(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer(inputs, mode)\n\n\n####################################################\n# The full model without a specific pretrained or finetuning head is\n# provided as a tf.keras.layers.Layer usually called \"TFT5MainLayer\"\n####################################################\nclass TFT5MainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, embed_tokens=None, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.config = config\n        self.num_hidden_layers = config.num_layers\n\n        self.block = [\n            TFT5Block(config, has_relative_attention_bias=bool(i == 0), name=\"block_._{}\".format(i),)\n            for i in range(config.num_layers)\n        ]\n        self.final_layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"final_layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_embed_tokens(self, embed_tokens):\n        self.embed_tokens = embed_tokens\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError  # Not implemented yet in the library fr TF 2.0 models\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError  # Not implemented yet in the library fr TF 2.0 models\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if inputs is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both inputs and inputs_embeds at the same time\")\n        elif inputs is not None:\n            input_shape = shape_list(inputs)\n            inputs = tf.reshape(inputs, (-1, input_shape[-1]))\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either inputs or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to intialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(inputs)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_sates\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = shape_list(past_key_value_states[0][0])[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = tf.fill((batch_size, mask_seq_length), 1)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:\n            encoder_seq_length = shape_list(encoder_hidden_states)[1]\n            encoder_attention_mask = tf.fill((batch_size, encoder_seq_length), 1)\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n        num_dims_attention_mask = len(shape_list(attention_mask))\n        if 
num_dims_attention_mask == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif num_dims_attention_mask == 2:\n            # Provided a padding mask of dimensions [batch_size, mask_seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            if self.is_decoder:\n                seq_ids = tf.range(mask_seq_length)\n                causal_mask = tf.less_equal(\n                    tf.tile(seq_ids[None, None, :], (batch_size, mask_seq_length, 1)), seq_ids[None, :, None],\n                )\n                causal_mask = tf.cast(causal_mask, dtype=tf.float32)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n                if past_key_value_states[0] is not None:\n                    extended_attention_mask = extended_attention_mask[:, :, -1:, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n        # extended_attention_mask = tf.math.equal(extended_attention_mask,\n        #                                         tf.transpose(extended_attention_mask, perm=(-1, -2)))\n\n        extended_attention_mask = (1.0 - extended_attention_mask) * -1e9\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            # If a 2D ou 3D attention mask is provided for the cross-attention\n            # we need to make broadcastabe to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n            encoder_attention_mask = tf.cast(encoder_attention_mask, dtype=tf.float32)\n            num_dims_encoder_attention_mask = len(shape_list(encoder_attention_mask))\n            if num_dims_encoder_attention_mask == 3:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n            if num_dims_encoder_attention_mask == 2:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n\n            # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion\n            # Cf. 
https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n            # encoder_extended_attention_mask = tf.math.equal(encoder_extended_attention_mask,\n            #                                         tf.transpose(encoder_extended_attention_mask, perm=(-1, -2)))\n\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds, training=training)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n                training=training,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if 
use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n####################################################\n# TFT5PreTrainedModel is a sub-class of tf.keras.Model\n# which take care of loading and saving pretrained weights\n# and various common utilities.\n# Here you just need to specify a few (self-explanatory)\n# pointers for your model.\n####################################################\nclass TFT5PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        inputs = tf.constant(DUMMY_INPUTS)\n        input_mask = tf.constant(DUMMY_MASK)\n        dummy_inputs = {\n            \"inputs\": inputs,\n            \"decoder_input_ids\": inputs,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a tf.keras.Model `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. 
_`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    Note on the model inputs:\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is usefull when using `tf.keras.Model.fit()` method which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :\n\n        - a single Tensor with inputs only and nothing else: `model(inputs_ids)\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n            `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associaed to the input names given in the docstring:\n            `model({'inputs': inputs, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        inputs are usually used as a `dict` (see T5 description above for more information) containing all the following.\n\n        inputs (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on\n            the right or the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            To know more on how to prepare :obj:`inputs` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n        attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `inputs` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is 
**masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass TFT5Model(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5Model\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5Model.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my 
dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass TFT5ForConditionalGeneration(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.model_dim = config.d_model\n\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"tf\") 
 # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        prediction_scores = outputs[0]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        model.generate(inputs)\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0] * (self.model_dim ** -0.5)\n        embed_tokens = self.get_output_embeddings()\n        lm_logits = embed_tokens(sequence_output, mode=\"linear\")\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, inputs, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"inputs\": None,  # inputs don't have to be defined, but still need to be passed to make 
Keras.layer.__call__ happy\n            \"decoder_input_ids\": inputs,  # inputs are the decoder_input_ids\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (tf.gather(layer_past_state, beam_idx),)\n\n            assert shape_list(reordered_layer_past_states[0]) == shape_list(layer_past_states[0])\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Transformer XL model.\n\"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\nclass TFPositionalEmbedding(tf.keras.layers.Layer):\n    def __init__(self, demb, **kwargs):\n        super().__init__(**kwargs)\n\n        self.inv_freq = 1 / (10000 ** (tf.range(0, demb, 2.0) / demb))\n\n    def call(self, pos_seq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,j->ij\", pos_seq, self.inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], -1)\n\n        if bsz is not None:\n            return tf.tile(pos_emb[:, None, :], [1, bsz, 1])\n        else:\n            return pos_emb[:, None, :]\n\n\nclass TFPositionwiseFF(tf.keras.layers.Layer):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5, init_std=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.layer_1 = tf.keras.layers.Dense(\n            d_inner, kernel_initializer=get_initializer(init_std), activation=tf.nn.relu, name=\"CoreNet_._0\"\n        )\n        self.drop_1 = tf.keras.layers.Dropout(dropout)\n        self.layer_2 = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(init_std), name=\"CoreNet_._3\")\n        self.drop_2 = tf.keras.layers.Dropout(dropout)\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.pre_lnorm = pre_lnorm\n\n    def call(self, inp, training=False):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.layer_norm(inp)\n            core_out = self.layer_1(core_out)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = self.drop_2(core_out, training=training)\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.layer_1(inp)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = 
self.drop_2(core_out, training=training)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass TFRelPartialLearnableMultiHeadAttn(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = tf.keras.layers.Dense(\n            3 * n_head * d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"qkv_net\"\n        )\n\n        self.drop = tf.keras.layers.Dropout(dropout)\n        self.dropatt = tf.keras.layers.Dropout(dropatt)\n        self.o_net = tf.keras.layers.Dense(\n            d_model, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"o_net\"\n        )\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is not None and r_w_bias is not None:  # Biases are shared\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n        else:\n            self.r_r_bias = None\n            self.r_w_bias = None\n\n        self.r_net = tf.keras.layers.Dense(\n            self.n_head * self.d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"r_net\"\n        )\n\n    def build(self, input_shape):\n        if self.r_r_bias is None or self.r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n        super().build(input_shape)\n\n    def _rel_shift(self, x):\n        x_size = shape_list(x)\n\n        x = tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])\n        x = tf.reshape(x, [x_size[1] + 1, x_size[0], x_size[2], x_size[3]])\n        x = tf.slice(x, [1, 0, 0, 0], [-1, -1, -1, -1])\n        x = tf.reshape(x, x_size)\n\n        return x\n\n    def call(self, inputs, training=False):\n        w, r, attn_mask, mems, head_mask = inputs\n        qlen, rlen, bsz = shape_list(w)[0], shape_list(r)[0], shape_list(w)[1]\n\n        if mems is not None:\n            cat = tf.concat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, axis=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, 
axis=-1)\n\n        klen = shape_list(w_head_k)[0]\n\n        w_head_q = tf.reshape(w_head_q, (qlen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_k = tf.reshape(w_head_k, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_v = tf.reshape(w_head_v, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n\n        r_head_k = tf.reshape(r_head_k, (rlen, self.n_head, self.d_head))  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = tf.einsum(\"ibnd,jbnd->ijbn\", rw_head_q, w_head_k)  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = tf.einsum(\"ibnd,jnd->ijbn\", rr_head_q, r_head_k)  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score = attn_score * self.scale\n\n        # compute attention probability\n        if attn_mask is not None:\n            attn_mask_t = attn_mask[:, :, None, None]\n            attn_score = attn_score * (1 - attn_mask_t) - 1e30 * attn_mask_t\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n        attn_prob = self.dropatt(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, w_head_v)\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec_sizes = shape_list(attn_vec)\n        attn_vec = tf.reshape(attn_vec, (attn_vec_sizes[0], attn_vec_sizes[1], self.n_head * self.d_head))\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out, training=training)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass TFRelPartialLearnableDecoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        d_inner,\n        dropout,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        dropatt=0.0,\n        pre_lnorm=False,\n        r_w_bias=None,\n        r_r_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.dec_attn = TFRelPartialLearnableMultiHeadAttn(\n            n_head,\n            d_model,\n            d_head,\n            dropout,\n            tgt_len=tgt_len,\n            ext_len=ext_len,\n            mem_len=mem_len,\n            dropatt=dropatt,\n            pre_lnorm=pre_lnorm,\n            r_w_bias=r_w_bias,\n            r_r_bias=r_r_bias,\n            init_std=init_std,\n            output_attentions=output_attentions,\n            layer_norm_epsilon=layer_norm_epsilon,\n            name=\"dec_attn\",\n        )\n        self.pos_ff = TFPositionwiseFF(\n            d_model,\n            d_inner,\n            dropout,\n            pre_lnorm=pre_lnorm,\n            init_std=init_std,\n            layer_norm_epsilon=layer_norm_epsilon,\n            
name=\"pos_ff\",\n        )\n\n    def call(self, inputs, training=False):\n        dec_inp, r, dec_attn_mask, mems, head_mask = inputs\n        attn_outputs = self.dec_attn([dec_inp, r, dec_attn_mask, mems, head_mask], training=training)\n        ff_output = self.pos_ff(attn_outputs[0], training=training)\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass TFAdaptiveEmbedding(tf.keras.layers.Layer):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, init_std=0.02, sample_softmax=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.init_std = init_std\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = []\n        self.emb_projs = []\n        if div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(\n                    tf.keras.layers.Embedding(\n                        r_idx - l_idx,\n                        d_emb_i,\n                        embeddings_initializer=get_initializer(init_std),\n                        name=\"emb_layers_._{}\".format(i),\n                    )\n                )\n\n    def build(self, input_shape):\n        for i in range(len(self.cutoffs)):\n            d_emb_i = self.d_embed // (self.div_val ** i)\n            self.emb_projs.append(\n                self.add_weight(\n                    shape=(d_emb_i, self.d_proj),\n                    initializer=get_initializer(self.init_std),\n                    trainable=True,\n                    name=\"emb_projs_._{}\".format(i),\n                )\n            )\n        super().build(input_shape)\n\n    def call(self, inp):\n        if self.div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            inp_flat = tf.reshape(inp, (-1,))\n            emb_flat = tf.zeros([shape_list(inp_flat)[0], self.d_proj])\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n\n                inp_i = tf.boolean_mask(inp_flat, mask_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = tf.einsum(\"id,de->ie\", emb_i, self.emb_projs[i])\n\n                mask_idx = tf.cast(tf.where(mask_i), dtype=tf.int64)\n                emb_flat += tf.scatter_nd(mask_idx, emb_i, tf.cast(shape_list(emb_flat), dtype=tf.int64))\n\n            embed_shape = shape_list(inp) + [self.d_proj]\n            embed = tf.reshape(emb_flat, embed_shape)\n\n        embed *= self.emb_scale\n\n        return embed\n\n\n@keras_serializable\nclass TFTransfoXLMainLayer(tf.keras.layers.Layer):\n    config_class = TransfoXLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.untie_r = config.untie_r\n\n        self.word_emb = TFAdaptiveEmbedding(\n            config.vocab_size,\n            config.d_embed,\n            config.d_model,\n            config.cutoffs,\n            div_val=config.div_val,\n            init_std=config.init_std,\n            name=\"word_emb\",\n        )\n\n        self.drop = tf.keras.layers.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        self.layers = []\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    TFRelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if self.untie_r else self.r_w_bias,\n                        r_r_bias=None if self.untie_r else self.r_r_bias,\n                        output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                        init_std=config.init_std,\n                        name=\"layers_._{}\".format(i),\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = TFPositionalEmbedding(self.d_model, name=\"pos_emb\")\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n    def build(self, input_shape):\n        if not self.untie_r:\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n        super().build(input_shape)\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        return self.word_emb\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        raise NotImplementedError\n\n    def init_mems(self, bsz):\n        if self.mem_len 
> 0:\n            mems = []\n            for i in range(self.n_layer):\n                empty = tf.zeros([self.mem_len, bsz, self.d_model])\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        new_mems = []\n        end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n        beg_idx = max(0, end_idx - self.mem_len)\n        for i in range(len(hids)):\n\n            cat = tf.concat([mems[i], hids[i]], axis=0)\n            tf.stop_gradient(cat)\n            new_mems.append(cat[beg_idx:end_idx])\n\n        return new_mems\n\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = shape_list(mems[0])[0] if mems is not None else 0\n        
klen = mlen + qlen\n\n        attn_mask = tf.ones([qlen, qlen])\n        mask_u = tf.linalg.band_part(attn_mask, 0, -1)\n        mask_dia = tf.linalg.band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen])\n        dec_attn_mask = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.linalg.band_part(attn_mask, -1, 0)\n            dec_attn_mask = tf.concat([dec_attn_mask[:, :qlen] + mask_l - mask_dia, dec_attn_mask[:, qlen:]], 1)\n        # ::: PyTorch masking code for reference :::\n        # if self.same_length:\n        #     all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n        #     mask_len = klen - self.mem_len\n        #     if mask_len > 0:\n        #         mask_shift_len = qlen - mask_len\n        #     else:\n        #         mask_shift_len = qlen\n        #     dec_attn_mask = (torch.triu(all_ones, 1+mlen)\n        #             + torch.tril(all_ones, -mask_shift_len))[:, :, None] # -1\n        # else:\n        #     dec_attn_mask = torch.triu(\n        #         word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1+mlen)[:,:,None]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = tf.range(klen - 1, -1, -1.0)\n            if self.clamp_len > 0:\n                pos_seq = tf.minimum(pos_seq, self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb, training=training)\n            pos_emb = self.drop(pos_emb, training=training)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer([core_out, pos_emb, dec_attn_mask, mems_i, head_mask[i]], training=training)\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out, training=training)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [tf.transpose(core_out, perm=(1, 0, 2)), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(tf.transpose(t, perm=(1, 0, 2)) for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs.append(attentions)\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\nclass TFTransfoXLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = TransfoXLConfig\n    base_model_prefix = \"transformer\"\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFTransfoXLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLLMHeadModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n        self.sample_softmax = config.sample_softmax\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = TFAdaptiveSoftmaxMask(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val, name=\"crit\"\n        )\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if len(self.crit.out_layers) > 0:\n            return self.crit.out_layers[-1]\n        return None\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, labels=None, training=False):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLLMHeadModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            labels = inputs[4] if len(inputs) > 4 else labels\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = 
inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            labels = inputs.get(\"labels\", labels)\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            bsz, tgt_len = shape_list(input_ids)[:2]\n        else:\n            bsz, tgt_len = shape_list(inputs_embeds)[:2]\n\n        transformer_outputs = self.transformer([input_ids, mems, head_mask, inputs_embeds], training=training)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit([pred_hid, labels], training=training)\n        outputs = [softmax_output] + outputs\n\n        return outputs  # logits, new_mems, (all hidden states), (all attentions)\n\n    def prepare_inputs_for_generation(self, inputs, past, **model_kwargs):\n        inputs = {\"inputs\": inputs}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" A TF 2.0 Adaptive Softmax for Transformer XL model.\n\"\"\"\n\n\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import shape_list\n\n\nclass TFAdaptiveSoftmaxMask(tf.keras.layers.Layer):\n    def __init__(self, vocab_size, d_embed, d_proj, cutoffs, div_val=1, keep_order=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [vocab_size]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n        self.keep_order = keep_order\n\n        self.out_layers = []\n        self.out_projs = []\n\n    def build(self, input_shape):\n        if self.n_clusters > 0:\n            self.cluster_weight = self.add_weight(\n                shape=(self.n_clusters, self.d_embed), initializer=\"zeros\", trainable=True, name=\"cluster_weight\"\n            )\n            self.cluster_bias = self.add_weight(\n                shape=(self.n_clusters,), initializer=\"zeros\", trainable=True, name=\"cluster_bias\"\n            )\n\n        if self.div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if self.d_proj != self.d_embed:\n                    weight = self.add_weight(\n                        shape=(self.d_embed, self.d_proj),\n                        initializer=\"zeros\",\n                        trainable=True,\n                        name=\"out_projs_._{}\".format(i),\n                    )\n                    self.out_projs.append(weight)\n                else:\n                    self.out_projs.append(None)\n                weight = self.add_weight(\n                    shape=(self.vocab_size, self.d_embed,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(self.vocab_size,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = self.d_embed // (self.div_val ** i)\n\n                weight = self.add_weight(\n                    shape=(d_emb_i, self.d_proj), initializer=\"zeros\", trainable=True, name=\"out_projs_._{}\".format(i)\n                )\n                
self.out_projs.append(weight)\n                weight = self.add_weight(\n                    shape=(r_idx - l_idx, d_emb_i,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(r_idx - l_idx,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        super().build(input_shape)\n\n    @staticmethod\n    def _logit(x, W, b, proj=None):\n        y = x\n        if proj is not None:\n            y = tf.einsum(\"ibd,ed->ibe\", y, proj)\n        return tf.einsum(\"ibd,nd->ibn\", y, W) + b\n\n    @staticmethod\n    def _gather_logprob(logprob, target):\n        lp_size = shape_list(logprob)\n        r = tf.range(lp_size[0])\n        idx = tf.stack([r, target], 1)\n        return tf.gather_nd(logprob, idx)\n\n    def call(self, inputs, return_mean=True, training=False):\n        hidden, target = inputs\n        head_logprob = 0\n        if self.n_clusters == 0:\n            output = self._logit(hidden, self.out_layers[0][0], self.out_layers[0][1], self.out_projs[0])\n            if target is not None:\n                loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=output)\n            out = tf.nn.log_softmax(output, axis=-1)\n        else:\n            hidden_sizes = shape_list(hidden)\n            out = []\n            loss = tf.zeros(hidden_sizes[:2], dtype=tf.float32)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                if target is not None:\n                    mask = (target >= l_idx) & (target < r_idx)\n                    mask_idx = tf.where(mask)\n                    cur_target = tf.boolean_mask(target, mask) - l_idx\n\n                if self.div_val == 1:\n                    cur_W = self.out_layers[0][0][l_idx:r_idx]\n                    cur_b = self.out_layers[0][1][l_idx:r_idx]\n                else:\n                    cur_W = self.out_layers[i][0]\n                    cur_b = self.out_layers[i][1]\n\n                if i == 0:\n                    cur_W = tf.concat([cur_W, self.cluster_weight], 0)\n                    cur_b = tf.concat([cur_b, self.cluster_bias], 0)\n\n                    head_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[0])\n                    head_logprob = tf.nn.log_softmax(head_logit)\n                    out.append(head_logprob[..., : self.cutoffs[0]])\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_logprob = self._gather_logprob(cur_head_logprob, cur_target)\n                else:\n                    tail_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[i])\n                    tail_logprob = tf.nn.log_softmax(tail_logit)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    logprob_i = head_logprob[..., cluster_prob_idx, None] + tail_logprob\n                    out.append(logprob_i)\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_tail_logprob = tf.boolean_mask(tail_logprob, mask)\n               
         cur_logprob = self._gather_logprob(cur_tail_logprob, cur_target)\n                        cur_logprob += cur_head_logprob[:, self.cutoff_ends[1] + i - 1]\n                if target is not None:\n                    loss += tf.scatter_nd(mask_idx, -cur_logprob, tf.cast(shape_list(loss), dtype=tf.int64))\n            out = tf.concat(out, axis=-1)\n\n        if target is not None:\n            if return_mean:\n                loss = tf.reduce_mean(loss)\n            # Add the training-time loss value to the layer using `self.add_loss()`.\n            self.add_loss(loss)\n\n            # Log the loss as a metric (we could log arbitrary metrics,\n            # including different metrics for training and inference.\n            self.add_metric(loss, name=self.name, aggregation=\"mean\" if return_mean else \"\")\n\n        return out\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"TF general model utils.\"\"\"\nimport functools\nimport logging\nimport os\n\nimport h5py\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.python.keras.saving import hdf5_format\n\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import DUMMY_INPUTS, TF2_WEIGHTS_NAME, WEIGHTS_NAME, cached_path, hf_bucket_url, is_remote_url\nfrom .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFModelUtilsMixin:\n    \"\"\"\n    A few utilities for `tf.keras.Model`s, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the model.\n        \"\"\"\n        if only_trainable:\n            return int(sum(np.prod(w.shape.as_list()) for w in self.trainable_variables))\n        else:\n            return self.count_params()\n\n\ndef keras_serializable(cls):\n    \"\"\"\n    Decorate a Keras Layer class to support Keras serialization.\n\n    This is done by:\n    1. adding a `transformers_config` dict to the Keras config dictionary in `get_config` (called by Keras at\n       serialization time\n    2. wrapping `__init__` to accept that `transformers_config` dict (passed by Keras at deserialization time) and\n       convert it to a config object for the actual layer initializer\n    3. 
registering the class as a custom object in Keras (if the Tensorflow version supports this), so that it does\n       not need to be supplied in `custom_objects` in the call to `tf.keras.models.load_model`\n\n    :param cls: a tf.keras.layers.Layers subclass that accepts a `config` argument to its initializer (typically a\n                `TF*MainLayer` class in this project)\n    :return: the same class object, with modifications for Keras deserialization.\n    \"\"\"\n    initializer = cls.__init__\n\n    config_class = getattr(cls, \"config_class\", None)\n    if config_class is None:\n        raise AttributeError(\"Must set `config_class` to use @keras_serializable\")\n\n    @functools.wraps(initializer)\n    def wrapped_init(self, *args, **kwargs):\n        transformers_config = kwargs.pop(\"transformers_config\", None)\n        config = args[0] if args and isinstance(args[0], PretrainedConfig) else kwargs.get(\"config\", None)\n        if config is not None and transformers_config is not None:\n            raise ValueError(\"Must pass either `config` or `transformers_config`, not both\")\n        elif config is not None:\n            # normal layer construction, call with unchanged args (config is already in there)\n            initializer(self, *args, **kwargs)\n        elif transformers_config is not None:\n            # Keras deserialization, convert dict to config\n            config = config_class.from_dict(transformers_config)\n            initializer(self, config, *args, **kwargs)\n        else:\n            raise ValueError(\"Must pass either `config` (PretrainedConfig) or `transformers_config` (dict)\")\n        self._transformers_config = config\n\n    cls.__init__ = wrapped_init\n\n    if not hasattr(cls, \"get_config\"):\n        raise TypeError(\"Only use @keras_serializable on tf.keras.layers.Layer subclasses\")\n    if hasattr(cls.get_config, \"_is_default\"):\n\n        def get_config(self):\n            cfg = super(cls, self).get_config()\n            cfg[\"transformers_config\"] = self._transformers_config.to_dict()\n            return cfg\n\n        cls.get_config = get_config\n\n    cls._keras_serializable = True\n    if hasattr(tf.keras.utils, \"register_keras_serializable\"):\n        cls = tf.keras.utils.register_keras_serializable()(cls)\n    return cls\n\n\nclass TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):\n    r\"\"\" Base class for all TF models.\n\n        :class:`~transformers1.TFPreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top 
of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. \"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def _get_resized_embeddings(self, old_embeddings, new_num_tokens=None):\n        \"\"\" Build a resized Embedding Variable from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``tf.Variable``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        # if new_num_tokens is None:\n        #     return old_embeddings\n\n        # old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        # if old_num_tokens == new_num_tokens:\n        #     return old_embeddings\n\n        # # Build new embeddings\n        # new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        # new_embeddings.to(old_embeddings.weight.device)\n\n        # # initialize all new embeddings (in particular added tokens)\n        # self._init_weights(new_embeddings)\n\n        # # Copy token embeddings from the previous weights\n        # num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        # new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        # return new_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens=None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights 
embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``tf.Variable`` Module of the model.\n\n        Return: ``tf.Variable``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        raise NotImplementedError\n\n    def prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n        \"\"\"\n        raise NotImplementedError\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the :func:`~transformers1.PreTrainedModel.from_pretrained` class method.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Save configuration file\n        self.config.save_pretrained(save_directory)\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, TF2_WEIGHTS_NAME)\n        self.save_weights(output_model_file)\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained TF 2.0 model from a pre-trained model configuration.\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch state_dict save file` (e.g. `./pt_model/pytorch_model.bin`). In this case, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the PyTorch checkpoint to a TensorFlow model using the provided conversion scripts and loading the TensorFlow model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                    - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                    - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            from_pt: (`optional`) boolean, default False:\n                Load the model weights from a PyTorch state_dict save file (see docstring of pretrained_model_name_or_path argument).\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete an incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it is loaded) and initialize the model (e.g. ``output_attention=True``). Behaves differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_pt = kwargs.pop(\"from_pt\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_pt` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME], pretrained_model_name_or_path\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(WEIGHTS_NAME if from_pt else TF2_WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n 
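                    # cached_path downloads and caches remote URLs and returns a local file path; local paths are returned unchanged\n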
                   cache_dir=cache_dir,\n                    force_download=force_download,\n                    resume_download=resume_download,\n                    proxies=proxies,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {TF2_WEIGHTS_NAME}, {WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if from_pt:\n            # Load from a PyTorch checkpoint\n            return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file, allow_missing_keys=True)\n\n        model(model.dummy_inputs, training=False)  # build the network with dummy inputs\n\n        assert os.path.isfile(resolved_archive_file), \"Error retrieving file {}\".format(resolved_archive_file)\n        # 'by_name' allow us to do transfer learning by skipping/adding layers\n        # see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357\n        try:\n            model.load_weights(resolved_archive_file, by_name=True)\n        except OSError:\n            raise OSError(\n                \"Unable to load weights from h5 file. \"\n                \"If you tried to load a TF 2.0 model from a PyTorch checkpoint, please set from_pt=True. 
\"\n            )\n\n        model(model.dummy_inputs, training=False)  # Make sure restore ops are run\n\n        # Check if the models are the same to output loading informations\n        with h5py.File(resolved_archive_file, \"r\") as f:\n            if \"layer_names\" not in f.attrs and \"model_weights\" in f:\n                f = f[\"model_weights\"]\n            hdf5_layer_names = set(hdf5_format.load_attributes_from_hdf5_group(f, \"layer_names\"))\n        model_layer_names = set(layer.name for layer in model.layers)\n        missing_keys = list(model_layer_names - hdf5_layer_names)\n        unexpected_keys = list(hdf5_layer_names - model_layer_names)\n        error_msgs = []\n\n        if len(missing_keys) > 0:\n            logger.info(\n                \"Layers of {} not initialized from pretrained model: {}\".format(model.__class__.__name__, missing_keys)\n            )\n        if len(unexpected_keys) > 0:\n            logger.info(\n                \"Layers from pretrained model not used in {}: {}\".format(model.__class__.__name__, unexpected_keys)\n            )\n        if len(error_msgs) > 0:\n            raise RuntimeError(\n                \"Error(s) in loading weights for {}:\\n\\t{}\".format(model.__class__.__name__, \"\\n\\t\".join(error_msgs))\n            )\n        if output_loading_info:\n            loading_info = {\"missing_keys\": missing_keys, \"unexpected_keys\": unexpected_keys, \"error_msgs\": error_msgs}\n            return model, loading_info\n\n        return model\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        return {\"inputs\": inputs}\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def generate(\n        self,\n        input_ids=None,\n        max_length=None,\n        min_length=None,\n        do_sample=None,\n        early_stopping=None,\n        num_beams=None,\n        temperature=None,\n        top_k=None,\n        top_p=None,\n        repetition_penalty=None,\n        bad_words_ids=None,\n        bos_token_id=None,\n        pad_token_id=None,\n        eos_token_id=None,\n        length_penalty=None,\n        no_repeat_ngram_size=None,\n        num_return_sequences=None,\n        attention_mask=None,\n        decoder_start_token_id=None,\n        use_cache=None,\n    ):\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy or penalized greedy decoding, sampling with top-k or nucleus sampling\n        and beam-search.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. _`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `tf.Tensor` of `dtype=tf.int32` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `tf.Tensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between 1 and infinity. 
Defaults to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated. Between 0 and infinity. Defaults to 0.\n\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                If set to `True` beam search is stopped when at least `num_beams` sentences are finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Defaults to 1.\n\n            temperature: (`optional`) float\n                The value used to modulate the next token probabilities. Must be strictly positive. Defaults to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity. Defaults to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of the highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Defaults to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Defaults to 1.0.\n\n            bos_token_id: (`optional`) int\n                Beginning of sentence token if no prompt is provided. Defaults to the model-specific bos_token_id, or None if it does not exist.\n\n            pad_token_id: (`optional`) int\n                Pad token. Defaults to pad_token_id as defined in the model's config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to eos_token_id as defined in the model's config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Defaults to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. Defaults to 1.\n\n            attention_mask (`optional`) obj: `tf.Tensor` with `dtype=tf.int32` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to the model. 
Defaults to `True`.\n\n        Return:\n\n            output: `tf.Tensor` of `dtype=tf.int32` shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be 
generated\n        \"\"\"\n\n        # We cannot generate if the model does not have an LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have an LM head. \"\n                \"Please use another model class (e.g. `TFOpenAIGPTLMHeadModel`, `TFXLNetLMHeadModel`, `TFGPT2LMHeadModel`, `TFCTRLLMHeadModel`, `TFT5ForConditionalGeneration`, `TFTransfoXLLMHeadModel`)\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = shape_list(input_ids)[0]  # overridden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, \"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), 
\"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictely positive.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictely positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = tf.fill((batch_size, 1), bos_token_id)\n        else:\n            assert len(shape_list(input_ids)) == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids.numpy()):\n            attention_mask = tf.cast(tf.math.not_equal(input_ids, pad_token_id), dtype=tf.int32)\n        elif attention_mask is None:\n            attention_mask = tf.ones_like(input_ids)\n\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        cur_len = shape_list(input_ids)[1]\n        vocab_size = self.config.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = shape_list(input_ids)[-1]\n            input_ids = tf.broadcast_to(\n                tf.expand_dims(input_ids, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            attention_mask = tf.broadcast_to(\n                tf.expand_dims(attention_mask, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            input_ids = tf.reshape(\n                input_ids, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = tf.reshape(\n                attention_mask, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n\n            # create empty decoder_input_ids\n            input_ids = tf.ones((effective_batch_size * num_beams, 1), dtype=tf.int32,) * decoder_start_token_id\n            cur_len = 1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = tf.reshape(\n                
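# e.g. batch_size=2 and num_beams * effective_batch_mult = 3 gives indices [0, 0, 0, 1, 1, 1]\n                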
tf.repeat(tf.expand_dims(tf.range(batch_size), -1), repeats=num_beams * effective_batch_mult, axis=1),\n                shape=(-1,),\n            )\n            # expand encoder_outputs\n            encoder_outputs = (tf.gather(encoder_outputs[0], expanded_batch_idxs, axis=0), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = shape_list(input_ids)[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequence are generated independantly.\n        \"\"\"\n\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = tf.ones_like(input_ids[:, 0])\n        sent_lengths = tf.ones_like(input_ids[:, 0]) * max_length\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n         
   model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [batch_size, vocab_size])\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, eos_token_indices_mask, -float(\"inf\")\n                )\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = tf_top_k_top_p_filtering(next_token_logits, 
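\n                    # logits outside the top_k set or the top_p nucleus are set to -inf here, so they can never be sampled\n                    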
top_k=top_k, top_p=top_p)\n                # Sample\n                next_token = tf.squeeze(\n                    tf.random.categorical(next_token_logits, dtype=tf.int32, num_samples=1), axis=1\n                )\n            else:\n                # Greedy decoding\n                next_token = tf.math.argmax(next_token_logits, axis=-1, output_type=tf.int32)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = tf.concat([input_ids, tf.expand_dims(tokens_to_add, -1)], 1)\n            cur_len = cur_len + 1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = tf.math.multiply(\n                    unfinished_sents, tf.cast(eos_in_sents, tf.int32)\n                )\n                sent_lengths = (\n                    sent_lengths * (1 - is_sents_unfinished_and_token_to_add_is_eos)\n                    + cur_len * is_sents_unfinished_and_token_to_add_is_eos\n                )\n\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents -= is_sents_unfinished_and_token_to_add_is_eos\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if tf.math.reduce_max(unfinished_sents) == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        min_sent_length = tf.math.reduce_min(sent_lengths)\n        max_sent_length = tf.math.reduce_max(sent_lengths)\n        if min_sent_length != max_sent_length:\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            padding = tf.ones([batch_size, max_sent_length.numpy()], dtype=tf.int32) * pad_token_id\n\n            # create length masks for tf.where operation\n            broad_casted_sent_lengths = tf.broadcast_to(\n                tf.expand_dims(sent_lengths, -1), [batch_size, max_sent_length]\n            )\n            broad_casted_range = tf.transpose(\n                tf.broadcast_to(tf.expand_dims(tf.range(max_sent_length), -1), [max_sent_length, batch_size])\n            )\n\n            decoded = tf.where(broad_casted_range < broad_casted_sent_lengths, input_ids, padding)\n        else:\n            decoded = input_ids\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        
bos_token_id,\n        pad_token_id,\n        decoder_start_token_id,\n        eos_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores_begin = tf.zeros((batch_size, 1), dtype=tf.float32)\n            beam_scores_end = tf.ones((batch_size, num_beams - 1), dtype=tf.float32) * (-1e9)\n            beam_scores = tf.concat([beam_scores_begin, beam_scores_end], -1)\n        else:\n            beam_scores = tf.zeros((batch_size, num_beams), dtype=tf.float32)\n\n        beam_scores = tf.reshape(beam_scores, (batch_size * num_beams,))\n\n        # cache compute states\n        past = encoder_outputs\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            # Temperature (higher temperature => more likely to sample low probability tokens)\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            #             calculate log softmax score\n            scores = tf.nn.log_softmax(next_token_logits, axis=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask\n                num_batch_hypotheses = batch_size * num_beams\n\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [num_batch_hypotheses, vocab_size])\n\n                scores = set_tensor_by_indices_to_value(scores, eos_token_indices_mask, -float(\"inf\"))\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: 
https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                num_batch_hypotheses = batch_size * num_beams\n                banned_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            assert shape_list(scores) == [batch_size * num_beams, vocab_size]\n\n            if do_sample:\n                _scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # Top-p/top-k filtering\n                _scores = tf_top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                _scores = tf.reshape(_scores, (batch_size, num_beams * vocab_size))\n\n                next_tokens = tf.random.categorical(\n                    _scores, dtype=tf.int32, num_samples=2 * num_beams\n                )  # (batch_size, 2 * num_beams)\n                # Compute next scores\n                next_scores = tf.gather(_scores, next_tokens, batch_dims=1)  # (batch_size, 2 * num_beams)\n\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores_indices = tf.argsort(next_scores, direction=\"DESCENDING\", axis=1)\n                next_scores = tf.gather(next_scores, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n                next_tokens = tf.gather(next_tokens, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n            else:\n                # Add the log prob of the new beams to the log prob of the beginning of the sequence (sum of logs == log of the product)\n                next_scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                
next_scores = tf.reshape(\n                    next_scores, (batch_size, num_beams * vocab_size)\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = tf.math.top_k(next_scores, k=2 * num_beams, sorted=True)\n\n            assert shape_list(next_scores) == shape_list(next_tokens) == [batch_size, 2 * num_beams]\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.numpy() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            tf.identity(input_ids[effective_beam_id]), beam_token_score.numpy()\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n                    tf.reduce_max(next_scores[batch_idx]).numpy(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = tf.convert_to_tensor([x[0] for 
x in next_batch_beam], dtype=tf.float32)\n            beam_tokens = tf.convert_to_tensor([x[1] for x in next_batch_beam], dtype=tf.int32)\n            beam_idx = tf.convert_to_tensor([x[2] for x in next_batch_beam], dtype=tf.int32)\n\n            # re-order batch and update current length\n            input_ids = tf.stack([tf.identity(input_ids[x, :]) for x in beam_idx])\n            input_ids = tf.concat([input_ids, tf.expand_dims(beam_tokens, 1)], axis=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            # Add all open beam hypothesis to generated_hyps\n            if done[batch_idx]:\n                continue\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).numpy().item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert tf.reduce_all(\n                    next_scores[batch_idx, :num_beams] == tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].numpy().item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths_list = []\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                best_hyp = sorted_hyps.pop()[1]\n                sent_lengths_list.append(len(best_hyp))\n                best.append(best_hyp)\n        assert output_batch_size == len(best), \"Output batch size {} must match output beam hypotheses {}\".format(\n            output_batch_size, len(best)\n        )\n\n        sent_lengths = tf.convert_to_tensor(sent_lengths_list, dtype=tf.int32)\n\n        # shorter batches are filled with pad_token\n        if tf.reduce_min(sent_lengths).numpy() != tf.reduce_max(sent_lengths).numpy():\n            assert pad_token_id is not None, \"`Pad_token_id` 
has to be defined\"\n            sent_max_len = min(tf.reduce_max(sent_lengths).numpy() + 1, max_length)\n            decoded_list = []\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                assert sent_lengths[i] == shape_list(hypo)[0]\n                # if sent_length is max_len do not pad\n                if sent_lengths[i] == sent_max_len:\n                    decoded_slice = hypo\n                else:\n                    # else pad to sent_max_len\n                    num_pad_tokens = sent_max_len - sent_lengths[i]\n                    padding = pad_token_id * tf.ones((num_pad_tokens,), dtype=tf.int32)\n                    decoded_slice = tf.concat([hypo, padding], axis=-1)\n\n                    # finish sentence with EOS token\n                    if sent_lengths[i] < max_length:\n                        decoded_slice = tf.where(\n                            tf.range(sent_max_len, dtype=tf.int32) == sent_lengths[i],\n                            eos_token_id * tf.ones((sent_max_len,), dtype=tf.int32),\n                            decoded_slice,\n                        )\n                # add to list\n                decoded_list.append(decoded_slice)\n\n            decoded = tf.stack(decoded_list)\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = tf.stack(best)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        return tuple(tf.gather(layer_past, beam_idx, axis=1) for layer_past in past)\n\n\ndef _create_next_token_logits_penalties(input_ids, logits, repetition_penalty):\n    # create logit penalties for already seen input_ids\n    token_penalties = np.ones(shape_list(logits))\n    prev_input_ids = [np.unique(input_id) for input_id in input_ids.numpy()]\n    for i, prev_input_id in enumerate(prev_input_ids):\n        logit_penalized = logits[i].numpy()[prev_input_id]\n        logit_penalties = np.zeros(logit_penalized.shape)\n        # if previous logit score is < 0 then multiply repetition penalty else divide\n        logit_penalties[logit_penalized < 0] = repetition_penalty\n        logit_penalties[logit_penalized > 0] = 1 / repetition_penalty\n        np.put(token_penalties[i], prev_input_id, logit_penalties)\n    return tf.convert_to_tensor(token_penalties, dtype=tf.float32)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):\n    # Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].numpy().tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].numpy().tolist())\n        return 
generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids, bad_words_ids):\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.numpy().tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef tf_top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float(\"Inf\"), min_tokens_to_keep=1):\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    logits_shape = shape_list(logits)\n\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits_shape[-1])  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < tf.math.top_k(logits, k=top_k)[0][..., -1, None]\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n\n    if top_p < 1.0:\n        sorted_indices = tf.argsort(logits, direction=\"DESCENDING\")\n        sorted_logits = tf.gather(\n            logits, sorted_indices, axis=-1, batch_dims=1\n        )  # expects logits to be of dim (batch_size, vocab_size)\n\n        cumulative_probs = tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove = tf.concat(\n                [\n                    tf.zeros_like(sorted_indices_to_remove[:, :min_tokens_to_keep]),\n                    sorted_indices_to_remove[:, min_tokens_to_keep:],\n                ],\n                -1,\n            )\n\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove = tf.roll(sorted_indices_to_remove, 1, axis=-1)\n        sorted_indices_to_remove = tf.concat(\n            [tf.zeros_like(sorted_indices_to_remove[:, :1]), sorted_indices_to_remove[:, 1:]], -1,\n        )\n        # scatter sorted tensors to original indexing\n        indices_to_remove = scatter_values_on_batch_indices(sorted_indices_to_remove, sorted_indices)\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n    return logits\n\n\ndef scatter_values_on_batch_indices(values, batch_indices):\n    shape = shape_list(batch_indices)\n    # broadcast batch dim to shape\n    broad_casted_batch_dims = tf.reshape(tf.broadcast_to(tf.expand_dims(tf.range(shape[0]), axis=-1), shape), [1, -1])\n    # transform batch_indices to pair_indices\n    pair_indices = tf.transpose(tf.concat([broad_casted_batch_dims, tf.reshape(batch_indices, [1, -1])], 0))\n    # scatter values to pair indices\n    return tf.scatter_nd(pair_indices, tf.reshape(values, [-1]), shape)\n\n\ndef set_tensor_by_indices_to_value(tensor, indices, value):\n    # create value_tensor since tensor value assignment is not possible in TF\n    value_tensor = tf.zeros_like(tensor) + value\n    return tf.where(indices, value_tensor, tensor)\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, 
sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass TFConv1D(tf.keras.layers.Layer):\n    def __init__(self, nf, nx, initializer_range=0.02, **kwargs):\n        \"\"\" TFConv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__(**kwargs)\n        self.nf = nf\n        self.nx = nx\n        self.initializer_range = initializer_range\n\n    def build(self, input_shape):\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.nx, self.nf], initializer=get_initializer(self.initializer_range)\n        )\n        self.bias = self.add_weight(\"bias\", shape=[1, self.nf], initializer=tf.zeros_initializer())\n\n    def call(self, x):\n        bz, sl = shape_list(x)[:2]\n\n        x = tf.reshape(x, [-1, self.nx])\n        x = tf.matmul(x, self.weight) + self.bias\n\n        x = tf.reshape(x, [bz, sl, self.nf])\n\n        return x\n\n\nclass TFSharedEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct shared token embeddings.\n    \"\"\"\n\n    def __init__(self, vocab_size, hidden_size, initializer_range=None, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.initializer_range = hidden_size ** -0.5 if initializer_range is None else initializer_range\n\n    def build(self, input_shape):\n        \"\"\"Build shared token embedding layer\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.vocab_size, self.hidden_size], initializer=get_initializer(self.initializer_range)\n        )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\"):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                
shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, input_ids):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        return tf.gather(self.weight, input_ids)\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [..., hidden_size]\n            Returns:\n                float32 tensor with shape [..., vocab_size].\n        \"\"\"\n        first_dims = shape_list(inputs)[:-1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.weight, transpose_b=True)\n\n        return tf.reshape(logits, first_dims + [self.vocab_size])\n\n\nclass TFSequenceSummary(tf.keras.layers.Layer):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config, initializer_range=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.summary_type = config.summary_type if hasattr(config, \"summary_use_proj\") else \"last\"\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. 
https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.has_summary = hasattr(config, \"summary_use_proj\") and config.summary_use_proj\n        if self.has_summary:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = tf.keras.layers.Dense(\n                num_classes, kernel_initializer=get_initializer(initializer_range), name=\"summary\"\n            )\n\n        self.has_activation = hasattr(config, \"summary_activation\") and config.summary_activation == \"tanh\"\n        if self.has_activation:\n            self.activation = tf.keras.activations.tanh\n\n        self.has_first_dropout = hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0\n        if self.has_first_dropout:\n            self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)\n\n        self.has_last_dropout = hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0\n        if self.has_last_dropout:\n            self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)\n\n    def call(self, inputs, training=False):\n        \"\"\" hidden_states: float Tensor in shape [bsz, seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if not isinstance(inputs, (dict, tuple, list)):\n            hidden_states = inputs\n            cls_index = None\n        elif isinstance(inputs, (tuple, list)):\n            hidden_states = inputs[0]\n            cls_index = inputs[1] if len(inputs) > 1 else None\n            assert len(inputs) <= 2, \"Too many inputs.\"\n        else:\n            hidden_states = inputs.get(\"hidden_states\")\n            cls_index = inputs.get(\"cls_index\", None)\n\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = tf.reduce_mean(hidden_states, axis=1)\n        elif self.summary_type == \"cls_index\":\n            hidden_shape = shape_list(hidden_states)  # e.g. 
[batch, num choices, seq length, hidden dims]\n            if cls_index is None:\n                cls_index = tf.fill(\n                    hidden_shape[:-2], hidden_shape[-2] - 1\n                )  # A tensor full of shape [batch] or [batch, num choices] full of sequence length\n            cls_shape = shape_list(cls_index)\n            if len(cls_shape) <= len(hidden_shape) - 2:\n                cls_index = cls_index[..., tf.newaxis]\n            # else:\n            # cls_index = cls_index[..., tf.newaxis]\n            # cls_index = cls_index.expand((-1,) * (cls_index.dim()-1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = tf.gather(hidden_states, cls_index, batch_dims=len(hidden_shape) - 2)\n            output = tf.squeeze(\n                output, axis=len(hidden_shape) - 2\n            )  # shape of output: (batch, num choices, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        if self.has_first_dropout:\n            output = self.first_dropout(output, training=training)\n\n        if self.has_summary:\n            output = self.summary(output)\n\n        if self.has_activation:\n            output = self.activation(output)\n\n        if self.has_last_dropout:\n            output = self.last_dropout(output, training=training)\n\n        return output\n\n\ndef shape_list(x):\n    \"\"\"Deal with dynamic shape in tensorflow cleanly.\"\"\"\n    static = x.shape.as_list()\n    dynamic = tf.shape(x)\n    return [dynamic[i] if s is None else s for i, s in enumerate(static)]\n\n\ndef get_initializer(initializer_range=0.02):\n    \"\"\"Creates a `tf.initializers.truncated_normal` with the given range.\n    Args:\n        initializer_range: float, initializer range for stddev.\n    Returns:\n        TruncatedNormal initializer with stddev = `initializer_range`.\n    \"\"\"\n    return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSequenceSummary, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = tf.constant(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = tf.constant(np.cos(position_enc[:, 1::2]))\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None, dtype=tf.float32):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    bs = shape_list(lengths)[0]\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        # assert lengths.max().item() <= slen\n        alen = tf.range(slen)\n        mask = tf.math.less(alen, lengths[:, tf.newaxis])\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    if causal:\n        attn_mask = tf.less_equal(\n            tf.tile(alen[tf.newaxis, tf.newaxis, :], (bs, slen, 1)), alen[tf.newaxis, :, tf.newaxis]\n        )\n    else:\n        attn_mask = mask\n\n    # sanity check\n    # assert shape_list(mask) == [bs, slen]\n    tf.debugging.assert_equal(shape_list(mask), [bs, slen])\n    assert causal is False or shape_list(attn_mask) == [bs, slen, slen]\n\n    mask = tf.cast(mask, dtype=dtype)\n    attn_mask = tf.cast(attn_mask, dtype=dtype)\n\n    return mask, attn_mask\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config, **kwargs):\n        super().__init__(**kwargs)\n      
  self.layer_id = next(TFMultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"q_lin\")\n        self.k_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"k_lin\")\n        self.v_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"v_lin\")\n        self.out_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"out_lin\")\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        input, mask, kv, cache, head_mask = inputs\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = shape_list(input)\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = shape_list(kv)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if len(shape_list(mask)) == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, qlen, klen)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))                            # (bs, n_heads, qlen, klen)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n       
 if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TFTransformerFFN(tf.keras.layers.Layer):\n    def __init__(self, in_dim, dim_hidden, out_dim, config, **kwargs):\n        super().__init__(**kwargs)\n        self.lin1 = tf.keras.layers.Dense(dim_hidden, kernel_initializer=get_initializer(config.init_std), name=\"lin1\")\n        self.lin2 = tf.keras.layers.Dense(out_dim, kernel_initializer=get_initializer(config.init_std), name=\"lin2\")\n        self.act = tf.keras.layers.Activation(gelu) if config.gelu_activation else tf.keras.activations.relu\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFXLMMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.attention_dropout = tf.keras.layers.Dropout(config.attention_dropout)\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            self.dim,\n            embeddings_initializer=get_initializer(config.embed_init_std),\n            name=\"position_embeddings\",\n        )\n        if config.sinusoidal_embeddings:\n            raise NotImplementedError\n            # create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = tf.keras.layers.Embedding(\n                self.n_langs,\n                self.dim,\n                embeddings_initializer=get_initializer(config.embed_init_std),\n                name=\"lang_embeddings\",\n            )\n        self.embeddings = TFSharedEmbeddings(\n            
self.n_words, self.dim, initializer_range=config.embed_init_std, name=\"embeddings\"\n        )  # padding_idx=self.pad_index)\n        self.layer_norm_emb = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm_emb\")\n\n        # transformer layers\n        self.attentions = []\n        self.layer_norm1 = []\n        self.ffns = []\n        self.layer_norm2 = []\n        # if self.is_decoder:\n        #     self.layer_norm15 = []\n        #     self.encoder_attn = []\n\n        for i in range(self.n_layers):\n            self.attentions.append(\n                TFMultiHeadAttention(self.n_heads, self.dim, config=config, name=\"attentions_._{}\".format(i))\n            )\n            self.layer_norm1.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm1_._{}\".format(i))\n            )\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(\n                TFTransformerFFN(self.dim, self.hidden_dim, self.dim, config=config, name=\"ffns_._{}\".format(i))\n            )\n            self.layer_norm2.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm2_._{}\".format(i))\n            )\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):  # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = 
inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = 
self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = self.dropout(attn, training=training)\n            tensor = tensor + attn\n            tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass TFXLMPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        # Sometimes XLM has language embeddings so don't forget to build them as well if needed\n        inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFXLMPredLayer(tf.keras.layers.Layer):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n      
  super().__init__(**kwargs)\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        if config.asm is False:\n            self.input_embeddings = input_embeddings\n        else:\n            raise NotImplementedError\n            # self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n            #     in_features=dim,\n            #     n_classes=config.n_words,\n            #     cutoffs=config.asm_cutoffs,\n            #     div_value=config.asm_div_value,\n            #     head_bias=True,  # default is False\n            # )\n\n    def build(self, input_shape):\n        # The output weights are the same as the input embeddings, but there is an output-only bias for each token.\n        self.bias = self.add_weight(shape=(self.n_words,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMWithLMHeadModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.pred_layer = TFXLMPredLayer(config, self.transformer.embeddings, name=\"pred_layer_._proj\")\n\n    def get_output_embeddings(self):\n        return self.pred_layer.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = inputs.shape[0]\n        mask_token = tf.ones((effective_batch_size, 1), dtype=tf.int32) * mask_token_id\n        inputs = tf.concat([inputs, mask_token], axis=1)\n\n        if lang_id is not None:\n            langs = tf.ones_like(inputs) * lang_id\n        else:\n            langs = None\n        return {\"inputs\": inputs, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to 
compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMWithLMHeadModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output)\n        outputs = (outputs,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForSequenceClassification(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(config, initializer_range=config.init_std, name=\"sequence_summary\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForSequenceClassification\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = 
self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.init_std), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForQuestionAnsweringSimple\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0  XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_tf_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef gelu(x):\n    \"\"\" Implementation of the gelu activation function.\n        XLNet is using OpenAI GPT's gelu\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFXLNetRelativeAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n        self.initializer_range = config.initializer_range\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.q = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"q\"\n        )\n        self.k = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"k\"\n        )\n        self.v = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"v\"\n        )\n        self.o = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"o\"\n        )\n        self.r = 
self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"r\"\n        )\n        self.r_r_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n        )\n        self.r_s_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_s_bias\"\n        )\n        self.r_w_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n        )\n        self.seg_embed = self.add_weight(\n            shape=(2, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"seg_embed\"\n        )\n        super().build(input_shape)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def rel_shift(self, x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = shape_list(x)\n\n        x = tf.reshape(x, (x_size[1], x_size[0], x_size[2], x_size[3]))\n        x = x[1:, ...]\n        x = tf.reshape(x, (x_size[0], x_size[1] - 1, x_size[2], x_size[3]))\n        x = x[:, 0:klen, :, :]\n        # x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    def rel_attn_core(self, inputs, training=False):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        q_head, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask, head_mask = inputs\n\n        # content based attention score\n        ac = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift(bd, klen=shape_list(ac)[1])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = tf.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = tf.einsum(\"ijbs,ibns->ijbn\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == tf.float16:\n                attn_score = attn_score - 65500 * attn_mask\n            else:\n                attn_score = attn_score - 1e30 * attn_mask\n\n        # attention probability\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n\n        attn_prob = self.dropout(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # attention output\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, attn_prob\n\n        return attn_vec\n\n    def post_attention(self, inputs, residual=True, training=False):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to `d_model`)\n        h, attn_vec = inputs\n\n        attn_out = tf.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out, training=training)\n\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def call(self, inputs, training=False):\n        (h, g, 
attn_mask_h, attn_mask_g, r, seg_mat, mems, target_mapping, head_mask) = inputs\n\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec_h], training=training)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = tf.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = tf.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = tf.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention([g, attn_vec_g], training=training)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec], training=training)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + 
(attn_prob,)\n        return outputs\n\n\nclass TFXLNetFeedForward(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.layer_1 = tf.keras.layers.Dense(\n            config.d_inner, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_1\"\n        )\n        self.layer_2 = tf.keras.layers.Dense(\n            config.d_model, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_2\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def call(self, inp, training=False):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_2(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass TFXLNetLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.rel_attn = TFXLNetRelativeAttention(config, name=\"rel_attn\")\n        self.ff = TFXLNetFeedForward(config, name=\"ff\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, inputs, training=False):\n        outputs = self.rel_attn(inputs, training=training)\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g, training=training)\n        output_h = self.ff(output_h, training=training)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass TFXLNetLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFXLNetMainLayer(tf.keras.layers.Layer):\n    config_class = XLNetConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n        self.use_bfloat16 = config.use_bfloat16\n        self.initializer_range = config.initializer_range\n\n        self.word_embedding = 
TFSharedEmbeddings(\n            config.vocab_size, config.d_model, initializer_range=config.initializer_range, name=\"word_embedding\"\n        )\n        self.layer = [TFXLNetLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layer)]\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.mask_emb = self.add_weight(\n            shape=(1, 1, self.d_model), initializer=initializer, trainable=True, name=\"mask_emb\"\n        )\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen, dtype=tf.float32):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: TODO Lysandre didn't fill\n            mlen: TODO Lysandre didn't fill\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = tf.ones([qlen, qlen], dtype=dtype)\n        mask_u = tf.matrix_band_part(attn_mask, 0, -1)\n        mask_dia = tf.matrix_band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen], dtype=dtype)\n        ret = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.matrix_band_part(attn_mask, -1, 0)\n            ret = tf.concat([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], 1)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        \"\"\"cache hidden states into memory.\"\"\"\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :]\n\n        return tf.stop_gradient(new_mem)\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], axis=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = tf.tile(pos_emb, [1, bsz, 1])\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None, dtype=None):\n        \"\"\"create relative positional encoding.\"\"\"\n        freq_seq = tf.range(0, self.d_model, 2.0)\n        if dtype is not None and dtype != tf.float32:\n            freq_seq = tf.cast(freq_seq, dtype=dtype)\n        inv_freq = 1 / (10000 ** (freq_seq / self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` 
{}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            bwd_pos_seq = tf.range(-beg, -end, 1.0)\n\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n                bwd_pos_seq = tf.cast(bwd_pos_seq, dtype=dtype)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n                bwd_pos_seq = tf.clip_by_value(bwd_pos_seq, -self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                # With bi_data, the batch size should be divisible by 2.\n                assert bsz % 2 == 0\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = tf.concat([fwd_pos_emb, bwd_pos_emb], axis=1)\n        else:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        return pos_emb\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            mems = inputs[2] if len(inputs) > 2 else mems\n            perm_mask = inputs[3] if len(inputs) > 3 else perm_mask\n            target_mapping = inputs[4] if len(inputs) > 4 else target_mapping\n            token_type_ids = inputs[5] if len(inputs) > 5 else token_type_ids\n            input_mask = inputs[6] if len(inputs) > 6 else input_mask\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            use_cache = inputs[9] if len(inputs) > 9 else use_cache\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            mems = inputs.get(\"mems\", mems)\n            perm_mask = inputs.get(\"perm_mask\", perm_mask)\n            target_mapping = inputs.get(\"target_mapping\", target_mapping)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            input_mask = inputs.get(\"input_mask\", input_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for XLNet 
uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)[:2]\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = tf.transpose(token_type_ids, perm=(1, 0)) if token_type_ids is not None else None\n        input_mask = tf.transpose(input_mask, perm=(1, 0)) if input_mask is not None else None\n        attention_mask = tf.transpose(attention_mask, perm=(1, 0)) if attention_mask is not None else None\n        perm_mask = tf.transpose(perm_mask, perm=(1, 2, 0)) if perm_mask is not None else None\n        target_mapping = tf.transpose(target_mapping, perm=(1, 2, 0)) if target_mapping is not None else None\n\n        mlen = shape_list(mems[0])[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = tf.bfloat16 if self.use_bfloat16 else tf.float32\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        assert input_mask is None or attention_mask is None, (\n            \"You can only use one of input_mask (uses 1 for padding) \"\n            \"or attention_mask (uses 0 for padding, added for compatbility with BERT). 
Please choose one.\"\n        )\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - tf.cast(attention_mask, dtype=dtype_float)\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = tf.zeros([shape_list(data_mask)[0], mlen, bsz], dtype=dtype_float)\n                data_mask = tf.concat([mems_mask, data_mask], axis=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = tf.cast(attn_mask > 0, dtype=dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -tf.eye(qlen, dtype=dtype_float)\n            if mlen > 0:\n                non_tgt_mask = tf.concat([tf.zeros([qlen, mlen], dtype=dtype_float), non_tgt_mask], axis=-1)\n            non_tgt_mask = tf.cast((attn_mask + non_tgt_mask[:, :, None, None]) > 0, dtype=dtype_float)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k, training=training)\n        if target_mapping is not None:\n            word_emb_q = tf.tile(self.mask_emb, [shape_list(target_mapping)[0], bsz, 1])\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q, training=training)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = tf.zeros([mlen, bsz], dtype=tf.int32)\n                cat_ids = tf.concat([mem_pad, token_type_ids], 0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = tf.cast(tf.logical_not(tf.equal(token_type_ids[:, None], cat_ids[None, :])), tf.int32)\n            seg_mat = tf.one_hot(seg_mat, 2, dtype=dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz, dtype=dtype_float)\n        pos_emb = self.dropout(pos_emb, training=training)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n     
   if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            # cache new mems\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                [output_h, output_g, non_tgt_mask, attn_mask, pos_emb, seg_mat, mems[i], target_mapping, head_mask[i]],\n                training=training,\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h, training=training)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (tf.transpose(output, perm=(1, 0, 2)),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(tf.transpose(hs, perm=(1, 0, 2)) for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\nclass TFXLNetPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    base_model_prefix = \"transformer\"\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.XLNetTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetModel.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetLMHeadModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.lm_loss = TFXLNetLMHead(config, self.transformer.word_embedding, name=\"lm_loss\")\n\n    def get_output_embeddings(self):\n        return self.lm_loss.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = inputs.shape[0]\n        dummy_token = tf.zeros((effective_batch_size, 1), dtype=tf.int32)\n        inputs = tf.concat([inputs, dummy_token], axis=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = inputs.shape[1]\n        perm_mask = tf.zeros((effective_batch_size, sequence_length, sequence_length - 1), dtype=tf.float32)\n        perm_mask_seq_end = tf.ones((effective_batch_size, sequence_length, 1), dtype=tf.float32)\n        perm_mask = tf.concat([perm_mask, perm_mask_seq_end], axis=-1)\n\n        # We'll only predict the last token\n        target_mapping = tf.zeros((effective_batch_size, 1, sequence_length - 1), dtype=tf.float32)\n        target_mapping_seq_end = tf.ones((effective_batch_size, 1, 1), dtype=tf.float32)\n        target_mapping = tf.concat([target_mapping, target_mapping_seq_end], axis=-1)\n\n        inputs = {\n            \"inputs\": inputs,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        import numpy as np\n        from transformers1 import XLNetTokenizer, TFXLNetLMHeadModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=True))[None, :]  # We will predict the masked token\n        perm_mask = np.zeros((1, input_ids.shape[1], input_ids.shape[1]))\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = np.zeros((1, 1, input_ids.shape[1]))  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n        outputs = model(input_ids, perm_mask=tf.constant(perm_mask, dtype=tf.float32), target_mapping=tf.constant(target_mapping, dtype=tf.float32))\n\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_state = transformer_outputs[0]\n        logits = self.lm_loss(hidden_state)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"sequence_summary\"\n        )\n        self.logits_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"logits_proj\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForSequenceClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForTokenClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForTokenClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = self.classifier(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForQuestionAnsweringSimple\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = TFXLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n   
     return outputs  # start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n# @add_start_docstrings(\"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n#     the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n#     XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING)\n# class TFXLNetForQuestionAnswering(TFXLNetPreTrainedModel):\n#     r\"\"\"\n#     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n#         **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n#         **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Indices for the top config.start_n_top start token possibilities (beam-search).\n#         **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size,)``\n#             Log probabilities for the ``is_impossible`` label of the answers.\n#         **mems**:\n#             list of ``tf.Tensor`` (one for each layer):\n#             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n#             if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.\n#             See details in the docstring of the `mems` input above.\n#         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n#             list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)\n#             of shape ``(batch_size, sequence_length, hidden_size)``:\n#             Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n#         **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n#             list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n#             Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n#     Examples::\n\n#         # For example purposes. 
Not runnable.\n#         tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n#         model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')\n#         input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n#         start_positions = tf.constant([1])\n#         end_positions = tf.constant([3])\n#         outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n#         loss, start_scores, end_scores = outputs[:2]\n\n#     \"\"\"\n#     def __init__(self, config, *inputs, **kwargs):\n#         super().__init__(config, *inputs, **kwargs)\n#         self.start_n_top = config.start_n_top\n#         self.end_n_top = config.end_n_top\n\n#         self.transformer = TFXLNetMainLayer(config, name='transformer')\n#         self.start_logits = TFPoolerStartLogits(config, name='start_logits')\n#         self.end_logits = TFPoolerEndLogits(config, name='end_logits')\n#         self.answer_class = TFPoolerAnswerClass(config, name='answer_class')\n\n#     def call(self, inputs, training=False):\n#         transformer_outputs = self.transformer(inputs, training=training)\n#         hidden_states = transformer_outputs[0]\n#         start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n#         outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n#         if start_positions is not None and end_positions is not None:\n#             # If we are on multi-GPU, let's remove the dimension added by batch splitting\n#             for x in (start_positions, end_positions, cls_index, is_impossible):\n#                 if x is not None and x.dim() > 1:\n#                     x.squeeze_(-1)\n\n#             # during training, compute the end logits based on the ground truth of the start position\n#             end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n#             loss_fct = CrossEntropyLoss()\n#             start_loss = loss_fct(start_logits, start_positions)\n#             end_loss = loss_fct(end_logits, end_positions)\n#             total_loss = (start_loss + end_loss) / 2\n\n#             if cls_index is not None and is_impossible is not None:\n#                 # Predict answerability from the representation of CLS and START\n#                 cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n#                 loss_fct_cls = nn.BCEWithLogitsLoss()\n#                 cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n#                 # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n#                 total_loss += cls_loss * 0.5\n\n#             outputs = (total_loss,) + outputs\n\n#         else:\n#             # during inference, compute the end logits based on beam search\n#             bsz, slen, hsz = hidden_states.size()\n#             start_log_probs = F.softmax(start_logits, dim=-1) # shape (bsz, slen)\n\n#             start_top_log_probs, start_top_index = torch.topk(start_log_probs, self.start_n_top, dim=-1) # shape (bsz, start_n_top)\n#             start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz) # shape (bsz, start_n_top, hsz)\n#             start_states = torch.gather(hidden_states, -2, start_top_index_exp) # shape (bsz, start_n_top, hsz)\n#             start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1) # shape (bsz, 
slen, start_n_top, hsz)\n\n#             hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(start_states) # shape (bsz, slen, start_n_top, hsz)\n#             p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n#             end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n#             end_log_probs = F.softmax(end_logits, dim=1) # shape (bsz, slen, start_n_top)\n\n#             end_top_log_probs, end_top_index = torch.topk(end_log_probs, self.end_n_top, dim=1) # shape (bsz, end_n_top, start_n_top)\n#             end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n#             end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n#             start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)  # get the representation of START as weighted sum of hidden states\n#             cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)  # Shape (batch size,): one single `cls_logits` for each sample\n\n#             outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n#         # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n#         # or (if labels are provided) (total_loss,)\n#         return outputs\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n    In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py\n\"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_transfo_xl_utilities import ProjectedAdaptiveLogSoftmax\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\ndef build_tf_to_pytorch_map(model, config):\n    \"\"\" A map of modules from TF to PyTorch.\n        This time I use a map to keep the PyTorch model as identical to the original PyTorch model as possible.\n    \"\"\"\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        # We are loading in a TransfoXLLMHeadModel => we will load also the Adaptive Softmax\n        tf_to_pt_map.update(\n            {\n                \"transformer/adaptive_softmax/cutoff_0/cluster_W\": model.crit.cluster_weight,\n                \"transformer/adaptive_softmax/cutoff_0/cluster_b\": model.crit.cluster_bias,\n            }\n        )\n        for i, (out_l, proj_l, tie_proj) in enumerate(\n            zip(model.crit.out_layers, model.crit.out_projs, config.tie_projs)\n        ):\n            layer_str = \"transformer/adaptive_softmax/cutoff_%d/\" % i\n            if config.tie_weight:\n                tf_to_pt_map.update({layer_str + \"b\": out_l.bias})\n            else:\n                raise NotImplementedError\n                # I don't think this is implemented in the TF code\n                tf_to_pt_map.update({layer_str + \"lookup_table\": out_l.weight, layer_str + \"b\": out_l.bias})\n            if not tie_proj:\n                tf_to_pt_map.update({layer_str + \"proj\": proj_l})\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings\n    for i, (embed_l, proj_l) in enumerate(zip(model.word_emb.emb_layers, model.word_emb.emb_projs)):\n        layer_str = \"transformer/adaptive_embed/cutoff_%d/\" % i\n        tf_to_pt_map.update({layer_str + \"lookup_table\": embed_l.weight, layer_str + \"proj_W\": proj_l})\n\n    # Transformer blocks\n    for i, b in enumerate(model.layers):\n        layer_str = \"transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.dec_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": 
b.dec_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.dec_attn.o_net.weight,\n                layer_str + \"rel_attn/qkv/kernel\": b.dec_attn.qkv_net.weight,\n                layer_str + \"rel_attn/r/kernel\": b.dec_attn.r_net.weight,\n                layer_str + \"ff/LayerNorm/gamma\": b.pos_ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.pos_ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.pos_ff.CoreNet[0].weight,\n                layer_str + \"ff/layer_1/bias\": b.pos_ff.CoreNet[0].bias,\n                layer_str + \"ff/layer_2/kernel\": b.pos_ff.CoreNet[3].weight,\n                layer_str + \"ff/layer_2/bias\": b.pos_ff.CoreNet[3].bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        for b in model.layers:\n            r_r_list.append(b.dec_attn.r_r_bias)\n            r_w_list.append(b.dec_attn.r_w_bias)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n    tf_to_pt_map.update({\"transformer/r_r_bias\": r_r_list, \"transformer/r_w_bias\": r_w_list})\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_transfo_xl(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_to_pytorch_map(model, config)\n\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    for name, pointer in tf_to_pt_map.items():\n        assert name in tf_weights\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name or \"proj\" in name:\n            array = np.transpose(array)\n        if (\"r_r_bias\" in name or \"r_w_bias\" in name) and len(pointer) > 1:\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    
logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nclass PositionalEmbedding(nn.Module):\n    def __init__(self, demb):\n        super().__init__()\n\n        self.demb = demb\n\n        inv_freq = 1 / (10000 ** (torch.arange(0.0, demb, 2.0) / demb))\n        self.register_buffer(\"inv_freq\", inv_freq)\n\n    def forward(self, pos_seq, bsz=None):\n        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)\n        pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)\n\n        if bsz is not None:\n            return pos_emb[:, None, :].expand(-1, bsz, -1)\n        else:\n            return pos_emb[:, None, :]\n\n\nclass PositionwiseFF(nn.Module):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5):\n        super().__init__()\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.CoreNet = nn.Sequential(\n            nn.Linear(d_model, d_inner),\n            nn.ReLU(inplace=True),\n            nn.Dropout(dropout),\n            nn.Linear(d_inner, d_model),\n            nn.Dropout(dropout),\n        )\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.pre_lnorm = pre_lnorm\n\n    def forward(self, inp):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.CoreNet(self.layer_norm(inp))\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.CoreNet(inp)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass RelPartialLearnableMultiHeadAttn(nn.Module):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n    ):\n        super().__init__()\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False)\n\n        self.drop = nn.Dropout(dropout)\n        self.dropatt = nn.Dropout(dropatt)\n        self.o_net = nn.Linear(n_head * d_head, d_model, bias=False)\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is None or r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        else:\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n\n        self.r_net = nn.Linear(self.d_model, self.n_head * self.d_head, bias=False)\n\n    def _rel_shift(self, x):\n        zero_pad_shape = (x.size(0), 1) + x.size()[2:]\n        zero_pad = torch.zeros(zero_pad_shape, device=x.device, dtype=x.dtype)\n        x_padded = torch.cat([zero_pad, x], dim=1)\n\n        x_padded_shape = (x.size(1) + 1, x.size(0)) + x.size()[2:]\n        x_padded = 
x_padded.view(*x_padded_shape)\n\n        x = x_padded[1:].view_as(x)\n\n        return x\n\n    def forward(self, w, r, attn_mask=None, mems=None, head_mask=None):\n        qlen, rlen, bsz = w.size(0), r.size(0), w.size(1)\n\n        if mems is not None:\n            cat = torch.cat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n\n        klen = w_head_k.size(0)\n\n        w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n\n        r_head_k = r_head_k.view(rlen, self.n_head, self.d_head)  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = torch.einsum(\"ibnd,jbnd->ijbn\", (rw_head_q, w_head_k))  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = torch.einsum(\"ibnd,jnd->ijbn\", (rr_head_q, r_head_k))  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score.mul_(self.scale)\n\n        # compute attention probability\n        if attn_mask is not None and torch.sum(attn_mask).item():\n            attn_mask = attn_mask == 1  # Switch to bool\n            if attn_mask.dim() == 2:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = (\n                        attn_score.float().masked_fill(attn_mask[None, :, :, None], -65000).type_as(attn_score)\n                    )\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[None, :, :, None], -1e30).type_as(attn_score)\n            elif attn_mask.dim() == 3:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -65000).type_as(attn_score)\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -1e30).type_as(attn_score)\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = F.softmax(attn_score, dim=1)\n        attn_prob = self.dropatt(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = torch.einsum(\"ijbn,jbnd->ibnd\", (attn_prob, w_head_v))\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec = attn_vec.contiguous().view(attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head)\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        
else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass RelPartialLearnableDecoderLayer(nn.Module):\n    def __init__(self, n_head, d_model, d_head, d_inner, dropout, layer_norm_epsilon=1e-5, **kwargs):\n        super().__init__()\n\n        self.dec_attn = RelPartialLearnableMultiHeadAttn(\n            n_head, d_model, d_head, dropout, layer_norm_epsilon=layer_norm_epsilon, **kwargs\n        )\n        self.pos_ff = PositionwiseFF(\n            d_model, d_inner, dropout, pre_lnorm=kwargs.get(\"pre_lnorm\"), layer_norm_epsilon=layer_norm_epsilon\n        )\n\n    def forward(self, dec_inp, r, dec_attn_mask=None, mems=None, head_mask=None):\n\n        attn_outputs = self.dec_attn(dec_inp, r, attn_mask=dec_attn_mask, mems=mems, head_mask=head_mask)\n        ff_output = self.pos_ff(attn_outputs[0])\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass AdaptiveEmbedding(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, sample_softmax=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = nn.ModuleList()\n        self.emb_projs = nn.ParameterList()\n        if div_val == 1:\n            self.emb_layers.append(nn.Embedding(n_token, d_embed, sparse=sample_softmax > 0))\n            if d_proj != d_embed:\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(nn.Embedding(r_idx - l_idx, d_emb_i))\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n    def forward(self, inp):\n        if self.div_val == 1:\n            embed = self.emb_layers[0](inp)\n            if self.d_proj != self.d_embed:\n                embed = F.linear(embed, self.emb_projs[0])\n        else:\n            param = next(self.parameters())\n            inp_flat = inp.view(-1)\n            emb_flat = torch.zeros([inp_flat.size(0), self.d_proj], dtype=param.dtype, device=param.device)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n                indices_i = mask_i.nonzero().squeeze()\n\n                if indices_i.numel() == 0:\n                    continue\n\n                inp_i = inp_flat.index_select(0, indices_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = F.linear(emb_i, self.emb_projs[i])\n\n                emb_flat.index_copy_(0, indices_i, emb_i)\n\n            embed_shape = inp.size() + (self.d_proj,)\n            embed = emb_flat.view(embed_shape)\n\n        embed.mul_(self.emb_scale)\n\n        return embed\n\n\nclass TransfoXLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    
config_class = TransfoXLConfig\n    load_tf_weights = load_tf_weights_in_transfo_xl\n    base_model_prefix = \"transformer\"\n\n    def _init_weight(self, weight):\n        if self.config.init == \"uniform\":\n            nn.init.uniform_(weight, -self.config.init_range, self.config.init_range)\n        elif self.config.init == \"normal\":\n            nn.init.normal_(weight, 0.0, self.config.init_std)\n\n    def _init_bias(self, bias):\n        nn.init.constant_(bias, 0.0)\n\n    def _init_weights(self, m):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        classname = m.__class__.__name__\n        if classname.find(\"Linear\") != -1:\n            if hasattr(m, \"weight\") and m.weight is not None:\n                self._init_weight(m.weight)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        elif classname.find(\"AdaptiveEmbedding\") != -1:\n            if hasattr(m, \"emb_projs\"):\n                for i in range(len(m.emb_projs)):\n                    if m.emb_projs[i] is not None:\n                        nn.init.normal_(m.emb_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"Embedding\") != -1:\n            if hasattr(m, \"weight\"):\n                self._init_weight(m.weight)\n        elif classname.find(\"ProjectedAdaptiveLogSoftmax\") != -1:\n            if hasattr(m, \"cluster_weight\") and m.cluster_weight is not None:\n                self._init_weight(m.cluster_weight)\n            if hasattr(m, \"cluster_bias\") and m.cluster_bias is not None:\n                self._init_bias(m.cluster_bias)\n            if hasattr(m, \"out_projs\"):\n                for i in range(len(m.out_projs)):\n                    if m.out_projs[i] is not None:\n                        nn.init.normal_(m.out_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"LayerNorm\") != -1:\n            if hasattr(m, \"weight\"):\n                nn.init.normal_(m.weight, 1.0, self.config.init_std)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        else:\n            if hasattr(m, \"r_emb\"):\n                self._init_weight(m.r_emb)\n            if hasattr(m, \"r_w_bias\"):\n                self._init_weight(m.r_w_bias)\n            if hasattr(m, \"r_r_bias\"):\n                self._init_weight(m.r_r_bias)\n            if hasattr(m, \"r_bias\"):\n                self._init_bias(m.r_bias)\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            
:func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n\n        self.word_emb = AdaptiveEmbedding(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.drop = nn.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        if not config.untie_r:\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n\n        self.layers = nn.ModuleList()\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    RelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if config.untie_r else self.r_w_bias,\n                        r_r_bias=None if config.untie_r else self.r_r_bias,\n                     
   output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings are not used in our pretrained checkpoints\n            raise NotImplementedError  # Removed them to avoid maintaining dead code\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = PositionalEmbedding(self.d_model)\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_emb = new_embeddings\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        logger.info(\"Head pruning is not implemented for Transformer-XL model\")\n        pass\n\n    def init_mems(self, bsz):\n        if self.mem_len > 0:\n            mems = []\n            param = next(self.parameters())\n            for i in range(self.n_layer):\n                empty = torch.zeros(self.mem_len, bsz, self.config.d_model, dtype=param.dtype, device=param.device)\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        with torch.no_grad():\n            new_mems = []\n            end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n            beg_idx = max(0, end_idx - self.mem_len)\n            for i in range(len(hids)):\n\n                cat = torch.cat([mems[i], hids[i]], dim=0)\n                new_mems.append(cat[beg_idx:end_idx].detach())\n\n        return new_mems\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.size()\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = mems[0].size(0) if mems is not None else 0\n        klen = mlen + qlen\n        if self.same_length:\n            all_ones = 
word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n            mask_len = klen - self.mem_len\n            if mask_len > 0:\n                mask_shift_len = qlen - mask_len\n            else:\n                mask_shift_len = qlen\n            dec_attn_mask = (torch.triu(all_ones, 1 + mlen) + torch.tril(all_ones, -mask_shift_len))[:, :, None]  # -1\n        else:\n            dec_attn_mask = torch.triu(word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1 + mlen)[\n                :, :, None\n            ]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, dtype=word_emb.dtype)\n            if self.clamp_len > 0:\n                pos_seq.clamp_(max=self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb)\n            pos_emb = self.drop(pos_emb)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer(\n                    core_out, pos_emb, dec_attn_mask=dec_attn_mask, mems=mems_i, head_mask=head_mask[i]\n                )\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [core_out.transpose(0, 1).contiguous(), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(t.transpose(0, 1).contiguous() for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs.append(attentions)\n\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLLMHeadModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TransfoXLModel(config)\n        self.sample_softmax = config.sample_softmax\n\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = ProjectedAdaptiveLogSoftmax(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.init_weights()\n\n    def tie_weights(self):\n        \"\"\"\n        Run this to be sure output and input (adaptive) softmax weights are tied\n        \"\"\"\n\n        if self.config.tie_weight:\n            for i in range(len(self.crit.out_layers)):\n                self._tie_or_clone_weights(self.crit.out_layers[i], self.transformer.word_emb.emb_layers[i])\n        if self.config.tie_projs:\n            for i, tie_proj in enumerate(self.config.tie_projs):\n                if tie_proj and self.config.div_val == 1 and self.config.d_model != self.config.d_embed:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[0].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[0]\n                elif tie_proj and self.config.div_val != 1:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[i].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[i]\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(batch_size, sequence_length-1)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLLMHeadModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if input_ids is not None:\n            bsz, tgt_len = input_ids.size(0), input_ids.size(1)\n        elif inputs_embeds is not None:\n            bsz, tgt_len = inputs_embeds.size(0), inputs_embeds.size(1)\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        transformer_outputs = self.transformer(input_ids, mems=mems, head_mask=head_mask, inputs_embeds=inputs_embeds)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit(pred_hid, labels)\n        if labels is None:\n            softmax_output = softmax_output.view(bsz, tgt_len, -1)\n            outputs = [softmax_output] + outputs\n        else:\n            softmax_output = softmax_output.view(bsz, tgt_len - 1)\n            outputs = [softmax_output, None] + outputs\n\n        return outputs  # (loss), logits or None if labels is not None (speed up adaptive softmax), new_mems, (all hidden states), (all attentions)\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if self.sample_softmax > 0:\n            return self.out_layer\n        else:\n            return self.crit.out_layers[-1]\n\n    def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):\n        inputs = {\"input_ids\": input_ids}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Utilities for PyTorch Transformer XL model.\n    Directly adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\n# CUDA_MAJOR = int(torch.version.cuda.split('.')[0])\n# CUDA_MINOR = int(torch.version.cuda.split('.')[1])\n\n\nclass ProjectedAdaptiveLogSoftmax(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, keep_order=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [n_token]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n\n        if self.n_clusters > 0:\n            self.cluster_weight = nn.Parameter(torch.zeros(self.n_clusters, self.d_embed))\n            self.cluster_bias = nn.Parameter(torch.zeros(self.n_clusters))\n\n        self.out_layers = nn.ModuleList()\n        self.out_projs = nn.ParameterList()\n\n        if div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if d_proj != d_embed:\n                    self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n                else:\n                    self.out_projs.append(None)\n\n            self.out_layers.append(nn.Linear(d_embed, n_token))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n\n                self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n                self.out_layers.append(nn.Linear(d_emb_i, r_idx - l_idx))\n\n        self.keep_order = keep_order\n\n    def _compute_logit(self, hidden, weight, bias, proj):\n        if proj is None:\n            logit = F.linear(hidden, weight, bias=bias)\n        else:\n            # if CUDA_MAJOR <= 9 and CUDA_MINOR <= 1:\n            proj_hid = F.linear(hidden, proj.t().contiguous())\n            logit = F.linear(proj_hid, weight, bias=bias)\n            # else:\n            #     logit = torch.einsum('bd,de,ev->bv', (hidden, proj, weight.t()))\n            #     if bias is not None:\n            #         logit = logit + bias\n\n        return logit\n\n    def forward(self, hidden, labels=None, keep_order=False):\n        \"\"\"\n            Params:\n                hidden :: [len*bsz x d_proj]\n                labels :: [len*bsz]\n            Return:\n                if labels is None:\n                    out :: [len*bsz x n_tokens] log probabilities of tokens over 
the vocabulary\n                else:\n                    out :: [(len-1)*bsz] Negative log likelihood\n            We could replace this implementation by the native PyTorch one\n            if their's had an option to set bias on all clusters in the native one.\n            here: https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138\n        \"\"\"\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            hidden = hidden[..., :-1, :].contiguous()\n            labels = labels[..., 1:].contiguous()\n            hidden = hidden.view(-1, hidden.size(-1))\n            labels = labels.view(-1)\n            if hidden.size(0) != labels.size(0):\n                raise RuntimeError(\"Input and labels should have the same size \" \"in the batch dimension.\")\n        else:\n            hidden = hidden.view(-1, hidden.size(-1))\n\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            if labels is not None:\n                out = -F.log_softmax(logit, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)\n            else:\n                out = F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            if labels is None:\n                out = hidden.new_empty((head_logit.size(0), self.n_token))\n            else:\n                out = torch.zeros_like(labels, dtype=hidden.dtype, device=hidden.device)\n\n            offset = 0\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if labels is not None:\n                    mask_i = (labels >= l_idx) & (labels < r_idx)\n                    indices_i = mask_i.nonzero().squeeze()\n\n                    if indices_i.numel() == 0:\n                        continue\n\n                    target_i = labels.index_select(0, indices_i) - l_idx\n                    head_logprob_i = head_logprob.index_select(0, indices_i)\n                    hidden_i = hidden.index_select(0, indices_i)\n                else:\n                    hidden_i = hidden\n\n                if i == 0:\n                    if labels is not None:\n                        logprob_i = head_logprob_i.gather(1, target_i[:, None]).squeeze(1)\n                    else:\n                        out[:, : self.cutoffs[0]] = 
head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden_i, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    if labels is not None:\n                        logprob_i = head_logprob_i[:, cluster_prob_idx] + tail_logprob_i.gather(\n                            1, target_i[:, None]\n                        ).squeeze(1)\n                    else:\n                        logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                        out[:, l_idx:r_idx] = logprob_i\n\n                if labels is not None:\n                    if (hasattr(self, \"keep_order\") and self.keep_order) or keep_order:\n                        out.index_copy_(0, indices_i, -logprob_i)\n                    else:\n                        out[offset : offset + logprob_i.size(0)].copy_(-logprob_i)\n                    offset += logprob_i.size(0)\n\n        return out\n\n    def log_prob(self, hidden):\n        r\"\"\" Computes log probabilities for all :math:`n\\_classes`\n        From: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/adaptive.py\n        Args:\n            hidden (Tensor): a minibatch of examples\n        Returns:\n            log-probabilities of for each class :math:`c`\n            in range :math:`0 <= c <= n\\_classes`, where :math:`n\\_classes` is a\n            parameter passed to ``AdaptiveLogSoftmaxWithLoss`` constructor.\n        Shape:\n            - Input: :math:`(N, in\\_features)`\n            - Output: :math:`(N, n\\_classes)`\n        \"\"\"\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            return F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n\n            out = hidden.new_empty((head_logit.size(0), self.n_token))\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                start_idx, stop_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if i == 0:\n                    out[:, : self.cutoffs[0]] = head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], 
biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n\n                    # Index the head-cluster column the same way as in forward(), then broadcast over the tail vocab\n                    cluster_prob_idx = self.cutoffs[0] + i - 1\n                    logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                    out[:, start_idx:stop_idx] = logprob_i\n\n            return out\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors, Facebook AI Research authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport inspect\nimport logging\nimport os\nfrom typing import Callable, Dict, Iterable, List, Optional, Tuple\n\nimport torch\nfrom torch import Tensor, device, dtype, nn\nfrom torch.nn import CrossEntropyLoss\nfrom torch.nn import functional as F\n\nfrom .activations import get_activation\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import (\n    DUMMY_INPUTS,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\ntry:\n    from torch.nn import Identity\nexcept ImportError:\n    # Older PyTorch compatibility\n    class Identity(nn.Module):\n        r\"\"\"A placeholder identity operator that is argument-insensitive.\n        \"\"\"\n\n        def __init__(self, *args, **kwargs):\n            super().__init__()\n\n        def forward(self, input):\n            return input\n\n\nclass ModuleUtilsMixin:\n    \"\"\"\n    A few utilities for torch.nn.Modules, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the module.\n        \"\"\"\n        params = filter(lambda x: x.requires_grad, self.parameters()) if only_trainable else self.parameters()\n        return sum(p.numel() for p in params)\n\n    @staticmethod\n    def _hook_rss_memory_pre_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_pre_forward = mem.rss\n        return None\n\n    @staticmethod\n    def _hook_rss_memory_post_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_post_forward = mem.rss\n        mem_rss_diff = module.mem_rss_post_forward - module.mem_rss_pre_forward\n        module.mem_rss_diff = mem_rss_diff + (module.mem_rss_diff if hasattr(module, \"mem_rss_diff\") else 0)\n        return None\n\n    def add_memory_hooks(self):\n        \"\"\" Add a memory hook before and after each sub-module forward pass to record increase in memory consumption.\n            Increase in memory consumption is stored in a `mem_rss_diff` attribute for each module and can be reset to zero with `model.reset_memory_hooks_state()`\n        \"\"\"\n        for module in 
self.modules():\n            module.register_forward_pre_hook(self._hook_rss_memory_pre_forward)\n            module.register_forward_hook(self._hook_rss_memory_post_forward)\n        self.reset_memory_hooks_state()\n\n    def reset_memory_hooks_state(self):\n        for module in self.modules():\n            module.mem_rss_diff = 0\n            module.mem_rss_post_forward = 0\n            module.mem_rss_pre_forward = 0\n\n    @property\n    def device(self) -> device:\n        \"\"\"\n        Get torch.device from module, assuming that the whole module has one device.\n        \"\"\"\n        try:\n            return next(self.parameters()).device\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].device\n\n    @property\n    def dtype(self) -> dtype:\n        \"\"\"\n        Get torch.dtype from module, assuming that the whole module has one dtype.\n        \"\"\"\n        try:\n            return next(self.parameters()).dtype\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].dtype\n\n    def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:\n        \"\"\"type: torch.Tensor -> torch.Tensor\"\"\"\n        if encoder_attention_mask.dim() == 3:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n        if encoder_attention_mask.dim() == 2:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow\n        # /transformer/transformer_layers.py#L270\n        # encoder_extended_attention_mask = (encoder_extended_attention_mask ==\n        # encoder_extended_attention_mask.transpose(-1, -2))\n        encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n\n        if self.dtype == torch.float16:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e4\n        elif self.dtype == torch.float32:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            raise ValueError(\n                \"{} not recognized. 
`dtype` should be set to either `torch.float32` or `torch.float16`\".format(\n                    self.dtype\n                )\n            )\n\n        return encoder_extended_attention_mask\n\n    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple, device: device) -> Tensor:\n        \"\"\"Makes broadcastable attention mask and causal mask so that future and maked tokens are ignored.\n\n        Arguments:\n            attention_mask: torch.Tensor with 1 indicating tokens to ATTEND to\n            input_shape: tuple, shape of input_ids\n            device: torch.Device, usually self.device\n\n        Returns:\n            torch.Tensor with dtype of attention_mask.dtype\n        \"\"\"\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        if attention_mask.dim() == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif attention_mask.dim() == 2:\n            # Provided a padding mask of dimensions [batch_size, seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]\n            if self.config.is_decoder:\n                batch_size, seq_length = input_shape\n                seq_ids = torch.arange(seq_length, device=device)\n                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]\n                # causal and attention masks must have same type with pytorch version < 1.3\n                causal_mask = causal_mask.to(attention_mask.dtype)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n        else:\n            raise ValueError(\n                \"Wrong shape for input_ids (shape {}) or attention_mask (shape {})\".format(\n                    input_shape, attention_mask.shape\n                )\n            )\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask: Tensor, num_hidden_layers: int, is_attention_chunked: bool = False) -> Tensor:\n        \"\"\"\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        attention_probs has shape bsz x n_heads x N x N\n        Arguments:\n            head_mask: torch.Tensor or None: has shape [num_heads] or [num_hidden_layers x num_heads]\n            num_hidden_layers: int\n        Returns:\n             Tensor of shape shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n             or list with [None] for each layer\n        \"\"\"\n        if head_mask is not None:\n            head_mask = 
self._convert_head_mask_to_5d(head_mask, num_hidden_layers)\n            if is_attention_chunked is True:\n                head_mask = head_mask.unsqueeze(-1)\n        else:\n            head_mask = [None] * num_hidden_layers\n\n        return head_mask\n\n    def _convert_head_mask_to_5d(self, head_mask, num_hidden_layers):\n        \"\"\"-> [num_hidden_layers x batch x num_heads x seq_length x seq_length]\"\"\"\n        if head_mask.dim() == 1:\n            head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)\n            head_mask = head_mask.expand(num_hidden_layers, -1, -1, -1, -1)\n        elif head_mask.dim() == 2:\n            head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer\n        assert head_mask.dim() == 5, f\"head_mask.dim != 5, instead {head_mask.dim()}\"\n        head_mask = head_mask.to(dtype=self.dtype)  # switch to fload if need + fp16 compatibility\n        return head_mask\n\n\nclass PreTrainedModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\" Base class for all models.\n\n        :class:`~transformers1.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to do a forward pass in the network.\n\n        Returns:\n            torch.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": torch.tensor(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__()\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. 
\"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    @property\n    def base_model(self):\n        return getattr(self, self.base_model_prefix, self)\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def set_input_embeddings(self, value: nn.Module):\n        \"\"\"\n        Set model's input embeddings\n\n        Args:\n            value (:obj:`nn.Module`):\n                A module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            base_model.set_input_embeddings(value)\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def tie_weights(self):\n        \"\"\"\n        Tie the weights between the input embeddings and the output embeddings.\n        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning\n        the weights instead.\n        \"\"\"\n        output_embeddings = self.get_output_embeddings()\n        if output_embeddings is not None:\n            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())\n\n    def _tie_or_clone_weights(self, output_embeddings, input_embeddings):\n        \"\"\" Tie or clone module weights depending of whether we are using TorchScript or not\n        \"\"\"\n        if self.config.torchscript:\n            output_embeddings.weight = nn.Parameter(input_embeddings.weight.clone())\n        else:\n            output_embeddings.weight = input_embeddings.weight\n\n        if getattr(output_embeddings, \"bias\", None) is not None:\n            output_embeddings.bias.data = torch.nn.functional.pad(\n                output_embeddings.bias.data,\n                (0, output_embeddings.weight.shape[0] - output_embeddings.bias.shape[0],),\n                \"constant\",\n                0,\n            )\n        if hasattr(output_embeddings, \"out_features\") and hasattr(input_embeddings, \"num_embeddings\"):\n            output_embeddings.out_features = input_embeddings.num_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. 
Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model.\n\n        Return: ``torch.nn.Embeddings``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed\n        model_embeds = base_model._resize_token_embeddings(new_num_tokens)\n        if new_num_tokens is None:\n            return model_embeds\n\n        # Update base model and current model config\n        self.config.vocab_size = new_num_tokens\n        base_model.vocab_size = new_num_tokens\n\n        # Tie weights again if needed\n        self.tie_weights()\n\n        return model_embeds\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.get_input_embeddings()\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.set_input_embeddings(new_embeddings)\n        return self.get_input_embeddings()\n\n    def _get_resized_embeddings(\n        self, old_embeddings: torch.nn.Embedding, new_num_tokens: Optional[int] = None\n    ) -> torch.nn.Embedding:\n        \"\"\" Build a resized Embedding Module from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            old_embeddings: ``torch.nn.Embedding``\n                Old embeddings to be resized.\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``torch.nn.Embedding``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        if new_num_tokens is None:\n            return old_embeddings\n\n        old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        if old_num_tokens == new_num_tokens:\n            return old_embeddings\n\n        # Build new embeddings\n        new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        new_embeddings.to(old_embeddings.weight.device)\n\n        # initialize all new embeddings (in particular added tokens)\n        self._init_weights(new_embeddings)\n\n        # Copy token embeddings from the previous weights\n        num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        return new_embeddings\n\n    def init_weights(self):\n        \"\"\" Initialize and prunes weights if needed. 
\"\"\"\n        # Initialize weights\n        self.apply(self._init_weights)\n\n        # Prune heads if needed\n        if self.config.pruned_heads:\n            self.prune_heads(self.config.pruned_heads)\n\n        # Tie weights if needed\n        self.tie_weights()\n\n    def prune_heads(self, heads_to_prune: Dict):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n                E.g. {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.\n        \"\"\"\n        # save new sets of pruned heads as union of previously stored pruned heads and newly pruned heads\n        for layer, heads in heads_to_prune.items():\n            union_heads = set(self.config.pruned_heads.get(layer, [])) | set(heads)\n            self.config.pruned_heads[layer] = list(union_heads)  # Unfortunately we have to store it as list for JSON\n\n        self.base_model._prune_heads(heads_to_prune)\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the `:func:`~transformers1.PreTrainedModel.from_pretrained`` class method.\n\n            Arguments:\n                save_directory: directory to which to save.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Only save the model itself if we are using distributed training\n        model_to_save = self.module if hasattr(self, \"module\") else self\n\n        # Attach architecture to the config\n        model_to_save.config.architectures = [model_to_save.__class__.__name__]\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, WEIGHTS_NAME)\n\n        if getattr(self.config, \"xla_device\", False):\n            import torch_xla.core.xla_model as xm\n\n            if xm.is_master_ordinal():\n                # Save configuration file\n                model_to_save.config.save_pretrained(save_directory)\n            # xm.save takes care of saving only from master\n            xm.save(model_to_save.state_dict(), output_model_file)\n        else:\n            model_to_save.config.save_pretrained(save_directory)\n            torch.save(model_to_save.state_dict(), output_model_file)\n\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained pytorch model from a pre-trained model configuration.\n\n        The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with ``model.train()``\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            
pretrained_model_name_or_path: either:\n              - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n              - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n              - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n              - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n              - None if you are both providing the configuration and state dictionary (resp. with keyword arguments ``config`` and ``state_dict``)\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n                    - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                    - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                    - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        state_dict = kwargs.pop(\"state_dict\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_tf = kwargs.pop(\"from_tf\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                proxies=proxies,\n                local_files_only=local_files_only,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")):\n                    # Load from a TF 1.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")\n                elif from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_tf` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + \".index\"],\n                            pretrained_model_name_or_path,\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                assert (\n                    from_tf\n                ), \"We found a TensorFlow checkpoint at {}, please set from_tf to True to load from 
this checkpoint\".format(\n                    pretrained_model_name_or_path + \".index\"\n                )\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(TF2_WEIGHTS_NAME if from_tf else WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n                    cache_dir=cache_dir,\n                    force_download=force_download,\n                    proxies=proxies,\n                    resume_download=resume_download,\n                    local_files_only=local_files_only,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {WEIGHTS_NAME}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if state_dict is None and not from_tf:\n            try:\n                state_dict = torch.load(resolved_archive_file, map_location=\"cpu\")\n            except Exception:\n                raise OSError(\n                    \"Unable to load weights from pytorch checkpoint file. \"\n                    \"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. \"\n                )\n\n        missing_keys = []\n        unexpected_keys = []\n        error_msgs = []\n\n        if from_tf:\n            if resolved_archive_file.endswith(\".index\"):\n                # Load from a TensorFlow 1.X checkpoint - provided by original authors\n                model = cls.load_tf_weights(model, config, resolved_archive_file[:-6])  # Remove the '.index'\n            else:\n                # Load from our TensorFlow 2.0 checkpoints\n                try:\n                    from transformers import load_tf2_checkpoint_in_pytorch_model\n\n                    model = load_tf2_checkpoint_in_pytorch_model(model, resolved_archive_file, allow_missing_keys=True)\n                except ImportError:\n                    logger.error(\n                        \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n                        \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n                    )\n                    raise\n        else:\n            # Convert old format to new format if needed from a PyTorch state_dict\n            old_keys = []\n            new_keys = []\n            for key in state_dict.keys():\n                new_key = None\n                if \"gamma\" in key:\n                    new_key = key.replace(\"gamma\", \"weight\")\n                if \"beta\" in key:\n                    new_key = key.replace(\"beta\", \"bias\")\n                if new_key:\n                    old_keys.append(key)\n                    new_keys.append(new_key)\n            for old_key, new_key in zip(old_keys, new_keys):\n                state_dict[new_key] = state_dict.pop(old_key)\n\n            # copy state_dict so _load_from_state_dict can modify it\n            metadata = getattr(state_dict, \"_metadata\", None)\n            state_dict = state_dict.copy()\n            if metadata is not None:\n                state_dict._metadata = metadata\n\n            # PyTorch's `_load_from_state_dict` does not copy parameters in a module's descendants\n            # so we need to apply the function recursively.\n            def load(module: nn.Module, prefix=\"\"):\n                local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})\n                module._load_from_state_dict(\n                    state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs,\n                )\n                for name, child in module._modules.items():\n                    if child is not None:\n                        load(child, prefix + name + \".\")\n\n            # Make sure we are able to load base models as well as derived models (with heads)\n            start_prefix = \"\"\n            model_to_load = model\n            has_prefix_module = any(s.startswith(cls.base_model_prefix) for s in state_dict.keys())\n            if not hasattr(model, cls.base_model_prefix) and has_prefix_module:\n                start_prefix = cls.base_model_prefix + \".\"\n            if hasattr(model, cls.base_model_prefix) and not has_prefix_module:\n                model_to_load = getattr(model, cls.base_model_prefix)\n\n            load(model_to_load, prefix=start_prefix)\n\n            if model.__class__.__name__ != model_to_load.__class__.__name__:\n                base_model_state_dict = model_to_load.state_dict().keys()\n                head_model_state_dict_without_base_prefix = [\n                    key.split(cls.base_model_prefix + \".\")[-1] for key in model.state_dict().keys()\n                ]\n\n                missing_keys.extend(head_model_state_dict_without_base_prefix - base_model_state_dict)\n\n            if len(missing_keys) > 0:\n                logger.info(\n                    \"Weights of {} not initialized from pretrained model: {}\".format(\n                        model.__class__.__name__, missing_keys\n                    )\n                )\n            if len(unexpected_keys) > 0:\n                logger.info(\n                    \"Weights from pretrained model not used in {}: {}\".format(\n                        model.__class__.__name__, unexpected_keys\n                    )\n                )\n            if len(error_msgs) > 0:\n                raise RuntimeError(\n                    \"Error(s) in loading state_dict for {}:\\n\\t{}\".format(\n                        
model.__class__.__name__, \"\\n\\t\".join(error_msgs)\n                    )\n                )\n        model.tie_weights()  # make sure token embedding weights are still tied if needed\n\n        # Set model in evaluation mode to deactivate DropOut modules by default\n        model.eval()\n\n        if output_loading_info:\n            loading_info = {\n                \"missing_keys\": missing_keys,\n                \"unexpected_keys\": unexpected_keys,\n                \"error_msgs\": error_msgs,\n            }\n            return model, loading_info\n\n        if hasattr(config, \"xla_device\") and config.xla_device:\n            import torch_xla.core.xla_model as xm\n\n            model = xm.send_cpu_data_to_device(model, xm.xla_device())\n            model.to(xm.xla_device())\n\n        return model\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        return {\"input_ids\": input_ids}\n\n    def prepare_logits_for_generation(self, logits, **kwargs):\n        return logits\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):\n        \"\"\"repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). \"\"\"\n        for i in range(batch_size * num_beams):\n            for previous_token in set(prev_output_tokens[i].tolist()):\n                # if score < 0 then repetition penalty has to multiplied to reduce the previous token probability\n                if lprobs[i, previous_token] < 0:\n                    lprobs[i, previous_token] *= repetition_penalty\n                else:\n                    lprobs[i, previous_token] /= repetition_penalty\n\n    @torch.no_grad()\n    def generate(\n        self,\n        input_ids: Optional[torch.LongTensor] = None,\n        max_length: Optional[int] = None,\n        min_length: Optional[int] = None,\n        do_sample: Optional[bool] = None,\n        early_stopping: Optional[bool] = None,\n        num_beams: Optional[int] = None,\n        temperature: Optional[float] = None,\n        top_k: Optional[int] = None,\n        top_p: Optional[float] = None,\n        repetition_penalty: Optional[float] = None,\n        bad_words_ids: Optional[Iterable[int]] = None,\n        bos_token_id: Optional[int] = None,\n        pad_token_id: Optional[int] = None,\n        eos_token_id: Optional[int] = None,\n        length_penalty: Optional[float] = None,\n        no_repeat_ngram_size: Optional[int] = None,\n        num_return_sequences: Optional[int] = None,\n        attention_mask: Optional[torch.LongTensor] = None,\n        decoder_start_token_id: Optional[int] = None,\n        use_cache: Optional[bool] = None,\n        **model_specific_kwargs\n    ) -> torch.LongTensor:\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. 
_`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `torch.LongTensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between `min_length` and infinity. Default to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.\n\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.\n\n            temperature: (`optional`) float\n                The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.\n\n            pad_token_id: (`optional`) int\n                Padding token. Default to specicic model pad_token_id or None if it does not exist.\n\n            bos_token_id: (`optional`) int\n                BOS token. Defaults to `bos_token_id` as defined in the models config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to `eos_token_id` as defined in the models config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Default to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. 
Default to 1.\n\n            attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.\n\n            model_specific_kwargs: (`optional`) dict\n                Additional model specific kwargs will be forwarded to the `forward` function of the model.\n\n        Return:\n\n            output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode 
input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated\n        \"\"\"\n\n        # We cannot generate if the model does not have a LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have a LM Head.\"\n                \"Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = input_ids.shape[0]  # overriden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, 
\"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), \"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictly positive.\"\n        assert (\n            isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0\n        ), \"`no_repeat_ngram_size` should be a positive integer.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictly positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = torch.full(\n                (batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,\n            )\n        else:\n            assert input_ids.dim() == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):\n            attention_mask = input_ids.ne(pad_token_id).long()\n        elif attention_mask is None:\n            attention_mask = input_ids.new_ones(input_ids.shape)\n\n        # set pad_token_id to eos_token_id if not set. Important that this is done after\n        # attention_mask is created\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        if hasattr(self.config, \"vocab_size\"):\n            vocab_size = self.config.vocab_size\n        elif (\n            self.config.is_encoder_decoder\n            and hasattr(self.config, \"decoder\")\n            and hasattr(self.config.decoder, \"vocab_size\")\n        ):\n            vocab_size = self.config.decoder.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = input_ids.shape[-1]\n            input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)\n            attention_mask = attention_mask.unsqueeze(1).expand(\n                batch_size, effective_batch_mult * num_beams, input_ids_len\n            )\n\n            input_ids = input_ids.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = attention_mask.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n            # create empty decoder_input_ids\n            input_ids = torch.full(\n                (effective_batch_size * num_beams, 1),\n                decoder_start_token_id,\n                dtype=torch.long,\n                device=next(self.parameters()).device,\n            )\n            cur_len = 
1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = (\n                torch.arange(batch_size)\n                .view(-1, 1)\n                .repeat(1, num_beams * effective_batch_mult)\n                .view(-1)\n                .to(input_ids.device)\n            )\n            # expand encoder_outputs\n            encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = input_ids.shape[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        
model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequence are generated independantly.\n        \"\"\"\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = input_ids.new(batch_size).fill_(1)\n        sent_lengths = input_ids.new(batch_size).fill_(max_length)\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(next_token_logits, batch_size, 1, input_ids, repetition_penalty)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                next_token_logits[:, eos_token_id] = -float(\"inf\")\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)\n                # Sample\n                probs = F.softmax(next_token_logits, dim=-1)\n                next_token = torch.multinomial(probs, num_samples=1).squeeze(1)\n            else:\n                # Greedy decoding\n                next_token = torch.argmax(next_token_logits, dim=-1)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)\n            cur_len = cur_len + 
1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()\n                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents.mul_((~eos_in_sents).long())\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if unfinished_sents.max() == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            decoded = input_ids.new(batch_size, sent_lengths.max().item()).fill_(pad_token_id)\n        else:\n            decoded = input_ids\n\n        for hypo_idx, hypo in enumerate(input_ids):\n            decoded[hypo_idx, : sent_lengths[hypo_idx]] = hypo[: sent_lengths[hypo_idx]]\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # scores for each sentence in the beam\n        beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores[:, 1:] = -1e9\n        beam_scores = beam_scores.view(-1)  # shape (batch_size * num_beams,)\n\n        # cache compute states\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n            outputs = 
self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(\n                    next_token_logits, batch_size, num_beams, input_ids, repetition_penalty,\n                )\n\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            if self.config.is_encoder_decoder and do_sample is False:\n                # TODO (PVP) still a bit hacky here - there might be a better solution\n                next_token_logits = self.prepare_logits_for_generation(\n                    next_token_logits, cur_len=cur_len, max_length=max_length\n                )\n\n            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                scores[:, eos_token_id] = -float(\"inf\")\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                num_batch_hypotheses = batch_size * num_beams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_batch_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                for i, banned_tokens in enumerate(banned_batch_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for i, banned_tokens in enumerate(banned_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            assert scores.shape == (batch_size * num_beams, vocab_size), \"Shapes of scores: {} != {}\".format(\n                scores.shape, (batch_size * num_beams, vocab_size)\n            )\n\n            if do_sample:\n                _scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n                # Top-p/top-k filtering\n                _scores = top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # re-organize to group the beam together to sample from all beam_idxs\n                _scores = _scores.contiguous().view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                probs = F.softmax(_scores, dim=-1)\n                next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)  # (batch_size, num_beams * 2)\n                # Compute next scores\n                
next_scores = torch.gather(_scores, -1, next_tokens)  # (batch_size, num_beams * 2)\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)\n                next_tokens = torch.gather(next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)\n\n            else:\n                next_scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                next_scores = next_scores.view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)\n\n            assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.item() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            input_ids[effective_beam_id].clone(), beam_token_score.item(),\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n 
                   next_scores[batch_idx].max().item(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = beam_scores.new([x[0] for x in next_batch_beam])\n            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])\n            beam_idx = input_ids.new([x[2] for x in next_batch_beam])\n\n            # re-order batch and update current length\n            input_ids = input_ids[beam_idx, :]\n            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            if done[batch_idx]:\n                continue\n\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert torch.all(\n                    next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths = input_ids.new(output_batch_size)\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                effective_batch_idx = output_num_return_sequences_per_batch * i + j\n                best_hyp = sorted_hyps.pop()[1]\n                
sent_lengths[effective_batch_idx] = len(best_hyp)\n                best.append(best_hyp)\n\n        # shorter batches are filled with pad_token\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined\"\n            sent_max_len = min(sent_lengths.max().item() + 1, max_length)\n            decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                decoded[i, : sent_lengths[i]] = hypo\n                if sent_lengths[i] < max_length:\n                    decoded[i, sent_lengths[i]] = eos_token_id\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:\n        return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:\n    \"\"\"Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())\n        return generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            
banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef top_k_top_p_filtering(\n    logits: Tensor,\n    top_k: int = 0,\n    top_p: float = 1.0,\n    filter_value: float = -float(\"Inf\"),\n    min_tokens_to_keep: int = 1,\n) -> Tensor:\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]\n        logits[indices_to_remove] = filter_value\n\n    if top_p < 1.0:\n        sorted_logits, sorted_indices = torch.sort(logits, descending=True)\n        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()\n        sorted_indices_to_remove[..., 0] = 0\n\n        # scatter sorted tensors to original indexing\n        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)\n        logits[indices_to_remove] = filter_value\n    return logits\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, 
best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass Conv1D(nn.Module):\n    def __init__(self, nf, nx):\n        \"\"\" Conv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__()\n        self.nf = nf\n        w = torch.empty(nx, nf)\n        nn.init.normal_(w, std=0.02)\n        self.weight = nn.Parameter(w)\n        self.bias = nn.Parameter(torch.zeros(nf))\n\n    def forward(self, x):\n        size_out = x.size()[:-1] + (self.nf,)\n        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)\n        x = x.view(*size_out)\n        return x\n\n\nclass PoolerStartLogits(nn.Module):\n    \"\"\" Compute SQuAD start_logits from sequence hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, p_mask=None):\n        \"\"\" Args:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape `(batch_size, seq_len)`\n                invalid position mask such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        x = self.dense(hidden_states).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerEndLogits(nn.Module):\n    \"\"\" Compute SQuAD end_logits from sequence hidden states and start token hidden state.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense_1 = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, p_mask=None):\n        \"\"\" Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to hidden_states\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n                Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of 
start_states, start_positions should be not None\"\n        if start_positions is not None:\n            slen, hsz = hidden_states.shape[-2:]\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions)  # shape (bsz, 1, hsz)\n            start_states = start_states.expand(-1, slen, -1)  # shape (bsz, slen, hsz)\n\n        x = self.dense_0(torch.cat([hidden_states, start_states], dim=-1))\n        x = self.activation(x)\n        x = self.LayerNorm(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerAnswerClass(nn.Module):\n    \"\"\" Compute SQuAD 2.0 answer class from classification and start tokens hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.dense_1 = nn.Linear(config.hidden_size, 1, bias=False)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, cls_index=None):\n        \"\"\"\n        Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to ``hidden_states``.\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span.\n            **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n                position of the CLS token. 
If None, take the last token.\n\n            note(Original repo):\n                no dependency on end_feature so that we can obtain one single `cls_logits`\n                for each sample\n        \"\"\"\n        hsz = hidden_states.shape[-1]\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of start_states, start_positions should be not None\"\n        if start_positions is not None:\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions).squeeze(-2)  # shape (bsz, hsz)\n\n        if cls_index is not None:\n            cls_index = cls_index[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            cls_token_state = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, hsz)\n        else:\n            cls_token_state = hidden_states[:, -1, :]  # shape (bsz, hsz)\n\n        x = self.dense_0(torch.cat([start_states, cls_token_state], dim=-1))\n        x = self.activation(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        return x\n\n\nclass SQuADHead(nn.Module):\n    r\"\"\" A SQuAD head inspired by XLNet.\n\n    Parameters:\n        config (:class:`~transformers.XLNetConfig`): Model configuration class with all the parameters of the model.\n\n    Inputs:\n        **hidden_states**: ``torch.FloatTensor`` of shape ``(batch_size, seq_len, hidden_size)``\n            hidden states of sequence tokens\n        **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the first token for the labeled span.\n        **end_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the last token for the labeled span.\n        **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n            position of the CLS token. 
If None, take the last token.\n        **is_impossible**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            Whether the question has a possible answer in the paragraph or not.\n        **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n            Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n            1.0 means token should be masked.\n\n    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n        **loss**: (`optional`, returned if both ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size,)``\n            Log probabilities for the ``is_impossible`` label of the answers.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n    def forward(\n        self, hidden_states, start_positions=None, end_positions=None, cls_index=None, is_impossible=None, p_mask=None,\n    ):\n        outputs = ()\n\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = 
loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n                total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)\n            cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits,) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n\n\nclass SequenceSummary(nn.Module):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to 
hidden_size). Default: False.\n            summary_activation: 'tanh' or another string => add an activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config: PretrainedConfig):\n        super().__init__()\n\n        self.summary_type = getattr(config, \"summary_type\", \"last\")\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.summary = Identity()\n        if hasattr(config, \"summary_use_proj\") and config.summary_use_proj:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = nn.Linear(config.hidden_size, num_classes)\n\n        activation_string = getattr(config, \"summary_activation\", None)\n        self.activation: Callable = (get_activation(activation_string) if activation_string else Identity())\n\n        self.first_dropout = Identity()\n        if hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0:\n            self.first_dropout = nn.Dropout(config.summary_first_dropout)\n\n        self.last_dropout = Identity()\n        if hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0:\n            self.last_dropout = nn.Dropout(config.summary_last_dropout)\n\n    def forward(self, hidden_states, cls_index=None):\n        \"\"\" hidden_states: float Tensor in shape [bsz, ..., seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... 
are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = hidden_states.mean(dim=1)\n        elif self.summary_type == \"cls_index\":\n            if cls_index is None:\n                cls_index = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2] - 1, dtype=torch.long,)\n            else:\n                cls_index = cls_index.unsqueeze(-1).unsqueeze(-1)\n                cls_index = cls_index.expand((-1,) * (cls_index.dim() - 1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, XX, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        output = self.first_dropout(output)\n        output = self.summary(output)\n        output = self.activation(output)\n        output = self.last_dropout(output)\n\n        return output\n\n\ndef create_position_ids_from_input_ids(input_ids, padding_idx):\n    \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n    padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n    `utils.make_positions`.\n\n    :param torch.Tensor x:\n    :return torch.Tensor:\n    \"\"\"\n    # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.\n    mask = input_ids.ne(padding_idx).int()\n    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask\n    return incremental_indices.long() + padding_idx\n\n\ndef prune_linear_layer(layer, index, dim=0):\n    \"\"\" Prune a linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if layer.bias is not None:\n        if dim == 1:\n            b = layer.bias.clone().detach()\n        else:\n            b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    if layer.bias is not None:\n        new_layer.bias.requires_grad = False\n        new_layer.bias.copy_(b.contiguous())\n        new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_conv1d_layer(layer, index, dim=1):\n    \"\"\" Prune a Conv1D layer (a model parameters) to keep only entries in index.\n        A Conv1D work as a Linear layer (see e.g. 
BERT) but the weights are transposed.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if dim == 0:\n        b = layer.bias.clone().detach()\n    else:\n        b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = Conv1D(new_size[1], new_size[0]).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    new_layer.bias.requires_grad = False\n    new_layer.bias.copy_(b.contiguous())\n    new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_layer(layer, index, dim=None):\n    \"\"\" Prune a Conv1D or nn.Linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    if isinstance(layer, nn.Linear):\n        return prune_linear_layer(layer, index, dim=0 if dim is None else dim)\n    elif isinstance(layer, Conv1D):\n        return prune_conv1d_layer(layer, index, dim=1 if dim is None else dim)\n    else:\n        raise ValueError(\"Can't prune layer of class {}\".format(layer.__class__))\n\n\ndef apply_chunking_to_forward(\n    chunk_size: int, chunk_dim: int, forward_fn: Callable[..., torch.Tensor], *input_tensors\n) -> torch.Tensor:\n    \"\"\"\n    This function chunks the `input_tensors` into smaller input tensor parts of size `chunk_size` over the dimension `chunk_dim`.\n    It then applies a layer `forward_fn` to each chunk independently to save memory.\n    If the `forward_fn` is independent across the `chunk_dim` this function will yield the\n    same result as not applying it.\n\n    Args:\n        chunk_size: int - the chunk size of a chunked tensor. 
`num_chunks` = `len(input_tensors[0]) / chunk_size`\n        chunk_dim: int - the dimension over which the input_tensors should be chunked\n        forward_fn: fn - the forward fn of the model\n        input_tensors: tuple(torch.Tensor) - the input tensors of `forward_fn` which are chunked\n    Returns:\n        a Tensor with the same shape the foward_fn would have given if applied\n\n\n    Examples::\n\n        # rename the usual forward() fn to forward_chunk()\n        def forward_chunk(self, hidden_states):\n            hidden_states = self.decoder(hidden_states)\n            return hidden_states\n\n        # implement a chunked forward function\n        def forward(self, hidden_states):\n            return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n    \"\"\"\n\n    assert len(input_tensors) > 0, \"{} has to be a tuple/list of tensors\".format(input_tensors)\n    tensor_shape = input_tensors[0].shape\n    assert all(\n        input_tensor.shape == tensor_shape for input_tensor in input_tensors\n    ), \"All input tenors have to be of the same shape\"\n\n    # inspect.signature exist since python 3.5 and is a python method -> no problem with backward compability\n    num_args_in_forward_chunk_fn = len(inspect.signature(forward_fn).parameters)\n    assert num_args_in_forward_chunk_fn == len(\n        input_tensors\n    ), \"forward_chunk_fn expects {} arguments, but only {} input tensors are given\".format(\n        num_args_in_forward_chunk_fn, len(input_tensors)\n    )\n\n    if chunk_size > 0:\n        assert (\n            input_tensors[0].shape[chunk_dim] % chunk_size == 0\n        ), \"The dimension to be chunked {} has to be a multiple of the chunk size {}\".format(\n            input_tensors[0][chunk_dim], chunk_size\n        )\n\n        num_chunks = input_tensors[0].shape[chunk_dim] // chunk_size\n\n        # chunk input tensor into tuples\n        input_tensors_chunks = tuple(input_tensor.chunk(num_chunks, dim=chunk_dim) for input_tensor in input_tensors)\n        # apply forward fn to every tuple\n        output_chunks = tuple(forward_fn(*input_tensors_chunk) for input_tensors_chunk in zip(*input_tensors_chunks))\n        # concatenate output at same dimension\n        return torch.cat(output_chunks, dim=chunk_dim)\n\n    return forward_fn(*input_tensors)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, SequenceSummary, SQuADHead, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    alen = torch.arange(slen, dtype=torch.long, device=lengths.device)\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        assert lengths.max().item() <= slen\n        mask = alen < lengths[:, None]\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    bs = lengths.size(0)\n    if causal:\n        attn_mask = alen[None, None, :].repeat(bs, slen, 1) <= alen[None, :, None]\n    else:\n        attn_mask = mask\n\n    # sanity check\n    assert mask.size() == (bs, slen)\n    assert causal is False or attn_mask.size() == (bs, slen, slen)\n\n    return mask, attn_mask\n\n\nclass MultiHeadAttention(nn.Module):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config):\n        super().__init__()\n        self.layer_id = next(MultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        self.dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(dim, dim)\n        self.k_lin = nn.Linear(dim, dim)\n        self.v_lin = nn.Linear(dim, dim)\n        self.out_lin = nn.Linear(dim, dim)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, 
attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input, mask, kv=None, cache=None, head_mask=None):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = input.size()\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = kv.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if mask.dim() == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)\n        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, qlen, klen)\n\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # 
(bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TransformerFFN(nn.Module):\n    def __init__(self, in_dim, dim_hidden, out_dim, config):\n        super().__init__()\n        self.dropout = config.dropout\n        self.lin1 = nn.Linear(in_dim, dim_hidden)\n        self.lin2 = nn.Linear(dim_hidden, out_dim)\n        self.act = gelu if config.gelu_activation else F.relu\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        return x\n\n\nclass XLMPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    load_tf_weights = None\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    @property\n    def dummy_inputs(self):\n        inputs_list = torch.tensor([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights. \"\"\"\n        if isinstance(module, nn.Embedding):\n            if self.config is not None and self.config.embed_init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.embed_init_std)\n        if isinstance(module, nn.Linear):\n            if self.config is not None and self.config.init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.init_std)\n                if hasattr(module, \"bias\") and module.bias is not None:\n                    nn.init.constant_(module.bias, 0.0)\n        if isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        langs (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass XLMModel(XLMPreTrainedModel):\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        self.dropout = config.dropout\n        self.attention_dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, self.dim)\n        if config.sinusoidal_embeddings:\n            create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = nn.Embedding(self.n_langs, self.dim)\n        self.embeddings = nn.Embedding(self.n_words, self.dim, padding_idx=self.pad_index)\n        self.layer_norm_emb = nn.LayerNorm(self.dim, eps=config.layer_norm_eps)\n\n        # transformer layers\n        self.attentions = nn.ModuleList()\n        self.layer_norm1 = nn.ModuleList()\n        self.ffns = nn.ModuleList()\n        self.layer_norm2 = nn.ModuleList()\n        # if self.is_decoder:\n        #     self.layer_norm15 = nn.ModuleList()\n        #     
self.encoder_attn = nn.ModuleList()\n\n        for _ in range(self.n_layers):\n            self.attentions.append(MultiHeadAttention(self.n_heads, self.dim, config=config))\n            self.layer_norm1.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(TransformerFFN(self.dim, self.hidden_dim, self.dim, config=config))\n            self.layer_norm2.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.attentions[layer].prune_heads(heads)\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = F.dropout(attn, p=self.dropout, training=self.training)\n            tensor = tensor + attn\n            
tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass XLMPredLayer(nn.Module):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        dim = config.emb_dim\n\n        if config.asm is False:\n            self.proj = nn.Linear(dim, config.n_words, bias=True)\n        else:\n            self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n                in_features=dim,\n                n_classes=config.n_words,\n                cutoffs=config.asm_cutoffs,\n                div_value=config.asm_div_value,\n                head_bias=True,  # default is False\n            )\n\n    def forward(self, x, y=None):\n        \"\"\" Compute the loss, and optionally the scores.\n        \"\"\"\n        outputs = ()\n        if self.asm is False:\n            scores = self.proj(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                loss = F.cross_entropy(scores.view(-1, self.n_words), y.view(-1), reduction=\"elementwise_mean\")\n                outputs = (loss,) + outputs\n        else:\n            scores = self.proj.log_prob(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                _, loss = self.proj(x, y)\n                outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMWithLMHeadModel(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = XLMModel(config)\n        self.pred_layer = XLMPredLayer(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.pred_layer.proj\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = input_ids.shape[0]\n        mask_token = torch.full((effective_batch_size, 1), mask_token_id, dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, mask_token], dim=1)\n        if lang_id is not None:\n            langs = torch.full_like(input_ids, lang_id)\n        else:\n            langs = None\n        return {\"input_ids\": input_ids, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMWithLMHeadModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0) 
 # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output, labels)\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForSequenceClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.sequence_summary = SequenceSummary(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import XLMTokenizer, XLMForSequenceClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        logits = self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnsweringSimple(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = 
torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (\n            start_logits,\n            end_logits,\n        )\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnswering(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = SQuADHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnswering\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            
position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n\n        outputs = self.qa_outputs(\n            output,\n            start_positions=start_positions,\n            end_positions=end_positions,\n            cls_index=cls_index,\n            is_impossible=is_impossible,\n            p_mask=p_mask,\n        )\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForTokenClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForTokenClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-100-1280')\n        model = XLMForTokenClassification.from_pretrained('xlm-mlm-100-1280')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n 
       outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-roberta-base\",\n    \"xlm-roberta-large\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\",\n    \"xlm-roberta-large-finetuned-conll03-english\",\n    \"xlm-roberta-large-finetuned-conll03-german\",\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/modeling_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu_new, swish\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PoolerAnswerClass, PoolerEndLogits, PoolerStartLogits, PreTrainedModel, SequenceSummary\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef build_tf_xlnet_to_pytorch_map(model, config, tf_weights=None):\n    \"\"\" A map of modules from TF to PyTorch.\n        I use a map to keep the PyTorch model as\n        identical to the original PyTorch model as possible.\n    \"\"\"\n\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        if hasattr(model, \"lm_loss\"):\n            # We will load also the output bias\n            tf_to_pt_map[\"model/lm_loss/bias\"] = model.lm_loss.bias\n        if hasattr(model, \"sequence_summary\") and \"model/sequnece_summary/summary/kernel\" in tf_weights:\n            # We will load also the sequence summary\n            tf_to_pt_map[\"model/sequnece_summary/summary/kernel\"] = model.sequence_summary.summary.weight\n            tf_to_pt_map[\"model/sequnece_summary/summary/bias\"] = model.sequence_summary.summary.bias\n        if (\n            hasattr(model, \"logits_proj\")\n            and config.finetuning_task is not None\n            and \"model/regression_{}/logit/kernel\".format(config.finetuning_task) in tf_weights\n        ):\n            tf_to_pt_map[\"model/regression_{}/logit/kernel\".format(config.finetuning_task)] = model.logits_proj.weight\n            tf_to_pt_map[\"model/regression_{}/logit/bias\".format(config.finetuning_task)] = model.logits_proj.bias\n\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings and output\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/word_embedding/lookup_table\": model.word_embedding.weight,\n            \"model/transformer/mask_emb/mask_emb\": model.mask_emb,\n        }\n    )\n\n    # Transformer blocks\n    for i, b in enumerate(model.layer):\n        layer_str = \"model/transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.rel_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": b.rel_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.rel_attn.o,\n                layer_str + 
\"rel_attn/q/kernel\": b.rel_attn.q,\n                layer_str + \"rel_attn/k/kernel\": b.rel_attn.k,\n                layer_str + \"rel_attn/r/kernel\": b.rel_attn.r,\n                layer_str + \"rel_attn/v/kernel\": b.rel_attn.v,\n                layer_str + \"ff/LayerNorm/gamma\": b.ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.ff.layer_1.weight,\n                layer_str + \"ff/layer_1/bias\": b.ff.layer_1.bias,\n                layer_str + \"ff/layer_2/kernel\": b.ff.layer_2.weight,\n                layer_str + \"ff/layer_2/bias\": b.ff.layer_2.bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        r_s_list = []\n        seg_embed_list = []\n        for b in model.layer:\n            r_r_list.append(b.rel_attn.r_r_bias)\n            r_w_list.append(b.rel_attn.r_w_bias)\n            r_s_list.append(b.rel_attn.r_s_bias)\n            seg_embed_list.append(b.rel_attn.seg_embed)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n        r_s_list = [model.r_s_bias]\n        seg_embed_list = [model.seg_embed]\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/r_r_bias\": r_r_list,\n            \"model/transformer/r_w_bias\": r_w_list,\n            \"model/transformer/r_s_bias\": r_s_list,\n            \"model/transformer/seg_embed\": seg_embed_list,\n        }\n    )\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_xlnet(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_xlnet_to_pytorch_map(model, config, tf_weights)\n\n    for name, pointer in tf_to_pt_map.items():\n        logger.info(\"Importing {}\".format(name))\n        if name not in tf_weights:\n            logger.info(\"{} not in tf pre-trained weights, skipping\".format(name))\n            continue\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name and (\"ff\" in name or \"summary\" in name or \"logit\" in name):\n            logger.info(\"Transposing\")\n            array = np.transpose(array)\n        if isinstance(pointer, list):\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nACT2FN = {\"gelu\": gelu_new, \"relu\": torch.nn.functional.relu, \"swish\": swish}\n\n\nXLNetLayerNorm = nn.LayerNorm\n\n\nclass XLNetRelativeAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n\n        self.q = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.k = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.v = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.o = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.r = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n\n        self.r_r_bias = 
nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_s_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.seg_embed = nn.Parameter(torch.FloatTensor(2, self.n_head, self.d_head))\n\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def rel_shift(x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = x.shape\n\n        x = x.reshape(x_size[1], x_size[0], x_size[2], x_size[3])\n        x = x[1:, ...]\n        x = x.reshape(x_size[0], x_size[1] - 1, x_size[2], x_size[3])\n        # x = x[:, 0:klen, :, :]\n        x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    @staticmethod\n    def rel_shift_bnij(x, klen=-1):\n        x_size = x.shape\n\n        x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])\n        x = x[:, :, 1:, :]\n        x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3] - 1)\n        # Note: the tensor-slice form was faster in my testing than torch.index_select\n        #       However, tracing doesn't like the nature of the slice, and if klen changes\n        #       during the run then it'll fail, whereas index_select will be fine.\n        x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))\n        # x = x[:, :, :, :klen]\n\n        return x\n\n    def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        # content based attention score\n        ac = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift_bnij(bd, klen=ac.shape[3])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = torch.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = torch.einsum(\"ijbs,ibns->bnij\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == torch.float16:\n                attn_score = attn_score - 65500 * torch.einsum(\"ijbn->bnij\", attn_mask)\n            else:\n                attn_score = attn_score - 1e30 * torch.einsum(\"ijbn->bnij\", attn_mask)\n\n        # attention probability\n        attn_prob = F.softmax(attn_score, dim=3)\n        attn_prob = self.dropout(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * torch.einsum(\"ijbn->bnij\", head_mask)\n\n        # attention output\n        attn_vec = torch.einsum(\"bnij,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, torch.einsum(\"bnij->ijbn\", attn_prob)\n\n        return attn_vec\n\n    def post_attention(self, h, attn_vec, residual=True):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to 
`d_model`)\n        attn_out = torch.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out)\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def forward(self, h, g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None):\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec_h)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = torch.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = torch.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = torch.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention(g, attn_vec_g)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n              
  attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + (attn_prob,)\n        return outputs\n\n\nclass XLNetFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.layer_1 = nn.Linear(config.d_model, config.d_inner)\n        self.layer_2 = nn.Linear(config.d_inner, config.d_model)\n        self.dropout = nn.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def forward(self, inp):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output)\n        output = self.layer_2(output)\n        output = self.dropout(output)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass XLNetLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.rel_attn = XLNetRelativeAttention(config)\n        self.ff = XLNetFeedForward(config)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(\n        self, output_h, output_g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None\n    ):\n        outputs = self.rel_attn(\n            output_h,\n            output_g,\n            attn_mask_h,\n            attn_mask_g,\n            r,\n            seg_mat,\n            mems=mems,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n        )\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g)\n        output_h = self.ff(output_h)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass XLNetPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    load_tf_weights = load_tf_weights_in_xlnet\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, nn.Linear) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, XLNetLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        elif isinstance(module, XLNetRelativeAttention):\n            for param in [\n                module.q,\n                module.k,\n                module.v,\n                module.o,\n                module.r,\n                module.r_r_bias,\n                module.r_s_bias,\n                module.r_w_bias,\n                module.seg_embed,\n            ]:\n                param.data.normal_(mean=0.0, 
std=self.config.initializer_range)\n        elif isinstance(module, XLNetModel):\n            module.mask_emb.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n            `use_cache` has to be set to `True` to make use of `mems`.\n        perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token. The classifier token should be represented by a ``2``.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n\n        self.word_embedding = nn.Embedding(config.vocab_size, config.d_model)\n        self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model))\n        self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)])\n        self.dropout = nn.Dropout(config.dropout)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_embedding = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: Sequence length\n            mlen: Mask length\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = torch.ones([qlen, qlen])\n        mask_up = torch.triu(attn_mask, diagonal=1)\n        attn_mask_pad = torch.zeros([qlen, mlen])\n        ret = torch.cat([attn_mask_pad, mask_up], dim=1)\n        if self.same_length:\n            mask_lo = torch.tril(attn_mask, diagonal=-1)\n            ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)\n\n        ret = ret.to(self.device)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        # cache hidden states into memory.\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len :]\n\n        return new_mem.detach()\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = torch.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = torch.cat([torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)], dim=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = pos_emb.expand(-1, bsz, -1)\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None):\n        # create relative positional encoding.\n        freq_seq = torch.arange(0, self.d_model, 2.0, dtype=torch.float)\n        inv_freq = 1 / torch.pow(10000, (freq_seq / 
self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` {}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = torch.arange(beg, end, -1.0, dtype=torch.float)\n            bwd_pos_seq = torch.arange(-beg, -end, 1.0, dtype=torch.float)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n                bwd_pos_seq = bwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)\n        else:\n            fwd_pos_seq = torch.arange(beg, end, -1.0)\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        pos_emb = pos_emb.to(self.device)\n        return pos_emb\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetModel.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=False)).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.shape[0], input_ids.shape[1]\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = token_type_ids.transpose(0, 1).contiguous() if token_type_ids is not None else None\n        input_mask = input_mask.transpose(0, 1).contiguous() if input_mask is not None else None\n        attention_mask = attention_mask.transpose(0, 1).contiguous() if attention_mask is not None else None\n        perm_mask = perm_mask.permute(1, 2, 0).contiguous() if perm_mask is not None else None\n        target_mapping = target_mapping.permute(1, 2, 0).contiguous() if target_mapping is not None else None\n\n        mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = self.dtype\n        device = self.device\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        
assert input_mask is None or attention_mask is None, \"You can only use one of input_mask (uses 1 for padding) \"\n        \"or attention_mask (uses 0 for padding, added for compatbility with BERT). Please choose one.\"\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - attention_mask\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = torch.zeros([data_mask.shape[0], mlen, bsz]).to(data_mask)\n                data_mask = torch.cat([mems_mask, data_mask], dim=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = (attn_mask > 0).to(dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -torch.eye(qlen).to(attn_mask)\n            if mlen > 0:\n                non_tgt_mask = torch.cat([torch.zeros([qlen, mlen]).to(attn_mask), non_tgt_mask], dim=-1)\n            non_tgt_mask = ((attn_mask + non_tgt_mask[:, :, None, None]) > 0).to(attn_mask)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k)\n        if target_mapping is not None:\n            word_emb_q = self.mask_emb.expand(target_mapping.shape[0], bsz, -1)\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = torch.zeros([mlen, bsz], dtype=torch.long, device=device)\n                cat_ids = torch.cat([mem_pad, token_type_ids], dim=0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = (token_type_ids[:, None] != cat_ids[None, :]).long()\n            seg_mat = F.one_hot(seg_mat, num_classes=2).to(dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz)\n        pos_emb = self.dropout(pos_emb)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = 
head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n        if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                # cache new mems\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                output_h,\n                output_g,\n                attn_mask_h=non_tgt_mask,\n                attn_mask_g=attn_mask,\n                r=pos_emb,\n                seg_mat=seg_mat,\n                mems=mems[i],\n                target_mapping=target_mapping,\n                head_mask=head_mask[i],\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (output.permute(1, 0, 2).contiguous(),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(hs.permute(1, 0, 2).contiguous() for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            if target_mapping is not None:\n                # when target_mapping is provided, there are 2-tuple of attentions\n                attentions = tuple(\n                    tuple(att_stream.permute(2, 3, 0, 1).contiguous() for att_stream in t) for t in attentions\n                )\n            else:\n                attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetLMHeadModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.attn_type = config.attn_type\n        self.same_length = config.same_length\n\n        self.transformer = XLNetModel(config)\n        self.lm_loss = nn.Linear(config.d_model, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_loss\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = input_ids.shape[0]\n        dummy_token = torch.zeros((effective_batch_size, 1), dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = input_ids.shape[1]\n        perm_mask = torch.zeros(\n            (effective_batch_size, sequence_length, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        perm_mask[:, :, -1] = 1.0\n\n        # We'll only predict the last token\n        target_mapping = torch.zeros(\n            (effective_batch_size, 1, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        target_mapping[0, 0, -1] = 1.0\n\n        inputs = {\n            \"input_ids\": input_ids,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`, defaults to :obj:`None`):\n            Labels for masked language modeling.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n            The labels should correspond to the masked input words that should be predicted and depends on `target_mapping`. 
Note in order to perform standard auto-regressive language modeling a `<mask>` token has to be added to the `input_ids` (see `prepare_inputs_for_generation` fn and examples below)\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored, the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetLMHeadModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        # The same way can the XLNetLMHeadModel be used to be 
trained by standard auto-regressive language modeling.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        labels = torch.tensor(tokenizer.encode(\"cute\", add_special_tokens=False)).unsqueeze(0)\n        assert labels.shape[0] == 1, 'only one word will be predicted'\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token as is done in standard auto-regressive lm training\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)\n        loss, next_token_logits = outputs[:2]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        logits = self.lm_loss(transformer_outputs[0])\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForSequenceClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForSequenceClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForTokenClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForTokenClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RACE/SWAG tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForMultipleChoice(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        input_mask=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForMultipleChoice\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForMultipleChoice.from_pretrained('xlnet-base-cased')\n\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_input_mask = input_mask.view(-1, input_mask.size(-1)) if input_mask is not None else None\n\n        transformer_outputs = self.transformer(\n            flat_input_ids,\n            token_type_ids=flat_token_type_ids,\n            input_mask=flat_input_mask,\n            attention_mask=flat_attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n        reshaped_logits = logits.view(-1, num_choices)\n        outputs = (reshaped_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to 
compute `span start logits` and `span end logits`). \"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnswering(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.transformer = XLNetModel(config)\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnswering\n        import torch\n\n        tokenizer =  XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnswering.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n    
            total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\n                \"blh,bl->bh\", hidden_states, start_log_probs\n            )  # get the representation of START as weighted sum of hidden states\n            cls_logits = self.answer_class(\n                hidden_states, start_states=start_states, cls_index=cls_index\n            )  # Shape (batch size,): one single `cls_logits` for each sample\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/optimization.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch optimization for BERT model.\"\"\"\n\nimport logging\nimport math\n\nimport torch\nfrom torch.optim import Optimizer\nfrom torch.optim.lr_scheduler import LambdaLR\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_constant_schedule(optimizer, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate.\n    \"\"\"\n    return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch)\n\n\ndef get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate preceded by a warmup\n    period during which the learning rate increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1.0, num_warmup_steps))\n        return 1.0\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)\n\n\ndef get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases linearly after\n    linearly increasing during a warmup period.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        return max(\n            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))\n        )\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function between 0 and `pi * cycles` after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_with_hard_restarts_schedule_with_warmup(\n    optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1\n):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function with several hard restarts, after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = 
float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        if progress >= 1.0:\n            return 0.0\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((float(num_cycles) * progress) % 1.0))))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\nclass AdamW(Optimizer):\n    \"\"\" Implements Adam algorithm with weight decay fix.\n\n    Parameters:\n        lr (float): learning rate. Default 1e-3.\n        betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999)\n        eps (float): Adams epsilon. Default: 1e-6\n        weight_decay (float): Weight decay. Default: 0.0\n        correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.\n    \"\"\"\n\n    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True):\n        if lr < 0.0:\n            raise ValueError(\"Invalid learning rate: {} - should be >= 0.0\".format(lr))\n        if not 0.0 <= betas[0] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[0]))\n        if not 0.0 <= betas[1] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[1]))\n        if not 0.0 <= eps:\n            raise ValueError(\"Invalid epsilon value: {} - should be >= 0.0\".format(eps))\n        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)\n        super().__init__(params, defaults)\n\n    def step(self, closure=None):\n        \"\"\"Performs a single optimization step.\n\n        Arguments:\n            closure (callable, optional): A closure that reevaluates the model\n                and returns the loss.\n        \"\"\"\n        loss = None\n        if closure is not None:\n            loss = closure()\n\n        for group in self.param_groups:\n            for p in group[\"params\"]:\n                if p.grad is None:\n                    continue\n                grad = p.grad.data\n                if grad.is_sparse:\n                    raise RuntimeError(\"Adam does not support sparse gradients, please consider SparseAdam instead\")\n\n                state = self.state[p]\n\n                # State initialization\n                if len(state) == 0:\n                    state[\"step\"] = 0\n                    # Exponential moving average of gradient values\n                    state[\"exp_avg\"] = torch.zeros_like(p.data)\n                    # Exponential moving average of squared gradient values\n                    state[\"exp_avg_sq\"] = torch.zeros_like(p.data)\n\n                exp_avg, exp_avg_sq = state[\"exp_avg\"], state[\"exp_avg_sq\"]\n                beta1, beta2 = group[\"betas\"]\n\n                state[\"step\"] += 1\n\n                # Decay the first and second moment running average coefficient\n                # In-place operations to update the averages at the same time\n                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)\n                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)\n                denom = exp_avg_sq.sqrt().add_(group[\"eps\"])\n\n                step_size = group[\"lr\"]\n                if group[\"correct_bias\"]:  # No bias correction for Bert\n                    bias_correction1 = 1.0 - beta1 ** state[\"step\"]\n                    bias_correction2 = 1.0 - beta2 ** state[\"step\"]\n                    step_size = 
step_size * math.sqrt(bias_correction2) / bias_correction1\n\n                p.data.addcdiv_(exp_avg, denom, value=-step_size)\n\n                # Just adding the square of the weights to the loss function is *not*\n                # the correct way of using L2 regularization/weight decay with Adam,\n                # since that will interact with the m and v parameters in strange ways.\n                #\n                # Instead we want to decay the weights in a manner that doesn't interact\n                # with the m/v parameters. This is equivalent to adding the square\n                # of the weights to the loss with plain (non-momentum) SGD.\n                # Add weight decay at the end (fixed version)\n                if group[\"weight_decay\"] > 0.0:\n                    p.data.add_(p.data, alpha=-group[\"lr\"] * group[\"weight_decay\"])\n\n        return loss\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/optimization_tf.py",
    "content": "# Copyright 2019 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"Functions and classes related to optimization (weight updates).\"\"\"\n\n\nimport re\n\nimport tensorflow as tf\n\n\nclass WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):\n    \"\"\"Applies a warmup schedule on a given learning rate decay schedule.\"\"\"\n\n    def __init__(\n        self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None,\n    ):\n        super().__init__()\n        self.initial_learning_rate = initial_learning_rate\n        self.warmup_steps = warmup_steps\n        self.power = power\n        self.decay_schedule_fn = decay_schedule_fn\n        self.name = name\n\n    def __call__(self, step):\n        with tf.name_scope(self.name or \"WarmUp\") as name:\n            # Implements polynomial warmup. i.e., if global_step < warmup_steps, the\n            # learning rate will be `global_step/num_warmup_steps * init_lr`.\n            global_step_float = tf.cast(step, tf.float32)\n            warmup_steps_float = tf.cast(self.warmup_steps, tf.float32)\n            warmup_percent_done = global_step_float / warmup_steps_float\n            warmup_learning_rate = self.initial_learning_rate * tf.math.pow(warmup_percent_done, self.power)\n            return tf.cond(\n                global_step_float < warmup_steps_float,\n                lambda: warmup_learning_rate,\n                lambda: self.decay_schedule_fn(step),\n                name=name,\n            )\n\n    def get_config(self):\n        return {\n            \"initial_learning_rate\": self.initial_learning_rate,\n            \"decay_schedule_fn\": self.decay_schedule_fn,\n            \"warmup_steps\": self.warmup_steps,\n            \"power\": self.power,\n            \"name\": self.name,\n        }\n\n\ndef create_optimizer(init_lr, num_train_steps, num_warmup_steps, end_lr=0.0, optimizer_type=\"adamw\"):\n    \"\"\"Creates an optimizer with learning rate schedule.\"\"\"\n    # Implements linear decay of the learning rate.\n    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n        initial_learning_rate=init_lr, decay_steps=num_train_steps, end_learning_rate=end_lr,\n    )\n    if num_warmup_steps:\n        lr_schedule = WarmUp(\n            initial_learning_rate=init_lr, decay_schedule_fn=lr_schedule, warmup_steps=num_warmup_steps,\n        )\n\n    optimizer = AdamWeightDecay(\n        learning_rate=lr_schedule,\n        weight_decay_rate=0.01,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-6,\n        exclude_from_weight_decay=[\"LayerNorm\", \"layer_norm\", \"bias\"],\n    )\n\n    return optimizer\n\n\nclass AdamWeightDecay(tf.keras.optimizers.Adam):\n    \"\"\"Adam enables L2 weight decay and clip_by_global_norm on gradients.\n  Just adding the square of the weights to the loss function is *not* the\n  correct way of using L2 
regularization/weight decay with Adam, since that will\n  interact with the m and v parameters in strange ways.\n  Instead we want ot decay the weights in a manner that doesn't interact with\n  the m/v parameters. This is equivalent to adding the square of the weights to\n  the loss with plain (non-momentum) SGD.\n  \"\"\"\n\n    def __init__(\n        self,\n        learning_rate=0.001,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-7,\n        amsgrad=False,\n        weight_decay_rate=0.0,\n        include_in_weight_decay=None,\n        exclude_from_weight_decay=None,\n        name=\"AdamWeightDecay\",\n        **kwargs\n    ):\n        super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)\n        self.weight_decay_rate = weight_decay_rate\n        self._include_in_weight_decay = include_in_weight_decay\n        self._exclude_from_weight_decay = exclude_from_weight_decay\n\n    @classmethod\n    def from_config(cls, config):\n        \"\"\"Creates an optimizer from its config with WarmUp custom object.\"\"\"\n        custom_objects = {\"WarmUp\": WarmUp}\n        return super(AdamWeightDecay, cls).from_config(config, custom_objects=custom_objects)\n\n    def _prepare_local(self, var_device, var_dtype, apply_state):\n        super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, apply_state)\n        apply_state[(var_device, var_dtype)][\"weight_decay_rate\"] = tf.constant(\n            self.weight_decay_rate, name=\"adam_weight_decay_rate\"\n        )\n\n    def _decay_weights_op(self, var, learning_rate, apply_state):\n        do_decay = self._do_use_weight_decay(var.name)\n        if do_decay:\n            return var.assign_sub(\n                learning_rate * var * apply_state[(var.device, var.dtype.base_dtype)][\"weight_decay_rate\"],\n                use_locking=self._use_locking,\n            )\n        return tf.no_op()\n\n    def apply_gradients(self, grads_and_vars, name=None):\n        grads, tvars = list(zip(*grads_and_vars))\n        return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars), name=name,)\n\n    def _get_lr(self, var_device, var_dtype, apply_state):\n        \"\"\"Retrieves the learning rate with the given state.\"\"\"\n        if apply_state is None:\n            return self._decayed_lr_t[var_dtype], {}\n\n        apply_state = apply_state or {}\n        coefficients = apply_state.get((var_device, var_dtype))\n        if coefficients is None:\n            coefficients = self._fallback_apply_state(var_device, var_dtype)\n            apply_state[(var_device, var_dtype)] = coefficients\n\n        return coefficients[\"lr_t\"], dict(apply_state=apply_state)\n\n    def _resource_apply_dense(self, grad, var, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_dense(grad, var, **kwargs)\n\n    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_sparse(grad, var, indices, **kwargs)\n\n    def get_config(self):\n        config = super().get_config()\n        config.update({\"weight_decay_rate\": 
self.weight_decay_rate})\n        return config\n\n    def _do_use_weight_decay(self, param_name):\n        \"\"\"Whether to use L2 weight decay for `param_name`.\"\"\"\n        if self.weight_decay_rate == 0:\n            return False\n\n        if self._include_in_weight_decay:\n            for r in self._include_in_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return True\n\n        if self._exclude_from_weight_decay:\n            for r in self._exclude_from_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return False\n        return True\n\n\n# Extracted from https://github.com/OpenNMT/OpenNMT-tf/blob/master/opennmt/optimizers/utils.py\nclass GradientAccumulator(object):\n    \"\"\"Gradient accumulation utility.\n  When used with a distribution strategy, the accumulator should be called in a\n  replica context. Gradients will be accumulated locally on each replica and\n  without synchronization. Users should then call ``.gradients``, scale the\n  gradients if required, and pass the result to ``apply_gradients``.\n  \"\"\"\n\n    # We use the ON_READ synchronization policy so that no synchronization is\n    # performed on assignment. To get the value, we call .value() which returns the\n    # value on the current replica without synchronization.\n\n    def __init__(self):\n        \"\"\"Initializes the accumulator.\"\"\"\n        self._gradients = []\n        self._accum_steps = None\n\n    @property\n    def step(self):\n        \"\"\"Number of accumulated steps.\"\"\"\n        if self._accum_steps is None:\n            self._accum_steps = tf.Variable(\n                tf.constant(0, dtype=tf.int64),\n                trainable=False,\n                synchronization=tf.VariableSynchronization.ON_READ,\n                aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n            )\n\n        return self._accum_steps.value()\n\n    @property\n    def gradients(self):\n        \"\"\"The accumulated gradients on the current replica.\"\"\"\n        if not self._gradients:\n            raise ValueError(\"The accumulator should be called first to initialize the gradients\")\n        return list(gradient.value() if gradient is not None else gradient for gradient in self._gradients)\n\n    def __call__(self, gradients):\n        \"\"\"Accumulates :obj:`gradients` on the current replica.\"\"\"\n        if not self._gradients:\n            _ = self.step  # Create the step variable.\n            self._gradients.extend(\n                [\n                    tf.Variable(\n                        tf.zeros_like(gradient),\n                        trainable=False,\n                        synchronization=tf.VariableSynchronization.ON_READ,\n                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n                    )\n                    if gradient is not None\n                    else gradient\n                    for gradient in gradients\n                ]\n            )\n        if len(gradients) != len(self._gradients):\n            raise ValueError(\"Expected %s gradients, but got %d\" % (len(self._gradients), len(gradients)))\n\n        for accum_gradient, gradient in zip(self._gradients, gradients):\n            if accum_gradient is not None and gradient is not None:\n                accum_gradient.assign_add(gradient)\n\n        self._accum_steps.assign_add(1)\n\n    def reset(self):\n        \"\"\"Resets the accumulated gradients on the current replica.\"\"\"\n        
if not self._gradients:\n            return\n        self._accum_steps.assign(0)\n        for gradient in self._gradients:\n            if gradient is not None:\n                gradient.assign(tf.zeros_like(gradient))\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/pipelines.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport csv\nimport json\nimport logging\nimport os\nimport pickle\nimport sys\nfrom abc import ABC, abstractmethod\nfrom contextlib import contextmanager\nfrom itertools import chain\nfrom os.path import abspath, exists\nfrom typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union\n\nimport numpy as np\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .data import SquadExample, squad_convert_examples_to_features\nfrom .file_utils import is_tf_available, is_torch_available\nfrom .modelcard import ModelCard\nfrom .tokenization_auto import AutoTokenizer\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nif is_tf_available():\n    import tensorflow as tf\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelForTokenClassification,\n        TFAutoModelWithLMHead,\n    )\n\nif is_torch_available():\n    import torch\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelForTokenClassification,\n        AutoModelWithLMHead,\n    )\n\nif TYPE_CHECKING:\n    from .modeling_utils import PreTrainedModel\n    from .modeling_tf_utils import TFPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_framework(model=None):\n    \"\"\" Select framework (TensorFlow/PyTorch) to use.\n        If both frameworks are installed and no specific model is provided, defaults to using PyTorch.\n    \"\"\"\n    if is_tf_available() and is_torch_available() and model is not None and not isinstance(model, str):\n        # Both framework are available but the user supplied a model class instance.\n        # Try to guess which framework to use from the model classname\n        framework = \"tf\" if model.__class__.__name__.startswith(\"TF\") else \"pt\"\n    elif not is_tf_available() and not is_torch_available():\n        raise RuntimeError(\n            \"At least one of TensorFlow 2.0 or PyTorch should be installed. 
\"\n            \"To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ \"\n            \"To install PyTorch, read the instructions at https://pytorch.org/.\"\n        )\n    else:\n        # framework = 'tf' if is_tf_available() else 'pt'\n        framework = \"pt\" if is_torch_available() else \"tf\"\n    return framework\n\n\nclass ArgumentHandler(ABC):\n    \"\"\"\n    Base interface for handling varargs for each Pipeline\n    \"\"\"\n\n    @abstractmethod\n    def __call__(self, *args, **kwargs):\n        raise NotImplementedError()\n\n\nclass DefaultArgumentHandler(ArgumentHandler):\n    \"\"\"\n    Default varargs argument parser handling parameters for each Pipeline\n    \"\"\"\n\n    @staticmethod\n    def handle_kwargs(kwargs: Dict) -> List:\n        if len(kwargs) == 1:\n            output = list(kwargs.values())\n        else:\n            output = list(chain(kwargs.values()))\n\n        return DefaultArgumentHandler.handle_args(output)\n\n    @staticmethod\n    def handle_args(args: Sequence[Any]) -> List[str]:\n\n        # Only one argument, let's do case by case\n        if len(args) == 1:\n            if isinstance(args[0], str):\n                return [args[0]]\n            elif not isinstance(args[0], list):\n                return list(args)\n            else:\n                return args[0]\n\n        # Multiple arguments (x1, x2, ...)\n        elif len(args) > 1:\n            if all([isinstance(arg, str) for arg in args]):\n                return list(args)\n\n            # If not instance of list, then it should instance of iterable\n            elif isinstance(args, Iterable):\n                return list(chain.from_iterable(chain(args)))\n            else:\n                raise ValueError(\n                    \"Invalid input type {}. 
Pipeline supports Union[str, Iterable[str]]\".format(type(args))\n                )\n        else:\n            return []\n\n    def __call__(self, *args, **kwargs):\n        if len(kwargs) > 0 and len(args) > 0:\n            raise ValueError(\"Pipeline cannot handle mixed args and kwargs\")\n\n        if len(kwargs) > 0:\n            return DefaultArgumentHandler.handle_kwargs(kwargs)\n        else:\n            return DefaultArgumentHandler.handle_args(args)\n\n\nclass PipelineDataFormat:\n    \"\"\"\n    Base class for all the pipeline supported data format both for reading and writing.\n    Supported data formats currently includes:\n     - JSON\n     - CSV\n     - stdin/stdout (pipe)\n\n    PipelineDataFormat also includes some utilities to work with multi-columns like mapping from datasets columns\n    to pipelines keyword arguments through the `dataset_kwarg_1=dataset_column_1` format.\n    \"\"\"\n\n    SUPPORTED_FORMATS = [\"json\", \"csv\", \"pipe\"]\n\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        self.output_path = output_path\n        self.input_path = input_path\n        self.column = column.split(\",\") if column is not None else [\"\"]\n        self.is_multi_columns = len(self.column) > 1\n\n        if self.is_multi_columns:\n            self.column = [tuple(c.split(\"=\")) if \"=\" in c else (c, c) for c in self.column]\n\n        if output_path is not None and not overwrite:\n            if exists(abspath(self.output_path)):\n                raise OSError(\"{} already exists on disk\".format(self.output_path))\n\n        if input_path is not None:\n            if not exists(abspath(self.input_path)):\n                raise OSError(\"{} doesnt exist on disk\".format(self.input_path))\n\n    @abstractmethod\n    def __iter__(self):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def save(self, data: dict):\n        \"\"\"\n        Save the provided data object with the representation for the current `DataFormat`.\n        :param data: data to store\n        :return:\n        \"\"\"\n        raise NotImplementedError()\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        \"\"\"\n        Save the provided data object as a pickle-formatted binary data on the disk.\n        :param data: data to store\n        :return: (str) Path where the data has been saved\n        \"\"\"\n        path, _ = os.path.splitext(self.output_path)\n        binary_path = os.path.extsep.join((path, \"pickle\"))\n\n        with open(binary_path, \"wb+\") as f_output:\n            pickle.dump(data, f_output)\n\n        return binary_path\n\n    @staticmethod\n    def from_str(\n        format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        if format == \"json\":\n            return JsonPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"csv\":\n            return CsvPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"pipe\":\n            return PipedPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        else:\n            raise KeyError(\"Unknown reader {} (Available reader are json/csv/pipe)\".format(format))\n\n\nclass CsvPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], 
overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n    def __iter__(self):\n        with open(self.input_path, \"r\") as f:\n            reader = csv.DictReader(f)\n            for row in reader:\n                if self.is_multi_columns:\n                    yield {k: row[c] for k, c in self.column}\n                else:\n                    yield row[self.column[0]]\n\n    def save(self, data: List[dict]):\n        with open(self.output_path, \"w\") as f:\n            if len(data) > 0:\n                writer = csv.DictWriter(f, list(data[0].keys()))\n                writer.writeheader()\n                writer.writerows(data)\n\n\nclass JsonPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n        with open(input_path, \"r\") as f:\n            self._entries = json.load(f)\n\n    def __iter__(self):\n        for entry in self._entries:\n            if self.is_multi_columns:\n                yield {k: entry[c] for k, c in self.column}\n            else:\n                yield entry[self.column[0]]\n\n    def save(self, data: dict):\n        with open(self.output_path, \"w\") as f:\n            json.dump(data, f)\n\n\nclass PipedPipelineDataFormat(PipelineDataFormat):\n    \"\"\"\n    Read data from piped input to the python process.\n    For multi columns data, columns should separated by \\t\n\n    If columns are provided, then the output will be a dictionary with {column_x: value_x}\n    \"\"\"\n\n    def __iter__(self):\n        for line in sys.stdin:\n            # Split for multi-columns\n            if \"\\t\" in line:\n\n                line = line.split(\"\\t\")\n                if self.column:\n                    # Dictionary to map arguments\n                    yield {kwargs: l for (kwargs, _), l in zip(self.column, line)}\n                else:\n                    yield tuple(line)\n\n            # No dictionary to map arguments\n            else:\n                yield line\n\n    def save(self, data: dict):\n        print(data)\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        if self.output_path is None:\n            raise KeyError(\n                \"When using piped input on pipeline outputting large object requires an output file path. \"\n                \"Please provide such output path through --output argument.\"\n            )\n\n        return super().save_binary(data)\n\n\nclass _ScikitCompat(ABC):\n    \"\"\"\n    Interface layer for the Scikit and Keras compatibility.\n    \"\"\"\n\n    @abstractmethod\n    def transform(self, X):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def predict(self, X):\n        raise NotImplementedError()\n\n\nclass Pipeline(_ScikitCompat):\n    \"\"\"\n    The Pipeline class is the class from which all pipelines inherit. Refer to this class for methods shared across\n    different pipelines.\n\n    Base class implementing pipelined operations.\n    Pipeline workflow is defined as a sequence of the following operations:\n        Input -> Tokenization -> Model Inference -> Post-Processing (Task dependent) -> Output\n\n    Pipeline supports running on CPU or GPU through the device argument. 
Users can specify\n    device argument as an integer, -1 meaning \"CPU\", >= 0 referring the CUDA device ordinal.\n\n    Some pipeline, like for instance FeatureExtractionPipeline ('feature-extraction') outputs large\n    tensor object as nested-lists. In order to avoid dumping such large structure as textual data we\n    provide the binary_output constructor argument. If set to True, the output will be stored in the\n    pickle format.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n        binary_output (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Flag indicating if the output the pipeline should happen in a binary format (i.e. 
pickle) or as raw text.\n\n    Return:\n        :obj:`List` or :obj:`Dict`:\n        Pipeline returns list or dictionary depending on:\n\n         - Whether the user supplied multiple samples\n         - Whether the pipeline exposes multiple fields in the output object\n    \"\"\"\n\n    default_input_names = None\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        task: str = \"\",\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n    ):\n\n        if framework is None:\n            framework = get_framework()\n\n        self.model = model\n        self.tokenizer = tokenizer\n        self.modelcard = modelcard\n        self.framework = framework\n        self.device = device if framework == \"tf\" else torch.device(\"cpu\" if device < 0 else \"cuda:{}\".format(device))\n        self.binary_output = binary_output\n        self._args_parser = args_parser or DefaultArgumentHandler()\n\n        # Special handling\n        if self.framework == \"pt\" and self.device.type == \"cuda\":\n            self.model = self.model.to(self.device)\n\n        # Update config with task specific parameters\n        task_specific_params = self.model.config.task_specific_params\n        if task_specific_params is not None and task in task_specific_params:\n            self.model.config.update(task_specific_params.get(task))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save the pipeline's model and tokenizer to the specified save_directory\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Provided path ({}) should be a directory\".format(save_directory))\n            return\n\n        self.model.save_pretrained(save_directory)\n        self.tokenizer.save_pretrained(save_directory)\n        if self.modelcard is not None:\n            self.modelcard.save_pretrained(save_directory)\n\n    def transform(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    def predict(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. 
This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    @contextmanager\n    def device_placement(self):\n        \"\"\"\n        Context Manager allowing tensor allocation on the user-specified device in framework agnostic way.\n        example:\n            # Explicitly ask for tensor allocation on CUDA device :0\n            nlp = pipeline(..., device=0)\n            with nlp.device_placement():\n                # Every framework specific tensor allocation will be done on the request device\n                output = nlp(...)\n        Returns:\n            Context manager\n        \"\"\"\n        if self.framework == \"tf\":\n            with tf.device(\"/CPU:0\" if self.device == -1 else \"/device:GPU:{}\".format(self.device)):\n                yield\n        else:\n            if self.device.type == \"cuda\":\n                torch.cuda.set_device(self.device)\n\n            yield\n\n    def ensure_tensor_on_device(self, **inputs):\n        \"\"\"\n        Ensure PyTorch tensors are on the specified device.\n        :param inputs:\n        :return:\n        \"\"\"\n        return {name: tensor.to(self.device) for name, tensor in inputs.items()}\n\n    def _parse_and_tokenize(self, *args, pad_to_max_length=True, add_special_tokens=True, **kwargs):\n        \"\"\"\n        Parse arguments and tokenize\n        \"\"\"\n        # Parse arguments\n        inputs = self._args_parser(*args, **kwargs)\n        inputs = self.tokenizer.batch_encode_plus(\n            inputs,\n            add_special_tokens=add_special_tokens,\n            return_tensors=self.framework,\n            pad_to_max_length=pad_to_max_length,\n        )\n\n        return inputs\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        return self._forward(inputs)\n\n    def _forward(self, inputs, return_tensors=False):\n        \"\"\"\n        Internal framework specific forward dispatching.\n        Args:\n            inputs: dict holding all the keyworded arguments for required by the model forward method.\n            return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array.\n        Returns:\n            Numpy array\n        \"\"\"\n        # Encode for forward\n        with self.device_placement():\n            if self.framework == \"tf\":\n                # TODO trace model\n                predictions = self.model(inputs.data, training=False)[0]\n            else:\n                with torch.no_grad():\n                    inputs = self.ensure_tensor_on_device(**inputs)\n                    predictions = self.model(**inputs)[0].cpu()\n\n        if return_tensors:\n            return predictions\n        else:\n            return predictions.numpy()\n\n\nclass FeatureExtractionPipeline(Pipeline):\n    \"\"\"\n    Feature extraction pipeline using Model head. This pipeline extracts the hidden states from the base transformer,\n    which can be used as features in downstream tasks.\n\n    This feature extraction pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"feature-extraction\", for extracting features of a sequence.\n\n    All models may be used for this pipeline. 
See a list of all models, including community-contributed models on\n    `huggingface.co/models <https://huggingface.co/models>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n    def __call__(self, *args, **kwargs):\n        return super().__call__(*args, **kwargs).tolist()\n\n\nclass TextGenerationPipeline(Pipeline):\n    \"\"\"\n    Language generation pipeline using any ModelWithLMHead head. This pipeline predicts the words that will follow a specified text prompt.\n\n    This language generation pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"text-generation\", for generating text from a specified prompt.\n\n    The models that this pipeline can use are models that have been trained with an autoregressive language modeling objective,\n    which includes the uni-directional models in the library (e.g. 
gpt2).\n    See the list of available community models on\n    `huggingface.co/models <https://huggingface.co/models?search=&filter=lm-head>`__.\n    \"\"\"\n\n    # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia\n    # in https://github.com/rusiaaman/XLNet-gen#methodology\n    # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e\n\n    PADDING_TEXT = \"\"\"In 1991, the remains of Russian Tsar Nicholas II and his family\n    (except for Alexei and Maria) are discovered.\n    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the\n    remainder of the story. 1883 Western Siberia,\n    a young Grigori Rasputin is asked by his father and a group of men to perform magic.\n    Rasputin has a vision and denounces one of the men as a horse thief. Although his\n    father initially slaps him for making such an accusation, Rasputin watches as the\n    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of\n    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,\n    with people, even a bishop, begging for his blessing. <eod> </s> <eos>\"\"\"\n\n    ALLOWED_MODELS = [\n        \"XLNetLMHeadModel\",\n        \"TransfoXLLMHeadModel\",\n        \"ReformerModelWithLMHead\",\n        \"GPT2LMHeadModel\",\n        \"OpenAIGPTLMHeadModel\",\n        \"CTRLLMHeadModel\",\n        \"TFXLNetLMHeadModel\",\n        \"TFTransfoXLLMHeadModel\",\n        \"TFGPT2LMHeadModel\",\n        \"TFOpenAIGPTLMHeadModel\",\n        \"TFCTRLLMHeadModel\",\n    ]\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        if self.model.__class__.__name__ not in self.ALLOWED_MODELS:\n            raise NotImplementedError(\n                \"Generation is currently not supported for {}. Please select a model from {} for generation.\".format(\n                    self.model.__class__.__name__, self.ALLOWED_MODELS\n                )\n            )\n\n        text_inputs = self._args_parser(*args)\n\n        results = []\n        for prompt_text in text_inputs:\n            # Manage correct placement of the tensors\n            with self.device_placement():\n                if self.model.__class__.__name__ in [\"XLNetLMHeadModel\", \"TransfoXLLMHeadModel\"]:\n                    inputs = self._parse_and_tokenize(\n                        self.PADDING_TEXT + prompt_text, pad_to_max_length=False, add_special_tokens=False\n                    )\n                else:\n                    inputs = self._parse_and_tokenize(prompt_text, pad_to_max_length=False, add_special_tokens=False)\n\n                # set input_ids to None to allow empty prompt\n                if inputs[\"input_ids\"].shape[-1] == 0:\n                    inputs[\"input_ids\"] = None\n                    inputs[\"attention_mask\"] = None\n\n                if self.framework == \"pt\" and inputs[\"input_ids\"] is not None:\n                    inputs = self.ensure_tensor_on_device(**inputs)\n\n                input_ids = inputs[\"input_ids\"]\n\n                # Ensure that batch size = 1 (batch generation not allowed for now)\n                assert (\n                    input_ids is None or input_ids.shape[0] == 1\n                ), \"Batch generation is currently not supported. 
See https://github.com/huggingface/transformers/issues/3021 for more information.\"\n\n                output_sequences = self.model.generate(input_ids=input_ids, **generate_kwargs)  # BS x SL\n\n            result = []\n            for generated_sequence in output_sequences:\n                generated_sequence = generated_sequence.numpy().tolist()\n                record = {}\n                if return_tensors:\n                    record[\"generated_token_ids\"] = generated_sequence\n                if return_text:\n                    # Decode text\n                    text = self.tokenizer.decode(\n                        generated_sequence,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n\n                    # Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used\n                    if input_ids is None:\n                        prompt_length = 0\n                    else:\n                        prompt_length = len(\n                            self.tokenizer.decode(\n                                input_ids[0],\n                                skip_special_tokens=True,\n                                clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                            )\n                        )\n\n                    record[\"generated_text\"] = prompt_text + text[prompt_length:]\n\n                result.append(record)\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n\n        return results\n\n\nclass TextClassificationPipeline(Pipeline):\n    \"\"\"\n    Text classification pipeline using ModelForSequenceClassification head. See the\n    `sequence classification usage <../usage.html#sequence-classification>`__ examples for more information.\n\n    This text classification pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"sentiment-analysis\", for classifying sequences according to positive or negative sentiments.\n\n    The models that this pipeline can use are models that have been fine-tuned on a sequence classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=text-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. 
If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        outputs = super().__call__(*args, **kwargs)\n        scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)\n        return [{\"label\": self.model.config.id2label[item.argmax()], \"score\": item.max().item()} for item in scores]\n\n\nclass FillMaskPipeline(Pipeline):\n    \"\"\"\n    Masked language modeling prediction pipeline using ModelWithLMHead head. See the\n    `masked language modeling usage <../usage.html#masked-language-modeling>`__ examples for more information.\n\n    This mask filling pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"fill-mask\", for predicting masked tokens in a sequence.\n\n    The models that this pipeline can use are models that have been trained with a masked language modeling objective,\n    which includes the bi-directional models in the library.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=lm-head>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        topk=5,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n        self.topk = topk\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        outputs = self._forward(inputs, return_tensors=True)\n\n        results = []\n        batch_size = outputs.shape[0] if self.framework == \"tf\" else outputs.size(0)\n\n        for i in range(batch_size):\n            input_ids = inputs[\"input_ids\"][i]\n            result = []\n\n            if self.framework == \"tf\":\n                masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy().item()\n                logits = outputs[i, masked_index, :]\n                probs = tf.nn.softmax(logits)\n                topk = tf.math.top_k(probs, k=self.topk)\n                values, predictions = topk.values.numpy(), topk.indices.numpy()\n            else:\n                masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()\n                logits = outputs[i, masked_index, :]\n                probs = logits.softmax(dim=0)\n                values, predictions = probs.topk(self.topk)\n\n            for v, p in zip(values.tolist(), predictions.tolist()):\n                tokens = input_ids.numpy()\n                tokens[masked_index] = p\n                # Filter padding out:\n                tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]\n                result.append({\"sequence\": self.tokenizer.decode(tokens), \"score\": v, \"token\": p})\n\n            # Append\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n        return results\n\n\nclass NerPipeline(Pipeline):\n    \"\"\"\n    Named Entity Recognition pipeline using ModelForTokenClassification head. See the\n    `named entity recognition usage <../usage.html#named-entity-recognition>`__ examples for more information.\n\n    This token recognition pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"ner\", for predicting the classes of tokens in a sequence: person, organisation, location or miscellaneous.\n\n    The models that this pipeline can use are models that have been fine-tuned on a token classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=token-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"sequences\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n        ignore_labels=[\"O\"],\n        task: str = \"\",\n        grouped_entities: bool = False,\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=binary_output,\n            task=task,\n        )\n\n        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)\n        self.ignore_labels = ignore_labels\n        self.grouped_entities = grouped_entities\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._args_parser(*args, **kwargs)\n        answers = []\n        for sentence in inputs:\n\n            # Manage correct placement of the tensors\n            with self.device_placement():\n\n                tokens = self.tokenizer.encode_plus(\n                    sentence,\n                    return_attention_mask=False,\n                    return_tensors=self.framework,\n                    max_length=self.tokenizer.max_len,\n                )\n\n                # Forward\n                if self.framework == \"tf\":\n                    entities = self.model(tokens.data)[0][0].numpy()\n                    input_ids = tokens[\"input_ids\"].numpy()[0]\n                else:\n                    with torch.no_grad():\n                        tokens = self.ensure_tensor_on_device(**tokens)\n                        entities = self.model(**tokens)[0][0].cpu().numpy()\n                        input_ids = tokens[\"input_ids\"].cpu().numpy()[0]\n\n            score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)\n            labels_idx = score.argmax(axis=-1)\n\n           
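 # `score` now holds softmax-normalized label probabilities per token; `labels_idx` is the argmax label id for each token\n           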
 entities = []\n            entity_groups = []\n            entity_group_disagg = []\n            # Filter to labels not in `self.ignore_labels`\n            filtered_labels_idx = [\n                (idx, label_idx)\n                for idx, label_idx in enumerate(labels_idx)\n                if self.model.config.id2label[label_idx] not in self.ignore_labels\n            ]\n\n            for idx, label_idx in filtered_labels_idx:\n\n                entity = {\n                    \"word\": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),\n                    \"score\": score[idx][label_idx].item(),\n                    \"entity\": self.model.config.id2label[label_idx],\n                    \"index\": idx,\n                }\n                last_idx, _ = filtered_labels_idx[-1]\n                if self.grouped_entities:\n                    if not entity_group_disagg:\n                        entity_group_disagg += [entity]\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                        continue\n\n                    # If the current entity is similar and adjacent to the previous entity, append it to the disaggregated entity group\n                    if (\n                        entity[\"entity\"] == entity_group_disagg[-1][\"entity\"]\n                        and entity[\"index\"] == entity_group_disagg[-1][\"index\"] + 1\n                    ):\n                        entity_group_disagg += [entity]\n                        # Group the entities at the last entity\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                    # If the current entity is different from the previous entity, aggregate the disaggregated entity group\n                    else:\n                        entity_groups += [self.group_entities(entity_group_disagg)]\n                        entity_group_disagg = [entity]\n\n                entities += [entity]\n\n            # Append\n            if self.grouped_entities:\n                answers += [entity_groups]\n            else:\n                answers += [entities]\n\n        if len(answers) == 1:\n            return answers[0]\n        return answers\n\n    def group_entities(self, entities):\n        \"\"\"\n        Returns grouped entities\n        \"\"\"\n        # Get the last entity in the entity group\n        entity = entities[-1][\"entity\"]\n        scores = np.mean([entity[\"score\"] for entity in entities])\n        tokens = [entity[\"word\"] for entity in entities]\n\n        entity_group = {\n            \"entity_group\": entity,\n            \"score\": np.mean(scores),\n            \"word\": self.tokenizer.convert_tokens_to_string(tokens),\n        }\n        return entity_group\n\n\nTokenClassificationPipeline = NerPipeline\n\n\nclass QuestionAnsweringArgumentHandler(ArgumentHandler):\n    \"\"\"\n    QuestionAnsweringPipeline requires the user to provide multiple arguments (i.e. 
question & context) to be mapped\n    to internal SquadExample / SquadFeature structures.\n\n    QuestionAnsweringArgumentHandler manages all the possible ways to create a SquadExample from the command-line supplied\n    arguments.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        # Position args, handling is sensibly the same as X and data, so forwarding to avoid duplicating\n        if args is not None and len(args) > 0:\n            if len(args) == 1:\n                kwargs[\"X\"] = args[0]\n            else:\n                kwargs[\"X\"] = list(args)\n\n        # Generic compatibility with sklearn and Keras\n        # Batched data\n        if \"X\" in kwargs or \"data\" in kwargs:\n            inputs = kwargs[\"X\"] if \"X\" in kwargs else kwargs[\"data\"]\n\n            if isinstance(inputs, dict):\n                inputs = [inputs]\n            else:\n                # Copy to avoid overriding arguments\n                inputs = [i for i in inputs]\n\n            for i, item in enumerate(inputs):\n                if isinstance(item, dict):\n                    if any(k not in item for k in [\"question\", \"context\"]):\n                        raise KeyError(\"You need to provide a dictionary with keys {question:..., context:...}\")\n\n                    inputs[i] = QuestionAnsweringPipeline.create_sample(**item)\n\n                elif not isinstance(item, SquadExample):\n                    raise ValueError(\n                        \"{} argument needs to be of type (list[SquadExample | dict], SquadExample, dict)\".format(\n                            \"X\" if \"X\" in kwargs else \"data\"\n                        )\n                    )\n\n            # Tabular input\n        elif \"question\" in kwargs and \"context\" in kwargs:\n            if isinstance(kwargs[\"question\"], str):\n                kwargs[\"question\"] = [kwargs[\"question\"]]\n\n            if isinstance(kwargs[\"context\"], str):\n                kwargs[\"context\"] = [kwargs[\"context\"]]\n\n            inputs = [\n                QuestionAnsweringPipeline.create_sample(q, c) for q, c in zip(kwargs[\"question\"], kwargs[\"context\"])\n            ]\n        else:\n            raise ValueError(\"Unknown arguments {}\".format(kwargs))\n\n        if not isinstance(inputs, list):\n            inputs = [inputs]\n\n        return inputs\n\n\nclass QuestionAnsweringPipeline(Pipeline):\n    \"\"\"\n    Question Answering pipeline using ModelForQuestionAnswering head. See the\n    `question answering usage <../usage.html#question-answering>`__ examples for more information.\n\n    This question answering pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"question-answering\", for answering questions given a context.\n\n    The models that this pipeline can use are models that have been fine-tuned on a question answering task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=question-answering>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"question,context\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        device: int = -1,\n        task: str = \"\",\n        **kwargs\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=QuestionAnsweringArgumentHandler(),\n            device=device,\n            task=task,\n            **kwargs,\n        )\n\n    @staticmethod\n    def create_sample(\n        question: Union[str, List[str]], context: Union[str, List[str]]\n    ) -> Union[SquadExample, List[SquadExample]]:\n        \"\"\"\n        QuestionAnsweringPipeline leverages the SquadExample/SquadFeatures internally.\n        This helper method encapsulates all the logic for converting question(s) and context(s) to SquadExample(s).\n        We currently support extractive question answering.\n        Arguments:\n             question: (str, List[str]) The question to be asked for the associated context\n             context: (str, List[str]) The context in which we will look for the answer.\n\n        Returns:\n            SquadExample initialized with the corresponding question and context.\n        \"\"\"\n        if isinstance(question, list):\n            return [SquadExample(None, q, c, None, None, None) for q, c in zip(question, context)]\n        else:\n            return SquadExample(None, question, context, None, None, None)\n\n    def __call__(self, *args, **kwargs):\n        \"\"\"\n        Args:\n            We support multiple use-cases, the following are exclusive:\n            X: sequence of SquadExample\n            data: sequence of SquadExample\n            question: (str, List[str]), batch of question(s) to map along with context\n            context: (str, List[str]), batch of context(s) associated with the provided question keyword argument\n        
Returns:\n            dict: {'answer': str, 'score': float, 'start': int, 'end': int}\n            answer: the textual answer in the initial context\n            score: the score the model assigned to the current answer\n            start: the character index in the original string corresponding to the beginning of the answer's span\n            end: the character index in the original string corresponding to the ending of the answer's span\n        \"\"\"\n        # Set default values\n        kwargs.setdefault(\"topk\", 1)\n        kwargs.setdefault(\"doc_stride\", 128)\n        kwargs.setdefault(\"max_answer_len\", 15)\n        kwargs.setdefault(\"max_seq_len\", 384)\n        kwargs.setdefault(\"max_question_len\", 64)\n        kwargs.setdefault(\"handle_impossible_answer\", False)\n\n        if kwargs[\"topk\"] < 1:\n            raise ValueError(\"topk parameter should be >= 1 (got {})\".format(kwargs[\"topk\"]))\n\n        if kwargs[\"max_answer_len\"] < 1:\n            raise ValueError(\"max_answer_len parameter should be >= 1 (got {})\".format(kwargs[\"max_answer_len\"]))\n\n        # Convert inputs to features\n        examples = self._args_parser(*args, **kwargs)\n        features_list = [\n            squad_convert_examples_to_features(\n                [example],\n                self.tokenizer,\n                kwargs[\"max_seq_len\"],\n                kwargs[\"doc_stride\"],\n                kwargs[\"max_question_len\"],\n                False,\n                tqdm_enabled=False,\n            )\n            for example in examples\n        ]\n        all_answers = []\n        for features, example in zip(features_list, examples):\n            model_input_names = self.tokenizer.model_input_names + [\"input_ids\"]\n            fw_args = {k: [feature.__dict__[k] for feature in features] for k in model_input_names}\n\n            # Manage tensor allocation on correct device\n            with self.device_placement():\n                if self.framework == \"tf\":\n                    fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}\n                    start, end = self.model(fw_args)\n                    start, end = start.numpy(), end.numpy()\n                else:\n                    with torch.no_grad():\n                        # Retrieve the score for the context tokens only (removing question tokens)\n                        fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}\n                        start, end = self.model(**fw_args)\n                        start, end = start.cpu().numpy(), end.cpu().numpy()\n\n            min_null_score = 1000000  # large and positive\n            answers = []\n            for (feature, start_, end_) in zip(features, start, end):\n                # Normalize logits and spans to retrieve the answer\n                start_ = np.exp(start_) / np.sum(np.exp(start_))\n                end_ = np.exp(end_) / np.sum(np.exp(end_))\n\n                # Mask padding and question\n                start_, end_ = (\n                    start_ * np.abs(np.array(feature.p_mask) - 1),\n                    end_ * np.abs(np.array(feature.p_mask) - 1),\n                )\n\n                if kwargs[\"handle_impossible_answer\"]:\n                    min_null_score = min(min_null_score, (start_[0] * end_[0]).item())\n\n                start_[0] = end_[0] = 0\n\n                starts, ends, scores = self.decode(start_, end_, kwargs[\"topk\"], kwargs[\"max_answer_len\"])\n                char_to_word = 
np.array(example.char_to_word_offset)\n\n                # Convert the answer (tokens) back to the original text\n                answers += [\n                    {\n                        \"score\": score.item(),\n                        \"start\": np.where(char_to_word == feature.token_to_orig_map[s])[0][0].item(),\n                        \"end\": np.where(char_to_word == feature.token_to_orig_map[e])[0][-1].item(),\n                        \"answer\": \" \".join(\n                            example.doc_tokens[feature.token_to_orig_map[s] : feature.token_to_orig_map[e] + 1]\n                        ),\n                    }\n                    for s, e, score in zip(starts, ends, scores)\n                ]\n\n            if kwargs[\"handle_impossible_answer\"]:\n                answers.append({\"score\": min_null_score, \"start\": 0, \"end\": 0, \"answer\": \"\"})\n\n            answers = sorted(answers, key=lambda x: x[\"score\"], reverse=True)[: kwargs[\"topk\"]]\n            all_answers += answers\n\n        if len(all_answers) == 1:\n            return all_answers[0]\n        return all_answers\n\n    def decode(self, start: np.ndarray, end: np.ndarray, topk: int, max_answer_len: int) -> Tuple:\n        \"\"\"\n        Take the output of any QuestionAnswering head and generate probabilities for each span to be\n        the actual answer.\n        In addition, it filters out some unwanted/impossible cases like answer len being greater than\n        max_answer_len or answer end position being before the starting position.\n        The method supports outputting the k-best answers through the topk argument.\n\n        Args:\n            start: numpy array, holding individual start probabilities for each token\n            end: numpy array, holding individual end probabilities for each token\n            topk: int, indicates how many possible answer span(s) to extract from the model's output\n            max_answer_len: int, maximum size of the answer to extract from the model's output\n        \"\"\"\n        # Ensure we have batch axis\n        if start.ndim == 1:\n            start = start[None]\n\n        if end.ndim == 1:\n            end = end[None]\n\n        # Compute the score of each tuple(start, end) to be the real answer\n        outer = np.matmul(np.expand_dims(start, -1), np.expand_dims(end, 1))\n\n        # Remove candidates with end < start or end - start > max_answer_len\n        candidates = np.tril(np.triu(outer), max_answer_len - 1)\n\n        #  Inspired by Chen et al. 
(https://github.com/facebookresearch/DrQA)\n        scores_flat = candidates.flatten()\n        if topk == 1:\n            idx_sort = [np.argmax(scores_flat)]\n        elif len(scores_flat) < topk:\n            idx_sort = np.argsort(-scores_flat)\n        else:\n            idx = np.argpartition(-scores_flat, topk)[0:topk]\n            idx_sort = idx[np.argsort(-scores_flat[idx])]\n\n        start, end = np.unravel_index(idx_sort, candidates.shape)[1:]\n        return start, end, candidates[0, start, end]\n\n    def span_to_answer(self, text: str, start: int, end: int):\n        \"\"\"\n        When decoding from token probabilities, this method maps token indexes to the actual words in\n        the initial context.\n\n        Args:\n            text: str, the actual context to extract the answer from\n            start: int, starting answer token index\n            end: int, ending answer token index\n\n        Returns:\n            dict: {'answer': str, 'start': int, 'end': int}\n        \"\"\"\n        words = []\n        token_idx = char_start_idx = char_end_idx = chars_idx = 0\n\n        for i, word in enumerate(text.split(\" \")):\n            token = self.tokenizer.tokenize(word)\n\n            # Append words if they are in the span\n            if start <= token_idx <= end:\n                if token_idx == start:\n                    char_start_idx = chars_idx\n\n                if token_idx == end:\n                    char_end_idx = chars_idx + len(word)\n\n                words += [word]\n\n            # Stop if we went over the end of the answer\n            if token_idx > end:\n                break\n\n            # Append the subtokenization length to the running index\n            token_idx += len(token)\n            chars_idx += len(word) + 1\n\n        # Join text with spaces\n        return {\n            \"answer\": \" \".join(words),\n            \"start\": max(0, char_start_idx),\n            \"end\": min(len(text), char_end_idx),\n        }\n\n\nclass SummarizationPipeline(Pipeline):\n    \"\"\"\n    Summarize news articles and other documents\n\n    Usage::\n\n        # use bart in pytorch\n        summarizer = pipeline(\"summarization\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n        # use t5 in tf\n        summarizer = pipeline(\"summarization\", model=\"t5-base\", tokenizer=\"t5-base\", framework=\"tf\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n    The models that this pipeline can use are models that have been fine-tuned on a summarization task,\n    which are currently '`bart-large-cnn`', '`t5-small`', '`t5-base`', '`t5-large`', '`t5-3b`', '`t5-11b`'.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=summarization>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *documents, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *documents: (list of strings) articles to be summarized\n            return_text: (bool, default=True) whether to add a decoded \"summary_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"summary_token_ids\" to each result\n\n            clean_up_tokenization_spaces: (`optional`) bool whether to include extra spaces in the output\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'summary_text' and/or 'summary_token_ids' for each document_to_summarize\n\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n        assert len(documents) > 0, \"Please provide a document to summarize\"\n\n        if self.framework == \"tf\" and \"BartForConditionalGeneration\" in self.model.__class__.__name__:\n            raise NotImplementedError(\n                \"Tensorflow is not yet supported for Bart. Please consider using T5, e.g. 
`t5-base`\"\n            )\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(documents[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n\n            documents = ([prefix + document for document in documents[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(documents[0], str):\n            documents = (prefix + documents[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `documents[0]`: {} have the wrong format. The should be either of type `str` or type `list`\".format(\n                    documents[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*documents, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            min_length = generate_kwargs.get(\"min_length\", self.model.config.min_length)\n            if input_length < min_length // 2:\n                logger.warning(\n                    \"Your min_length is set to {}, but you input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)\".format(\n                        min_length, input_length\n                    )\n                )\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length < max_length:\n                logger.warning(\n                    \"Your max_length is set to {}, but you input_length is only {}. You might consider decreasing max_length manually, e.g. 
summarizer('...', max_length=50)\".format(\n                        max_length, input_length\n                    )\n                )\n\n            summaries = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n\n            results = []\n            for summary in summaries:\n                record = {}\n                if return_tensors:\n                    record[\"summary_token_ids\"] = summary\n                if return_text:\n                    record[\"summary_text\"] = self.tokenizer.decode(\n                        summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\nclass TranslationPipeline(Pipeline):\n    \"\"\"\n    Translates from one language to another.\n\n    Usage::\n        en_fr_translator = pipeline(\"translation_en_to_fr\")\n        en_fr_translator(\"How old are you?\")\n\n    The models that this pipeline can use are models that have been fine-tuned on a translation task,\n    currently: \"t5-small\", \"t5-base\", \"t5-large\", \"t5-3b\", \"t5-11b\"\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=translation>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *args: (list of strings) texts to be translated\n            return_text: (bool, default=True) whether to add a decoded \"translation_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"translation_token_ids\" to each result\n\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'translation_text' and/or 'translation_token_ids' for each text_to_translate\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(args[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n            args = ([prefix + text for text in args[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(args[0], str):\n            args = (prefix + args[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `args[0]`: {} has the wrong format. It should be either of type `str` or type `list`\".format(\n                    args[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*args, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length > 0.9 * max_length:\n                logger.warning(\n                    \"Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. 
translator('...', max_length=400)\".format(\n                        input_length, max_length\n                    )\n                )\n\n            translations = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n            results = []\n            for translation in translations:\n                record = {}\n                if return_tensors:\n                    record[\"translation_token_ids\"] = translation\n                if return_text:\n                    record[\"translation_text\"] = self.tokenizer.decode(\n                        translation,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\n# Register all the supported tasks here\nSUPPORTED_TASKS = {\n    \"feature-extraction\": {\n        \"impl\": FeatureExtractionPipeline,\n        \"tf\": TFAutoModel if is_tf_available() else None,\n        \"pt\": AutoModel if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased\", \"tf\": \"distilbert-base-cased\"},\n            \"config\": None,\n            \"tokenizer\": \"distilbert-base-cased\",\n        },\n    },\n    \"sentiment-analysis\": {\n        \"impl\": TextClassificationPipeline,\n        \"tf\": TFAutoModelForSequenceClassification if is_tf_available() else None,\n        \"pt\": AutoModelForSequenceClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n                \"tf\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            },\n            \"config\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            \"tokenizer\": \"distilbert-base-uncased\",\n        },\n    },\n    \"ner\": {\n        \"impl\": NerPipeline,\n        \"tf\": TFAutoModelForTokenClassification if is_tf_available() else None,\n        \"pt\": AutoModelForTokenClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n                \"tf\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            },\n            \"config\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            \"tokenizer\": \"bert-large-cased\",\n        },\n    },\n    \"question-answering\": {\n        \"impl\": QuestionAnsweringPipeline,\n        \"tf\": TFAutoModelForQuestionAnswering if is_tf_available() else None,\n        \"pt\": AutoModelForQuestionAnswering if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased-distilled-squad\", \"tf\": \"distilbert-base-cased-distilled-squad\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilbert-base-cased\", {\"use_fast\": False}),\n        },\n    },\n    \"fill-mask\": {\n        \"impl\": FillMaskPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilroberta-base\", \"tf\": \"distilroberta-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilroberta-base\", {\"use_fast\": False}),\n        },\n    
},\n    \"summarization\": {\n        \"impl\": SummarizationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"facebook/bart-large-cnn\", \"tf\": \"t5-small\"}, \"config\": None, \"tokenizer\": None},\n    },\n    \"translation_en_to_fr\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_de\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_ro\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"text-generation\": {\n        \"impl\": TextGenerationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"gpt2\", \"tf\": \"gpt2\"}, \"config\": None, \"tokenizer\": \"gpt2\"},\n    },\n}\n\n\ndef pipeline(\n    task: str,\n    model: Optional = None,\n    config: Optional[Union[str, PretrainedConfig]] = None,\n    tokenizer: Optional[Union[str, PreTrainedTokenizer]] = None,\n    framework: Optional[str] = None,\n    **kwargs\n) -> Pipeline:\n    \"\"\"\n    Utility factory method to build a pipeline.\n\n    Pipeline are made of:\n\n        - A Tokenizer instance in charge of mapping raw textual input to token\n        - A Model instance\n        - Some (optional) post processing for enhancing model's output\n\n\n    Args:\n        task (:obj:`str`):\n            The task defining which pipeline will be returned. Currently accepted tasks are:\n\n            - \"feature-extraction\": will return a :class:`~transformers1.FeatureExtractionPipeline`\n            - \"sentiment-analysis\": will return a :class:`~transformers1.TextClassificationPipeline`\n            - \"ner\": will return a :class:`~transformers1.NerPipeline`\n            - \"question-answering\": will return a :class:`~transformers1.QuestionAnsweringPipeline`\n            - \"fill-mask\": will return a :class:`~transformers1.FillMaskPipeline`\n            - \"summarization\": will return a :class:`~transformers1.SummarizationPipeline`\n            - \"translation_xx_to_yy\": will return a :class:`~transformers1.TranslationPipeline`\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`,\n            a model identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        config (:obj:`str` or :obj:`~transformers1.PretrainedConfig`, `optional`, defaults to :obj:`None`):\n            The configuration that will be used by the pipeline to instantiate the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained model configuration inheriting from\n            :class:`~transformers1.PretrainedConfig`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n\n    Returns:\n        :class:`~transformers.Pipeline`: Class inheriting from :class:`~transformers1.Pipeline`, according to\n        the task.\n\n    Examples::\n\n        from transformers1 import pipeline, AutoModelForTokenClassification, AutoTokenizer\n\n        # Sentiment analysis pipeline\n        pipeline('sentiment-analysis')\n\n        # Question answering pipeline, specifying the checkpoint identifier\n        pipeline('question-answering', model='distilbert-base-cased-distilled-squad', tokenizer='bert-base-cased')\n\n        # Named entity recognition pipeline, passing in a specific model and tokenizer\n        model = AutoModelForTokenClassification.from_pretrained(\"dbmdz/bert-large-cased-finetuned-conll03-english\")\n        tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n        pipeline('ner', model=model, tokenizer=tokenizer)\n    \"\"\"\n    # Retrieve the task\n    if task not in SUPPORTED_TASKS:\n        raise KeyError(\"Unknown task {}, available tasks are {}\".format(task, list(SUPPORTED_TASKS.keys())))\n\n    framework = framework or get_framework(model)\n\n    targeted_task = SUPPORTED_TASKS[task]\n    task_class, model_class = targeted_task[\"impl\"], targeted_task[framework]\n\n    # Use default model/config/tokenizer for the task if no model is provided\n    if model is None:\n        models, config, tokenizer = [targeted_task[\"default\"][k] for k in [\"model\", \"config\", \"tokenizer\"]]\n        model = models[framework]\n\n    # Try to infer tokenizer from model or config name (if provided as str)\n    if tokenizer is None:\n        if isinstance(model, str) and model in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = model\n        elif isinstance(config, str) and config in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = config\n        else:\n            # Impossible to guest what is the right tokenizer here\n            
raise Exception(\n                \"Impossible to guess which tokenizer to use. \"\n                \"Please provided a PretrainedTokenizer class or a path/identifier to a pretrained tokenizer.\"\n            )\n\n    modelcard = None\n    # Try to infer modelcard from model or config name (if provided as str)\n    if isinstance(model, str):\n        modelcard = model\n    elif isinstance(config, str):\n        modelcard = config\n\n    # Instantiate tokenizer if needed\n    if isinstance(tokenizer, (str, tuple)):\n        if isinstance(tokenizer, tuple):\n            # For tuple we have (tokenizer name, {kwargs})\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], **tokenizer[1])\n        else:\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer)\n\n    # Instantiate config if needed\n    if isinstance(config, str):\n        config = AutoConfig.from_pretrained(config)\n\n    # Instantiate modelcard if needed\n    if isinstance(modelcard, str):\n        modelcard = ModelCard.from_pretrained(modelcard)\n\n    # Instantiate model if needed\n    if isinstance(model, str):\n        # Handle transparent TF/PT model conversion\n        model_kwargs = {}\n        if framework == \"pt\" and model.endswith(\".h5\"):\n            model_kwargs[\"from_tf\"] = True\n            logger.warning(\n                \"Model might be a TensorFlow model (ending with `.h5`) but TensorFlow is not available. \"\n                \"Trying to load the model with PyTorch.\"\n            )\n        elif framework == \"tf\" and model.endswith(\".bin\"):\n            model_kwargs[\"from_pt\"] = True\n            logger.warning(\n                \"Model might be a PyTorch model (ending with `.bin`) but PyTorch is not available. \"\n                \"Trying to load the model with Tensorflow.\"\n            )\n        model = model_class.from_pretrained(model, config=config, **model_kwargs)\n\n    return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for ALBERT model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-spiece.model\",\n        \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-spiece.model\",\n        \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-spiece.model\",\n        \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-spiece.model\",\n        \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model\",\n        \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model\",\n        \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model\",\n        \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"albert-base-v1\": 512,\n    \"albert-large-v1\": 512,\n    \"albert-xlarge-v1\": 512,\n    \"albert-xxlarge-v1\": 512,\n    \"albert-base-v2\": 512,\n    \"albert-large-v2\": 512,\n    \"albert-xlarge-v2\": 512,\n    \"albert-xxlarge-v2\": 512,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n\nclass AlbertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an ALBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The beginning of sequence token that was used during pre-training. 
Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"[CLS]\",\n        eos_token=\"[SEP]\",\n        unk_token=\"<unk>\",\n        sep_token=\"[SEP]\",\n        pad_token=\"<pad>\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. 
\"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An ALBERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return cls + token_ids_0 + sep\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An ALBERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Tokenizer class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_bart import BartTokenizer\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer\nfrom .tokenization_marian import MarianTokenizer\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import XLNetTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nTOKENIZER_MAPPING = OrderedDict(\n    [\n        (T5Config, (T5Tokenizer, None)),\n        (DistilBertConfig, (DistilBertTokenizer, DistilBertTokenizerFast)),\n        (AlbertConfig, (AlbertTokenizer, None)),\n        (CamembertConfig, (CamembertTokenizer, None)),\n        (XLMRobertaConfig, (XLMRobertaTokenizer, None)),\n        (MarianConfig, (MarianTokenizer, None)),\n        (BartConfig, (BartTokenizer, None)),\n        (LongformerConfig, (LongformerTokenizer, None)),\n        (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),\n        (ReformerConfig, (ReformerTokenizer, None)),\n        (ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),\n        (BertConfig, (BertTokenizer, BertTokenizerFast)),\n        (OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),\n        (GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),\n        (TransfoXLConfig, (TransfoXLTokenizer, TransfoXLTokenizerFast)),\n        (XLNetConfig, (XLNetTokenizer, None)),\n        (FlaubertConfig, (FlaubertTokenizer, None)),\n        (XLMConfig, (XLMTokenizer, 
None)),\n        (CTRLConfig, (CTRLTokenizer, None)),\n    ]\n)\n\n\nclass AutoTokenizer:\n    r\"\"\":class:`~transformers1.AutoTokenizer` is a generic tokenizer class\n        that will be instantiated as one of the tokenizer classes of the library\n        when created with the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct tokenizer class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        This class cannot be instantiated using `__init__()` (throw an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoTokenizer is designed to be instantiated \"\n            \"using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):\n        r\"\"\" Instantiate one of the tokenizer classes of the library\n        from a pre-trained model vocabulary.\n\n        The tokenizer class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert-base-japanese`: BertJapaneseTokenizer (Bert model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n      
          - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            use_fast: (`optional`) boolean, default False:\n                Indicate if transformers1 should try to load the fast version of the tokenizer (True) or use the Python one (False).\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. 
tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        if \"bert-base-japanese\" in pretrained_model_name_or_path:\n            return BertJapaneseTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        use_fast = kwargs.pop(\"use_fast\", False)\n        for config_class, (tokenizer_class_py, tokenizer_class_fast) in TOKENIZER_MAPPING.items():\n            if isinstance(config, config_class):\n                if tokenizer_class_fast and use_fast:\n                    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n                else:\n                    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} to build an AutoTokenizer.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, \", \".join(c.__name__ for c in TOKENIZER_MAPPING.keys())\n            )\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_bart_models = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n]\n\n\nclass BartTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = {m: 1024 for m in _all_bart_models}\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_bart_models},\n        \"merges_file\": {m: merges_url for m in _all_bart_models},\n    }\n\n\n_all_mbart_models = [\"facebook/mbart-large-en-ro\"]\nSPM_URL = \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model\"\n\n\nclass MBartTokenizer(XLMRobertaTokenizer):\n    vocab_files_names = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n    max_model_input_sizes = {m: 1024 for m in _all_mbart_models}\n    pretrained_vocab_files_map = {\"vocab_file\": {m: SPM_URL for m in _all_mbart_models}}\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import List, Optional\n\nfrom tokenizers import BertWordPieceTokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt\",\n        \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n        \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt\",\n        \"bert-base-german-cased\": \"https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt\",\n        \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt\",\n        \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt\",\n        \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/vocab.txt\",\n        
\"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"bert-base-uncased\": 512,\n    \"bert-large-uncased\": 512,\n    \"bert-base-cased\": 512,\n    \"bert-large-cased\": 512,\n    \"bert-base-multilingual-uncased\": 512,\n    \"bert-base-multilingual-cased\": 512,\n    \"bert-base-chinese\": 512,\n    \"bert-base-german-cased\": 512,\n    \"bert-large-uncased-whole-word-masking\": 512,\n    \"bert-large-cased-whole-word-masking\": 512,\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-base-cased-finetuned-mrpc\": 512,\n    \"bert-base-german-dbmdz-cased\": 512,\n    \"bert-base-german-dbmdz-uncased\": 512,\n    \"TurkuNLP/bert-base-finnish-cased-v1\": 512,\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": 512,\n    \"wietsedv/bert-base-dutch-cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"bert-base-uncased\": {\"do_lower_case\": True},\n    \"bert-large-uncased\": {\"do_lower_case\": True},\n    \"bert-base-cased\": {\"do_lower_case\": False},\n    \"bert-large-cased\": {\"do_lower_case\": False},\n    \"bert-base-multilingual-uncased\": {\"do_lower_case\": True},\n    \"bert-base-multilingual-cased\": {\"do_lower_case\": False},\n    \"bert-base-chinese\": {\"do_lower_case\": False},\n    \"bert-base-german-cased\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": False},\n    \"bert-base-cased-finetuned-mrpc\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-cased\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-uncased\": {\"do_lower_case\": True},\n    \"TurkuNLP/bert-base-finnish-cased-v1\": {\"do_lower_case\": False},\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": {\"do_lower_case\": True},\n    \"wietsedv/bert-base-dutch-cased\": {\"do_lower_case\": False},\n}\n\n\ndef load_vocab(vocab_file):\n    \"\"\"Loads a vocabulary file into a dictionary.\"\"\"\n    vocab = collections.OrderedDict()\n    with open(vocab_file, \"r\", encoding=\"utf-8\") as reader:\n        tokens = reader.readlines()\n    for index, token in enumerate(tokens):\n        token = token.rstrip(\"\\n\")\n        vocab[token] = index\n    return vocab\n\n\ndef whitespace_tokenize(text):\n    \"\"\"Runs basic whitespace cleaning and splitting on a piece of text.\"\"\"\n    text = text.strip()\n    if not text:\n        return []\n    tokens = text.split()\n    return tokens\n\n\nclass BertTokenizer(PreTrainedTokenizer):\n    r\"\"\"\n    Constructs a BERT tokenizer. Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to do basic tokenization before WordPiece.\n        never_split (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            List of tokens which will never be split during tokenization. Only has an effect when\n            :obj:`do_basic_tokenize=True`\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        do_basic_tokenize=True,\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        tokenize_chinese_chars=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n        self.do_basic_tokenize = do_basic_tokenize\n        if do_basic_tokenize:\n            self.basic_tokenizer = BasicTokenizer(\n                do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=tokenize_chinese_chars\n            )\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n\n    @property\n    def vocab_size(self):\n        return len(self.vocab)\n\n    def get_vocab(self):\n        return dict(self.vocab, **self.added_tokens_encoder)\n\n    def _tokenize(self, text):\n        split_tokens = []\n        if self.do_basic_tokenize:\n            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):\n                for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                    split_tokens.append(sub_token)\n        else:\n            split_tokens = self.wordpiece_tokenizer.tokenize(text)\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.vocab.get(token, self.vocab.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.ids_to_tokens.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\" ##\", \"\").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A BERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        index = 0\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as writer:\n            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: vocabulary indices are not consecutive.\"\n                        \" Please check that the vocabulary is not corrupted!\".format(vocab_file)\n                    )\n         
           index = token_index\n                writer.write(token + \"\\n\")\n                index += 1\n        return (vocab_file,)\n\n\nclass BasicTokenizer(object):\n    \"\"\"Runs basic tokenization (punctuation splitting, lower casing, etc.).\"\"\"\n\n    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True):\n        \"\"\" Constructs a BasicTokenizer.\n\n        Args:\n            **do_lower_case**: Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **tokenize_chinese_chars**: (`optional`) boolean (default True)\n                Whether to tokenize Chinese characters.\n                This should likely be deactivated for Japanese:\n                see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328\n        \"\"\"\n        if never_split is None:\n            never_split = []\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split\n        self.tokenize_chinese_chars = tokenize_chinese_chars\n\n    def tokenize(self, text, never_split=None):\n        \"\"\" Basic Tokenization of a piece of text.\n            Split on \"white spaces\" only, for sub-word tokenization, see WordPieceTokenizer.\n\n        Args:\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n        \"\"\"\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        text = self._clean_text(text)\n        # This was added on November 1st, 2018 for the multilingual and Chinese\n        # models. 
This is also applied to the English models now, but it doesn't\n        # matter since the English models were not trained on any Chinese data\n        # and generally don't have any Chinese data in them (there are Chinese\n        # characters in the vocabulary because Wikipedia does have some Chinese\n        # words in the English Wikipedia.).\n        if self.tokenize_chinese_chars:\n            text = self._tokenize_chinese_chars(text)\n        orig_tokens = whitespace_tokenize(text)\n        split_tokens = []\n        for token in orig_tokens:\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n                token = self._run_strip_accents(token)\n            split_tokens.extend(self._run_split_on_punc(token, never_split))\n\n        output_tokens = whitespace_tokenize(\" \".join(split_tokens))\n        return output_tokens\n\n    def _run_strip_accents(self, text):\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        text = unicodedata.normalize(\"NFD\", text)\n        output = []\n        for char in text:\n            cat = unicodedata.category(char)\n            if cat == \"Mn\":\n                continue\n            output.append(char)\n        return \"\".join(output)\n\n    def _run_split_on_punc(self, text, never_split=None):\n        \"\"\"Splits punctuation on a piece of text.\"\"\"\n        if never_split is not None and text in never_split:\n            return [text]\n        chars = list(text)\n        i = 0\n        start_new_word = True\n        output = []\n        while i < len(chars):\n            char = chars[i]\n            if _is_punctuation(char):\n                output.append([char])\n                start_new_word = True\n            else:\n                if start_new_word:\n                    output.append([])\n                start_new_word = False\n                output[-1].append(char)\n            i += 1\n\n        return [\"\".join(x) for x in output]\n\n    def _tokenize_chinese_chars(self, text):\n        \"\"\"Adds whitespace around any CJK character.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if self._is_chinese_char(cp):\n                output.append(\" \")\n                output.append(char)\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n    def _is_chinese_char(self, cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. 
Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if (\n            (cp >= 0x4E00 and cp <= 0x9FFF)\n            or (cp >= 0x3400 and cp <= 0x4DBF)  #\n            or (cp >= 0x20000 and cp <= 0x2A6DF)  #\n            or (cp >= 0x2A700 and cp <= 0x2B73F)  #\n            or (cp >= 0x2B740 and cp <= 0x2B81F)  #\n            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #\n            or (cp >= 0xF900 and cp <= 0xFAFF)\n            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #\n        ):  #\n            return True\n\n        return False\n\n    def _clean_text(self, text):\n        \"\"\"Performs invalid character removal and whitespace cleanup on text.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if cp == 0 or cp == 0xFFFD or _is_control(char):\n                continue\n            if _is_whitespace(char):\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n\nclass WordpieceTokenizer(object):\n    \"\"\"Runs WordPiece tokenization.\"\"\"\n\n    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.max_input_chars_per_word = max_input_chars_per_word\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into its word pieces.\n\n        This uses a greedy longest-match-first algorithm to perform tokenization\n        using the given vocabulary.\n\n        For example:\n          input = \"unaffable\"\n          output = [\"un\", \"##aff\", \"##able\"]\n\n        Args:\n          text: A single token or whitespace separated tokens. 
This should have\n            already been passed through `BasicTokenizer`.\n\n        Returns:\n          A list of wordpiece tokens.\n        \"\"\"\n\n        output_tokens = []\n        for token in whitespace_tokenize(text):\n            chars = list(token)\n            if len(chars) > self.max_input_chars_per_word:\n                output_tokens.append(self.unk_token)\n                continue\n\n            is_bad = False\n            start = 0\n            sub_tokens = []\n            while start < len(chars):\n                end = len(chars)\n                cur_substr = None\n                while start < end:\n                    substr = \"\".join(chars[start:end])\n                    if start > 0:\n                        substr = \"##\" + substr\n                    if substr in self.vocab:\n                        cur_substr = substr\n                        break\n                    end -= 1\n                if cur_substr is None:\n                    is_bad = True\n                    break\n                sub_tokens.append(cur_substr)\n                start = end\n\n            if is_bad:\n                output_tokens.append(self.unk_token)\n            else:\n                output_tokens.extend(sub_tokens)\n        return output_tokens\n\n\ndef _is_whitespace(char):\n    \"\"\"Checks whether `chars` is a whitespace character.\"\"\"\n    # \\t, \\n, and \\r are technically contorl characters but we treat them\n    # as whitespace since they are generally considered as such.\n    if char == \" \" or char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return True\n    cat = unicodedata.category(char)\n    if cat == \"Zs\":\n        return True\n    return False\n\n\ndef _is_control(char):\n    \"\"\"Checks whether `chars` is a control character.\"\"\"\n    # These are technically control characters but we count them as whitespace\n    # characters.\n    if char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return False\n    cat = unicodedata.category(char)\n    if cat.startswith(\"C\"):\n        return True\n    return False\n\n\ndef _is_punctuation(char):\n    \"\"\"Checks whether `chars` is a punctuation character.\"\"\"\n    cp = ord(char)\n    # We treat all non-letter/number ASCII as punctuation.\n    # Characters such as \"^\", \"$\", and \"`\" are not in the Unicode\n    # Punctuation class but we treat them as punctuation anyways, for\n    # consistency.\n    if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):\n        return True\n    cat = unicodedata.category(char)\n    if cat.startswith(\"P\"):\n        return True\n    return False\n\n\nclass BertTokenizerFast(PreTrainedTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" BERT tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Bert tokenization is Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n        clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to clean the text before tokenization by removing any control characters and\n            replacing all whitespaces by the classic one.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        clean_text=True,\n        tokenize_chinese_chars=True,\n        strip_accents=True,\n        wordpieces_prefix=\"##\",\n        **kwargs\n    ):\n        super().__init__(\n            BertWordPieceTokenizer(\n                vocab_file=vocab_file,\n                unk_token=unk_token,\n                sep_token=sep_token,\n                cls_token=cls_token,\n                clean_text=clean_text,\n                handle_chinese_chars=tokenize_chinese_chars,\n                strip_accents=strip_accents,\n                lowercase=do_lower_case,\n                wordpieces_prefix=wordpieces_prefix,\n            ),\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        self.do_lower_case = do_lower_case\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.cls_token_id] + token_ids_0 + 
[self.sep_token_id]\n\n        if token_ids_1:\n            output += token_ids_1 + [self.sep_token_id]\n\n        return output\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n"
  },
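The `WordpieceTokenizer` in the file above performs greedy longest-match-first subword splitting. Below is a minimal standalone sketch of that loop; the toy vocabulary and example word are illustrative only and not taken from the repository or the competition data.

```python
# Minimal sketch of greedy longest-match-first WordPiece splitting,
# mirroring WordpieceTokenizer.tokenize above. The toy vocab is made up.
def wordpiece(token, vocab, unk_token="[UNK]"):
    chars = list(token)
    start, sub_tokens = 0, []
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        while start < end:
            substr = "".join(chars[start:end])
            if start > 0:
                substr = "##" + substr  # continuation pieces carry the "##" prefix
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1  # shrink the window until a vocabulary entry matches
        if cur_substr is None:
            return [unk_token]  # no piece matches: the whole token becomes [UNK]
        sub_tokens.append(cur_substr)
        start = end
    return sub_tokens

toy_vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```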
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_bert_japanese.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import Optional\n\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer, load_vocab\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"cl-tohoku/bert-base-japanese\": 512,\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": 512,\n    \"cl-tohoku/bert-base-japanese-char\": 512,\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"cl-tohoku/bert-base-japanese\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-char\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n}\n\n\nclass BertJapaneseTokenizer(BertTokenizer):\n    \"\"\"BERT tokenizer for Japanese text\"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        do_word_tokenize=True,\n        do_subword_tokenize=True,\n        word_tokenizer_type=\"basic\",\n        subword_tokenizer_type=\"wordpiece\",\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        
mask_token=\"[MASK]\",\n        mecab_kwargs=None,\n        **kwargs\n    ):\n        \"\"\"Constructs a MecabBertTokenizer.\n\n        Args:\n            **vocab_file**: Path to a one-wordpiece-per-line vocabulary file.\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n                Only has an effect when do_basic_tokenize=True.\n            **do_word_tokenize**: (`optional`) boolean (default True)\n                Whether to do word tokenization.\n            **do_subword_tokenize**: (`optional`) boolean (default True)\n                Whether to do subword tokenization.\n            **word_tokenizer_type**: (`optional`) string (default \"basic\")\n                Type of word tokenizer.\n            **subword_tokenizer_type**: (`optional`) string (default \"wordpiece\")\n                Type of subword tokenizer.\n            **mecab_kwargs**: (`optional`) dict passed to `MecabTokenizer` constructor (default None)\n        \"\"\"\n        super(BertTokenizer, self).__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n        # ^^ We call the grandparent's init, not the parent's.\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n\n        self.do_word_tokenize = do_word_tokenize\n        if do_word_tokenize:\n            if word_tokenizer_type == \"basic\":\n                self.word_tokenizer = BasicTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=False\n                )\n            elif word_tokenizer_type == \"mecab\":\n                self.word_tokenizer = MecabTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})\n                )\n            else:\n                raise ValueError(\"Invalid word_tokenizer_type '{}' is specified.\".format(word_tokenizer_type))\n\n        self.do_subword_tokenize = do_subword_tokenize\n        if do_subword_tokenize:\n            if subword_tokenizer_type == \"wordpiece\":\n                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            elif subword_tokenizer_type == \"character\":\n                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            else:\n                raise ValueError(\"Invalid subword_tokenizer_type '{}' is specified.\".format(subword_tokenizer_type))\n\n    def _tokenize(self, text):\n        if self.do_word_tokenize:\n            tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)\n        else:\n            tokens = [text]\n\n        if self.do_subword_tokenize:\n            split_tokens = [sub_token for token in tokens for sub_token in self.subword_tokenizer.tokenize(token)]\n        else:\n            split_tokens = tokens\n\n        return split_tokens\n\n\nclass MecabTokenizer:\n    \"\"\"Runs basic tokenization with MeCab 
morphological parser.\"\"\"\n\n    def __init__(self, do_lower_case=False, never_split=None, normalize_text=True, mecab_option: Optional[str] = None):\n        \"\"\"Constructs a MecabTokenizer.\n\n        Args:\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n            **mecab_option**: (`optional`) string passed to `MeCab.Tagger` constructor (default \"\")\n        \"\"\"\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split if never_split is not None else []\n        self.normalize_text = normalize_text\n\n        import MeCab\n\n        self.mecab = MeCab.Tagger(mecab_option) if mecab_option is not None else MeCab.Tagger()\n\n    def tokenize(self, text, never_split=None, **kwargs):\n        \"\"\"Tokenizes a piece of text.\"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        tokens = []\n\n        mecab_output = self.mecab.parse(text)\n\n        cursor = 0\n        for line in mecab_output.split(\"\\n\"):\n            if line == \"EOS\":\n                break\n\n            token, _ = line.split(\"\\t\")\n            token_start = text.index(token, cursor)\n            token_end = token_start + len(token)\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n\n            tokens.append(token)\n            cursor = token_end\n\n        return tokens\n\n\nclass CharacterTokenizer(object):\n    \"\"\"Runs Character tokenziation.\"\"\"\n\n    def __init__(self, vocab, unk_token, normalize_text=True):\n        \"\"\"Constructs a CharacterTokenizer.\n\n        Args:\n            **vocab**:\n                Vocabulary object.\n            **unk_token**: str\n                A special symbol for out-of-vocabulary token.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n        \"\"\"\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.normalize_text = normalize_text\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into characters.\n\n        For example:\n            input = \"apple\"\n            output = [\"a\", \"p\", \"p\", \"l\", \"e\"]\n        Args:\n            text: A single token or whitespace separated tokens.\n                This should have already been passed through `BasicTokenizer`.\n        Returns:\n            A list of characters.\n        \"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        output_tokens = []\n        for i, char in enumerate(text):\n            if char not in self.vocab:\n                output_tokens.append(self.unk_token)\n                continue\n\n            output_tokens.append(char)\n\n        return output_tokens\n"
  },
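`BertJapaneseTokenizer._tokenize` above composes two stages: a word-level tokenizer runs first, then every word is passed through a subword tokenizer. The sketch below shows only that composition under simplified assumptions: whitespace splitting stands in for MeCab, and the character-level fallback to `[UNK]` stands in for `CharacterTokenizer`.

```python
# Sketch of the two-stage split in BertJapaneseTokenizer._tokenize above.
# Both stages are stand-ins: whitespace words, then characters with an [UNK]
# fallback; the real class plugs in MeCab and WordPiece instead.
def character_tokenize(token, vocab, unk_token="[UNK]"):
    return [ch if ch in vocab else unk_token for ch in token]

def tokenize(text, vocab):
    words = text.split()  # stand-in word tokenizer
    return [piece for w in words for piece in character_tokenize(w, vocab)]

print(tokenize("ab cd", {"a", "b", "c"}))  # ['a', 'b', 'c', '[UNK]']
```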
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for Camembert model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nimport sentencepiece as spm\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"camembert-base\": None,\n}\n\nSHARED_MODEL_IDENTIFIERS = [\n    # Load with\n    # `tokenizer = AutoTokenizer.from_pretrained(\"username/pretrained_model\")`\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n]\n\n\nclass CamembertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). 
It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<s>NOTUSED\", \"</s>NOTUSED\"],\n        **kwargs\n    ):\n        super().__init__(\n            max_len=512,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n        # HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual\n        # sentencepiece vocabulary (this is the case for <s> and </s>\n        self.fairseq_tokens_to_ids = {\"<s>NOTUSED\": 0, \"<pad>\": 1, \"</s>NOTUSED\": 2, \"<unk>\": 3}\n        self.fairseq_offset = len(self.fairseq_tokens_to_ids)\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + len(self.fairseq_tokens_to_ids)\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A CamemBERT sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        CamemBERT, like RoBERTa, does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.fairseq_tokens_to_ids) + len(self.sp_model)\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        elif self.sp_model.PieceToId(token) == 0:\n            # Convert sentence piece unk token to fairseq unk token index\n            return self.unk_token_id\n        return self.fairseq_offset + self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
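`CamembertTokenizer._convert_token_to_id` above remaps SentencePiece ids into a fairseq-style layout: four reserved tokens occupy ids 0–3, every SentencePiece id is shifted by that offset, and SentencePiece's internal unknown id (0) is redirected to the tokenizer's unk id. A minimal sketch of that mapping, with `sp_piece_to_id` and the toy pieces as stand-ins for a real `sp_model`:

```python
# Sketch of the fairseq-style id remapping in CamembertTokenizer above.
# `sp_piece_to_id` is a stand-in for sp_model.PieceToId; the toy pieces are made up.
FAIRSEQ_TOKENS_TO_IDS = {"<s>NOTUSED": 0, "<pad>": 1, "</s>NOTUSED": 2, "<unk>": 3}
FAIRSEQ_OFFSET = len(FAIRSEQ_TOKENS_TO_IDS)
UNK_TOKEN_ID = 3

def convert_token_to_id(token, sp_piece_to_id):
    if token in FAIRSEQ_TOKENS_TO_IDS:
        return FAIRSEQ_TOKENS_TO_IDS[token]
    piece_id = sp_piece_to_id(token)
    if piece_id == 0:                 # SentencePiece's own unknown piece
        return UNK_TOKEN_ID
    return FAIRSEQ_OFFSET + piece_id  # shift past the reserved ids

toy_pieces = {"▁bonjour": 10, "▁monde": 11}
print(convert_token_to_id("▁bonjour", lambda t: toy_pieces.get(t, 0)))  # 14
print(convert_token_to_id("xyz", lambda t: toy_pieces.get(t, 0)))       # 3
```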
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Salesforce CTRL.\"\"\"\n\n\nimport json\nimport logging\nimport os\n\nimport regex as re\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json\"},\n    \"merges_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"ctrl\": 256,\n}\n\nCONTROL_CODES = {\n    \"Pregnancy\": 168629,\n    \"Christianity\": 7675,\n    \"Explain\": 106423,\n    \"Fitness\": 63440,\n    \"Saving\": 63163,\n    \"Ask\": 27171,\n    \"Ass\": 95985,\n    \"Joke\": 163509,\n    \"Questions\": 45622,\n    \"Thoughts\": 49605,\n    \"Retail\": 52342,\n    \"Feminism\": 164338,\n    \"Writing\": 11992,\n    \"Atheism\": 192263,\n    \"Netflix\": 48616,\n    \"Computing\": 39639,\n    \"Opinion\": 43213,\n    \"Alone\": 44967,\n    \"Funny\": 58917,\n    \"Gaming\": 40358,\n    \"Human\": 4088,\n    \"India\": 1331,\n    \"Joker\": 77138,\n    \"Diet\": 36206,\n    \"Legal\": 11859,\n    \"Norman\": 4939,\n    \"Tip\": 72689,\n    \"Weight\": 52343,\n    \"Movies\": 46273,\n    \"Running\": 23425,\n    \"Science\": 2090,\n    \"Horror\": 37793,\n    \"Confession\": 60572,\n    \"Finance\": 12250,\n    \"Politics\": 16360,\n    \"Scary\": 191985,\n    \"Support\": 12654,\n    \"Technologies\": 32516,\n    \"Teenage\": 66160,\n    \"Event\": 32769,\n    \"Learned\": 67460,\n    \"Notion\": 182770,\n    \"Wikipedia\": 37583,\n    \"Books\": 6665,\n    \"Extract\": 76050,\n    \"Confessions\": 102701,\n    \"Conspiracy\": 75932,\n    \"Links\": 63674,\n    \"Narcissus\": 150425,\n    \"Relationship\": 54766,\n    \"Relationships\": 134796,\n    \"Reviews\": 41671,\n    \"News\": 4256,\n    \"Translation\": 26820,\n    \"multilingual\": 128406,\n}\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n\n    pairs = set(pairs)\n    return pairs\n\n\nclass CTRLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs a CTRL tokenizer. Peculiarities:\n\n    - Byte-Pair-Encoding\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    control_codes = CONTROL_CODES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        word = tuple(list(word[:-1]) + [word[-1] + \"</w>\"])\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \"@@ \".join(word)\n        word = word[:-4]\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string.\n        \"\"\"\n        split_tokens = []\n\n        words = re.findall(r\"\\S+\\n?\", text)\n\n        for token in words:\n            split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\"@@ \", \"\").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    # def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):\n    #     filtered_tokens = ' '.join(self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens))\n    #     tokens_generated_so_far = re.sub('(@@ )', '', string=filtered_tokens)\n    #     tokens_generated_so_far = re.sub('(@@ ?$)', '', string=tokens_generated_so_far)\n    #     return ''.join(tokens_generated_so_far)\n"
  },
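The core of `CTRLTokenizer.bpe` above is repeatedly merging the adjacent symbol pair with the lowest merge rank. The sketch below isolates a single merge step under made-up ranks; it simplifies the original index-based scan but performs the same pair selection and merge.

```python
# Sketch of one step of the BPE loop in CTRLTokenizer.bpe above: collect adjacent
# symbol pairs and merge the pair with the lowest rank. Ranks here are illustrative.
def get_pairs(word):
    return {(a, b) for a, b in zip(word, word[1:])}

def merge_once(word, bpe_ranks):
    pairs = get_pairs(word)
    bigram = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
    if bigram not in bpe_ranks:
        return word  # nothing left to merge
    first, second = bigram
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
            out.append(first + second)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

ranks = {("l", "o"): 0, ("lo", "w"): 1}
print(merge_once(("l", "o", "w"), ranks))  # ('lo', 'w')
```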
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for DistilBERT.\"\"\"\n\n\nimport logging\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-vocab.txt\",\n        \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"distilbert-base-uncased\": 512,\n    \"distilbert-base-uncased-distilled-squad\": 512,\n    \"distilbert-base-cased\": 512,\n    \"distilbert-base-cased-distilled-squad\": 512,\n    \"distilbert-base-german-cased\": 512,\n    \"distilbert-base-multilingual-cased\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"distilbert-base-uncased\": {\"do_lower_case\": True},\n    \"distilbert-base-uncased-distilled-squad\": {\"do_lower_case\": True},\n    \"distilbert-base-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-cased-distilled-squad\": {\"do_lower_case\": False},\n    \"distilbert-base-german-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-multilingual-cased\": {\"do_lower_case\": False},\n}\n\n\nclass DistilBertTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs a  DistilBertTokenizer.\n\n    :class:`~transformers1.DistilBertTokenizer is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n\n\nclass DistilBertTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a  \"Fast\" DistilBertTokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.DistilBertTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: 
punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_electra.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Google AI Team, Stanford University and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/vocab.txt\",\n        \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/vocab.txt\",\n        \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/vocab.txt\",\n        \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/vocab.txt\",\n        \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/vocab.txt\",\n        \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/electra-small-generator\": 512,\n    \"google/electra-base-generator\": 512,\n    \"google/electra-large-generator\": 512,\n    \"google/electra-small-discriminator\": 512,\n    \"google/electra-base-discriminator\": 512,\n    \"google/electra-large-discriminator\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"google/electra-small-generator\": {\"do_lower_case\": True},\n    \"google/electra-base-generator\": {\"do_lower_case\": True},\n    \"google/electra-large-generator\": {\"do_lower_case\": True},\n    \"google/electra-small-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-base-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-large-discriminator\": {\"do_lower_case\": True},\n}\n\n\nclass ElectraTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs an Electra tokenizer.\n    :class:`~transformers1.ElectraTokenizer` is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n\n\nclass ElectraTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" Electra Fast tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.ElectraTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    
Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Flaubert, based on XLM.\"\"\"\n\n\nimport logging\nimport unicodedata\n\nimport six\n\nfrom .tokenization_xlm import XLMTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/vocab.json\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/vocab.json\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/vocab.json\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/vocab.json\",\n    },\n    \"merges_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/merges.txt\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/merges.txt\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/merges.txt\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"flaubert/flaubert_small_cased\": 512,\n    \"flaubert/flaubert_base_uncased\": 512,\n    \"flaubert/flaubert_base_cased\": 512,\n    \"flaubert/flaubert_large_cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"flaubert/flaubert_small_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_base_uncased\": {\"do_lowercase\": True},\n    \"flaubert/flaubert_base_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_large_cased\": {\"do_lowercase\": False},\n}\n\n\ndef convert_to_unicode(text):\n    \"\"\"\n    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.\n    \"\"\"\n    # six_ensure_text is copied from https://github.com/benjaminp/six\n    def six_ensure_text(s, encoding=\"utf-8\", errors=\"strict\"):\n        if isinstance(s, six.binary_type):\n            return s.decode(encoding, errors)\n        elif isinstance(s, six.text_type):\n            return s\n        else:\n            raise TypeError(\"not expecting type '%s'\" % type(s))\n\n    return six_ensure_text(text, encoding=\"utf-8\", errors=\"ignore\")\n\n\nclass FlaubertTokenizer(XLMTokenizer):\n    \"\"\"\n    BPE tokenizer for Flaubert\n\n    - Moses preprocessing & tokenization\n    - Normalize all inputs text\n    - argument ``special_tokens`` and function 
``set_special_tokens``, can be used to add additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `do_lowercase` controle lower casing (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.XLMTokenizer`. Please check the superclass for usage examples\n    and documentation regarding arguments.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, do_lowercase=False, **kwargs):\n        super().__init__(**kwargs)\n        self.do_lowercase = do_lowercase\n        self.do_lowercase_and_remove_accent = False\n\n    def preprocess_text(self, text):\n        text = text.replace(\"``\", '\"').replace(\"''\", '\"')\n        text = convert_to_unicode(text)\n        text = unicodedata.normalize(\"NFC\", text)\n\n        if self.do_lowercase:\n            text = text.lower()\n\n        return text\n\n    def _tokenize(self, text, bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code using Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n\n        Args:\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        lang = \"fr\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n\n        if bypass_tokenizer:\n            text = text.split()\n        else:\n            text = self.preprocess_text(text)\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.moses_tokenize(text, lang=lang)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n"
  },
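`FlaubertTokenizer.preprocess_text` above normalizes quotes, applies Unicode NFC normalization, and optionally lowercases before Moses tokenization and BPE. A small self-contained sketch of just that preprocessing step:

```python
# Sketch of FlaubertTokenizer.preprocess_text above: normalize quote marks,
# apply NFC normalization, and optionally lowercase the input.
import unicodedata

def preprocess_text(text, do_lowercase=False):
    text = text.replace("``", '"').replace("''", '"')
    text = unicodedata.normalize("NFC", text)
    return text.lower() if do_lowercase else text

print(preprocess_text("``Bonjour''", do_lowercase=True))  # prints "bonjour" (quotes normalized, lowercased)
```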
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nfrom functools import lru_cache\n\nimport regex as re\nfrom tokenizers import ByteLevelBPETokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json\",\n    },\n    \"merges_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"gpt2\": 1024,\n    \"gpt2-medium\": 1024,\n    \"gpt2-large\": 1024,\n    \"gpt2-xl\": 1024,\n    \"distilgpt2\": 1024,\n}\n\n\n@lru_cache()\ndef bytes_to_unicode():\n    \"\"\"\n    Returns list of utf-8 byte and a mapping to unicode strings.\n    We specifically avoids mapping to whitespace/control characters the bpe code barfs on.\n\n    The reversible bpe codes work on unicode strings.\n    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.\n    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.\n    This is a signficant percentage of your normal, say, 32K bpe vocab.\n    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.\n    \"\"\"\n    bs = (\n        list(range(ord(\"!\"), ord(\"~\") + 1)) + list(range(ord(\"¡\"), ord(\"¬\") + 1)) + list(range(ord(\"®\"), ord(\"ÿ\") + 1))\n    )\n    cs = bs[:]\n    n = 0\n    for b in range(2 ** 8):\n        if b not in bs:\n            bs.append(b)\n            cs.append(2 ** 8 + n)\n            n += 1\n    cs = [chr(n) for n in cs]\n    return dict(zip(bs, cs))\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    
\"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\nclass GPT2Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n    GPT-2 BPE tokenizer. Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        **kwargs\n    ):\n        super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        self.errors = errors  # how to handle errors in decoding\n        self.byte_encoder = bytes_to_unicode()\n        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            bpe_merges = merges_handle.read().split(\"\\n\")[1:-1]\n        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]\n        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))\n        self.cache = {}\n\n        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions\n        self.pat = re.compile(r\"\"\"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+\"\"\")\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        pairs = get_pairs(word)\n\n        if not 
pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. \"\"\"\n        bpe_tokens = []\n        for token in re.findall(self.pat, text):\n            token = \"\".join(\n                self.byte_encoder[b] for b in token.encode(\"utf-8\")\n            )  # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)\n            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(\" \"))\n        return bpe_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        text = \"\".join(tokens)\n        text = bytearray([self.byte_decoder[c] for c in text]).decode(\"utf-8\", errors=self.errors)\n        return text\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        if \"add_prefix_space\" in kwargs and kwargs[\"add_prefix_space\"]:\n            return \" \" + text\n        return text\n\n\nclass GPT2TokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        add_prefix_space=False,\n        trim_offsets=True,\n        **kwargs\n    ):\n        super().__init__(\n            ByteLevelBPETokenizer(\n                vocab_file=vocab_file,\n                merges_file=merges_file,\n                add_prefix_space=add_prefix_space,\n                trim_offsets=trim_offsets,\n            ),\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_longformer_models = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n]\n\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"allenai/longformer-base-4096\": 4096,\n    \"allenai/longformer-large-4096\": 4096,\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": 4096,\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": 4096,\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": 4096,\n}\n\n\nclass LongformerTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n\n\nclass LongformerTokenizerFast(RobertaTokenizerFast):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_marian.py",
    "content": "import json\nimport re\nimport warnings\nfrom pathlib import Path\nfrom shutil import copyfile\nfrom typing import Dict, List, Optional, Tuple, Union\n\nimport sentencepiece\n\nfrom .file_utils import S3_BUCKET_PREFIX\nfrom .tokenization_utils import BatchEncoding, PreTrainedTokenizer\n\n\nvocab_files_names = {\n    \"source_spm\": \"source.spm\",\n    \"target_spm\": \"target.spm\",\n    \"vocab\": \"vocab.json\",\n    \"tokenizer_config_file\": \"tokenizer_config.json\",\n}\nMODEL_NAMES = (\"opus-mt-en-de\",)  # TODO(SS): delete this, the only required constant is vocab_files_names\nPRETRAINED_VOCAB_FILES_MAP = {\n    k: {m: f\"{S3_BUCKET_PREFIX}/Helsinki-NLP/{m}/{fname}\" for m in MODEL_NAMES}\n    for k, fname in vocab_files_names.items()\n}\n# Example URL https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/vocab.json\n\n\nclass MarianTokenizer(PreTrainedTokenizer):\n    \"\"\"Sentencepiece tokenizer for marian. Source and target languages have different SPM models.\n    The logic is use the relevant source_spm or target_spm to encode txt as pieces, then look up each piece in a vocab dictionary.\n\n    Examples::\n\n        from transformers1 import MarianTokenizer\n        tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')\n        src_texts = [ \"I am a small frog.\", \"Tom asked his teacher for advice.\"]\n        tgt_texts = [\"Ich bin ein kleiner Frosch.\", \"Tom bat seinen Lehrer um Rat.\"]  # optional\n        batch_enc: BatchEncoding = tok.prepare_translation_batch(src_texts, tgt_texts=tgt_texts)\n        # keys  [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask].\n        # model(**batch) should work\n    \"\"\"\n\n    vocab_files_names = vocab_files_names\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = {m: 512 for m in MODEL_NAMES}\n    model_input_names = [\"attention_mask\"]  # actually attention_mask, decoder_attention_mask\n    language_code_re = re.compile(\">>.+<<\")  # type: re.Pattern\n\n    def __init__(\n        self,\n        vocab=None,\n        source_spm=None,\n        target_spm=None,\n        source_lang=None,\n        target_lang=None,\n        unk_token=\"<unk>\",\n        eos_token=\"</s>\",\n        pad_token=\"<pad>\",\n        max_len=512,\n        **kwargs,\n    ):\n\n        super().__init__(\n            # bos_token=bos_token,  unused. 
Start decoding with config.decoder_start_token_id\n            max_len=max_len,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            **kwargs,\n        )\n        self.encoder = load_json(vocab)\n        if self.unk_token not in self.encoder:\n            raise KeyError(\"<unk> token must be in vocab\")\n        assert self.pad_token in self.encoder\n        self.decoder = {v: k for k, v in self.encoder.items()}\n\n        self.source_lang = source_lang\n        self.target_lang = target_lang\n        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(\">>\") and k.endswith(\"<<\")]\n        self.spm_files = [source_spm, target_spm]\n\n        # load SentencePiece model for pre-processing\n        self.spm_source = load_spm(source_spm)\n        self.spm_target = load_spm(target_spm)\n        self.current_spm = self.spm_source\n\n        # Multilingual target side: default to using first supported language code.\n\n        self._setup_normalizer()\n\n    def _setup_normalizer(self):\n        try:\n            from mosestokenizer import MosesPunctuationNormalizer\n\n            self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)\n        except ImportError:\n            warnings.warn(\"Recommended: pip install mosestokenizer\")\n            self.punc_normalizer = lambda x: x\n\n    def normalize(self, x: str) -> str:\n        \"\"\"Cover moses empty string edge case. They return empty list for '' input!\"\"\"\n        return self.punc_normalizer(x) if x else \"\"\n\n    def _convert_token_to_id(self, token):\n        return self.encoder.get(token, self.encoder[self.unk_token])\n\n    def remove_language_code(self, text: str):\n        \"\"\"Remove language codes like <<fr>> before sentencepiece\"\"\"\n        match = self.language_code_re.match(text)\n        code: list = [match.group(0)] if match else []\n        return code, self.language_code_re.sub(\"\", text)\n\n    def _tokenize(self, text: str) -> List[str]:\n        code, text = self.remove_language_code(text)\n        pieces = self.current_spm.EncodeAsPieces(text)\n        return code + pieces\n\n    def _convert_id_to_token(self, index: int) -> str:\n        \"\"\"Converts an index (integer) in a token (str) using the encoder.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\"Uses target language sentencepiece model\"\"\"\n        return self.spm_target.DecodePieces(tokens)\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:\n        \"\"\"Build model inputs from a sequence by appending eos_token_id.\"\"\"\n        if token_ids_1 is None:\n            return token_ids_0 + [self.eos_token_id]\n        # We don't expect to process pairs, but leave the pair logic for API consistency\n        return token_ids_0 + token_ids_1 + [self.eos_token_id]\n\n    def prepare_translation_batch(\n        self,\n        src_texts: List[str],\n        tgt_texts: Optional[List[str]] = None,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = True,\n        return_tensors: str = \"pt\",\n    ) -> BatchEncoding:\n        \"\"\"Prepare model inputs for translation. 
For best performance, translate one sentence at a time.\n        Arguments:\n            src_texts: list of src language texts\n            tgt_texts: list of tgt language texts\n            max_length: (None) defer to config (1024 for mbart-large-en-ro)\n            pad_to_max_length: (bool)\n            return_tensors: (str) default \"pt\" returns pytorch tensors, pass None to return lists.\n\n        Returns:\n            BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask]\n            all shaped bs, seq_len. (BatchEncoding is a dict of string -> tensor or lists).\n            If no tgt_text is specified, the only keys will be input_ids and attention_mask.\n        \"\"\"\n        if \"\" in src_texts:\n            raise ValueError(f\"found empty string in src_texts: {src_texts}\")\n        self.current_spm = self.spm_source\n        src_texts = [self.normalize(t) for t in src_texts]  # this does not appear to do much\n        model_inputs: BatchEncoding = self.batch_encode_plus(\n            src_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        if tgt_texts is None:\n            return model_inputs\n\n        self.current_spm = self.spm_target\n        decoder_inputs: BatchEncoding = self.batch_encode_plus(\n            tgt_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        for k, v in decoder_inputs.items():\n            model_inputs[f\"decoder_{k}\"] = v\n        self.current_spm = self.spm_source\n        return model_inputs\n\n    @property\n    def vocab_size(self) -> int:\n        return len(self.encoder)\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        \"\"\"save vocab file to json and copy spm files from their original path.\"\"\"\n        save_dir = Path(save_directory)\n        assert save_dir.is_dir(), f\"{save_directory} should be a directory\"\n        save_json(self.encoder, save_dir / self.vocab_files_names[\"vocab\"])\n\n        for f in self.spm_files:\n            dest_path = save_dir / Path(f).name\n            if not dest_path.exists():\n                copyfile(f, save_dir / Path(f).name)\n        return tuple(save_dir / f for f in self.vocab_files_names)\n\n    def get_vocab(self) -> Dict:\n        vocab = self.encoder.copy()\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self) -> Dict:\n        state = self.__dict__.copy()\n        state.update({k: None for k in [\"spm_source\", \"spm_target\", \"current_spm\", \"punc_normalizer\"]})\n        return state\n\n    def __setstate__(self, d: Dict) -> None:\n        self.__dict__ = d\n        self.spm_source, self.spm_target = (load_spm(f) for f in self.spm_files)\n        self.current_spm = self.spm_source\n        self._setup_normalizer()\n\n    def num_special_tokens_to_add(self, **unused):\n        \"\"\"Just EOS\"\"\"\n        return 1\n\n    def _special_token_mask(self, seq):\n        all_special_ids = set(self.all_special_ids)  # call it once instead of inside list comp\n        all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special\n        return [1 if x in all_special_ids else 0 for x in seq]\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: 
Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"Get list where entries are [1] if a token is [eos] or [pad] else 0.\"\"\"\n        if already_has_special_tokens:\n            return self._special_token_mask(token_ids_0)\n        elif token_ids_1 is None:\n            return self._special_token_mask(token_ids_0) + [1]\n        else:\n            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]\n\n\ndef load_spm(path: str) -> sentencepiece.SentencePieceProcessor:\n    spm = sentencepiece.SentencePieceProcessor()\n    spm.Load(path)\n    return spm\n\n\ndef save_json(data, path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(data, f, indent=2)\n\n\ndef load_json(path: str) -> Union[Dict, List]:\n    with open(path, \"r\") as f:\n        return json.load(f)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\n\nfrom tokenizers import CharBPETokenizer\n\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json\"},\n    \"merges_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"openai-gpt\": 512,\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef text_standardize(text):\n    \"\"\"\n    fixes some issues the spacy tokenizer had on books corpus\n    also does some whitespace standardization\n    \"\"\"\n    text = text.replace(\"—\", \"-\")\n    text = text.replace(\"–\", \"-\")\n    text = text.replace(\"―\", \"-\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"´\", \"'\")\n    text = re.sub(r\"\"\"(-+|~+|!+|\"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)\"\"\", r\" \\1 \", text)\n    text = re.sub(r\"\\s*\\n\\s*\", \" \\n \", text)\n    text = re.sub(r\"[^\\S\\n]+\", \" \", text)\n    return text.strip()\n\n\nclass OpenAIGPTTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer. Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        try:\n            import ftfy\n            from spacy.lang.en import English\n\n            _nlp = English()\n            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)\n            self.fix_text = ftfy.fix_text\n        except ImportError:\n            logger.warning(\"ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.\")\n            self.nlp = BasicTokenizer(do_lower_case=True)\n            self.fix_text = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. 
\"\"\"\n        split_tokens = []\n        if self.fix_text is None:\n            # Using BERT's BasicTokenizer\n            text = self.nlp.tokenize(text)\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        else:\n            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)\n            text = self.nlp(text_standardize(self.fix_text(text)))\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token.text.lower()).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n\nclass OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" BPE tokenizer for OpenAI GPT (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        kwargs.setdefault(\"unk_token\", unk_token)\n        super().__init__(\n            CharBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token, lowercase=True),\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model Reformer.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/spiece.model\"\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/reformer-crime-and-punishment\": 524288,\n}\n\n\nclass ReformerTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an Reformer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        additional_special_tokens=[],\n        **kwargs\n    ):\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size()\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for RoBERTa.\"\"\"\n\n\nimport logging\nfrom typing import List, Optional\n\nfrom tokenizers import AddedToken\nfrom tokenizers.processors import RobertaProcessing\n\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-vocab.json\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n    },\n    \"merges_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-merges.txt\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"roberta-base\": 512,\n    \"roberta-large\": 512,\n    \"roberta-large-mnli\": 512,\n    \"distilroberta-base\": 512,\n    \"roberta-base-openai-detector\": 512,\n    \"roberta-large-openai-detector\": 512,\n}\n\n\nclass RobertaTokenizer(GPT2Tokenizer):\n    \"\"\"\n    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. 
Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            errors=errors,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A RoBERTa sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    def prepare_for_tokenization(self, text, add_special_tokens=False, **kwargs):\n        if \"add_prefix_space\" in kwargs:\n            add_prefix_space = kwargs[\"add_prefix_space\"]\n        else:\n            add_prefix_space = add_special_tokens\n        if add_prefix_space and not text[0].isspace():\n            text = \" \" + text\n        return text\n\n\nclass RobertaTokenizerFast(GPT2TokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        add_prefix_space=True,\n        trim_offsets=True,\n        **kwargs\n    ):\n        kwargs.setdefault(\"pad_token\", pad_token)\n        kwargs.setdefault(\"sep_token\", sep_token)\n        kwargs.setdefault(\"cls_token\", cls_token)\n        kwargs.setdefault(\"mask_token\", mask_token)\n\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            unk_token=unk_token,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n            **kwargs,\n        )\n\n        self.backend_tokenizer._tokenizer.post_processor = RobertaProcessing(\n            sep=(sep_token, self.sep_token_id),\n            cls=(cls_token, self.cls_token_id),\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n        )\n\n        self.backend_tokenizer.add_special_tokens([kwargs[\"mask_token\"]])\n\n    @PreTrainedTokenizer.mask_token.setter\n    def mask_token(self, value):\n        if not isinstance(value, AddedToken):\n            value = AddedToken(value, lstrip=True)\n\n        self._mask_token = str(value)\n        self._maybe_update_backend([value])\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]\n        if token_ids_1 is None:\n            return output\n\n        return output + [self.eos_token_id] + token_ids_1 + [self.eos_token_id]\n\n    def 
create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model T5.\"\"\"\n\n\nimport logging\nimport os\nimport re\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"t5-small\": 512,\n    \"t5-base\": 512,\n    \"t5-large\": 512,\n    \"t5-3b\": 512,\n    \"t5-11b\": 512,\n}\n\n\nclass T5Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            extra_ids (:obj:`List[str]`, `optional`, defaults to :obj:`100`):\n                Add a number of extra ids added to the end of the vocabulary for use as sentinels.\n                These tokens are accessible as \"<extra_id_{%d}>\" where \"{%d}\" is a number between 0 and extra_ids-1.\n                Extra tokens are indexed from the end of the vocabulary up to beginnning (\"<extra_id_0>\" is the last token in the vocabulary like in T5 preprocessing\n                see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        extra_ids=100,\n        additional_special_tokens=None,\n        **kwargs\n    ):\n        # Add extra_ids to the special token list\n        if extra_ids > 0:\n            if additional_special_tokens is None:\n                additional_special_tokens = []\n            additional_special_tokens.extend([\"<extra_id_{}>\".format(i) for i in range(extra_ids)])\n\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self._extra_ids = extra_ids\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size() + self._extra_ids\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) 
for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if token.startswith(\"<extra_id_\"):\n            match = re.match(r\"<extra_id_(\\d+)>\", token)\n            num = int(match.group(1))\n            return self.vocab_size - num - 1\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        else:\n            token = \"<extra_id_{}>\".format(self.vocab_size - 1 - index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport glob\nimport logging\nimport os\nimport pickle\nimport re\nfrom collections import Counter, OrderedDict\nfrom typing import Optional\n\nimport numpy as np\nfrom tokenizers import Tokenizer\nfrom tokenizers.implementations import BaseTokenizer\nfrom tokenizers.models import WordLevel\nfrom tokenizers.normalizers import Lowercase, Sequence, Strip, unicode_normalizer_from_str\nfrom tokenizers.pre_tokenizers import CharDelimiterSplit, WhitespaceSplit\nfrom tokenizers.processors import BertProcessing\n\nfrom .file_utils import cached_path, is_torch_available\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nif is_torch_available():\n    import torch\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"pretrained_vocab_file\": \"vocab.bin\", \"vocab_file\": \"vocab.txt\"}\nVOCAB_FILES_NAMES_FAST = {\"pretrained_vocab_file\": \"vocab.json\", \"vocab_file\": \"vocab.json\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin\",\n    }\n}\n\nPRETRAINED_VOCAB_FILES_MAP_FAST = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.json\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"transfo-xl-wt103\": None,\n}\n\nPRETRAINED_CORPUS_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin\",\n}\nCORPUS_NAME = \"corpus.bin\"\n\n\nclass TransfoXLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token, eos_token=eos_token, additional_special_tokens=additional_special_tokens, **kwargs\n        )\n\n        if never_split is None:\n            never_split = self.all_special_tokens\n        if special is None:\n            special = []\n        self.counter = Counter()\n        self.special = special\n        self.min_freq = min_freq\n        self.max_size = max_size\n        self.lower_case = lower_case\n        self.delimiter = delimiter\n        self.vocab_file = vocab_file\n        self.never_split = never_split\n        self.punctuation_symbols = '!\"#$%&()*+,-./\\:;<=>?@[\\\\]^_`{|}~'  # noqa: W605\n        self.punction_without_space_before_pattern = re.compile(r\"[^\\s][{}]\".format(self.punctuation_symbols))\n        self.punctuation_with_space_around_pattern = self._compile_space_around_punctuation_pattern()\n\n        try:\n            if pretrained_vocab_file is not None:\n                # Hack because, honestly this tokenizer was not made to be used\n                # in a library like ours, at all.\n                vocab_dict = torch.load(pretrained_vocab_file)\n                for key, value in vocab_dict.items():\n                    if key not in self.__dict__:\n                        self.__dict__[key] = value\n\n            if vocab_file is not None:\n                self.build_vocab()\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizerFast,\"\n                \"please note they are not compatible.\".format(pretrained_vocab_file)\n            )\n\n        if vocab_file is not None:\n            self.build_vocab()\n\n    def _compile_space_around_punctuation_pattern(self):\n        look_ahead_for_special_token = \"(?=[{}])\".format(self.punctuation_symbols)\n        look_ahead_to_match_all_except_space = \"(?=[^\\s])\"  # noqa: W605\n        return re.compile(r\"\" + look_ahead_for_special_token + look_ahead_to_match_all_except_space)\n\n    def count_file(self, path, verbose=False, add_eos=False):\n        if verbose:\n            logger.info(\"counting file {} ...\".format(path))\n        assert os.path.exists(path)\n\n        sents = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos)\n                self.counter.update(symbols)\n                sents.append(symbols)\n\n        return sents\n\n    def count_sents(self, sents, verbose=False):\n        \"\"\"\n            sents : a list of sentences, each a list of tokenized symbols\n        \"\"\"\n        if verbose:\n            logger.info(\"counting {} sents ...\".format(len(sents)))\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            self.counter.update(symbols)\n\n    def _build_from_file(self, vocab_file):\n        self.idx2sym = []\n        self.sym2idx = OrderedDict()\n\n        with open(vocab_file, \"r\", encoding=\"utf-8\") as f:\n            for line in f:\n                symb = line.strip().split()[0]\n                self.add_symbol(symb)\n        if \"<UNK>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<UNK>\"]\n        elif \"<unk>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<unk>\"]\n        else:\n            raise ValueError(\"No <unkown> token in vocabulary\")\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n\n        logger.warning(\n            \"Please note you will not be able to load the save vocabulary in\"\n            \" Rust-based TransfoXLTokenizerFast as they don't share the same structure.\"\n        )\n\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"pretrained_vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        torch.save(self.__dict__, vocab_file)\n        return (vocab_file,)\n\n    def build_vocab(self):\n        if self.vocab_file:\n            logger.info(\"building vocab from {}\".format(self.vocab_file))\n            self._build_from_file(self.vocab_file)\n            logger.info(\"final vocab size {}\".format(len(self)))\n        else:\n            logger.info(\"building vocab with min_freq={}, max_size={}\".format(self.min_freq, self.max_size))\n            self.idx2sym = []\n            self.sym2idx = OrderedDict()\n\n            for sym in self.special:\n                
self.add_special(sym)\n\n            for sym, cnt in self.counter.most_common(self.max_size):\n                if cnt < self.min_freq:\n                    break\n                self.add_symbol(sym)\n\n            logger.info(\"final vocab size {} from {} unique tokens\".format(len(self), len(self.counter)))\n\n    def encode_file(self, path, ordered=False, verbose=False, add_eos=True, add_double_eos=False):\n        if verbose:\n            logger.info(\"encoding file {} ...\".format(path))\n        assert os.path.exists(path)\n        encoded = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos, add_double_eos=add_double_eos)\n                encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def encode_sents(self, sents, ordered=False, verbose=False):\n        if verbose:\n            logger.info(\"encoding {} sents ...\".format(len(sents)))\n        encoded = []\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def add_special(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n            setattr(self, \"{}_idx\".format(sym.strip(\"<>\")), self.sym2idx[sym])\n\n    def add_symbol(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n\n    def _convert_id_to_token(self, idx):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        assert 0 <= idx < len(self), \"Index {} out of vocabulary range\".format(idx)\n        return self.idx2sym[idx]\n\n    def _convert_token_to_id(self, sym):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if sym in self.sym2idx:\n            return self.sym2idx[sym]\n        else:\n            # logger.info('encounter unk {}'.format(sym))\n            # assert '<eos>' not in sym\n            if hasattr(self, \"unk_idx\"):\n                return self.sym2idx.get(sym, self.unk_idx)\n            # Backward compatibility with pre-trained models\n            elif \"<unk>\" in self.sym2idx:\n                return self.sym2idx[\"<unk>\"]\n            elif \"<UNK>\" in self.sym2idx:\n                return self.sym2idx[\"<UNK>\"]\n            else:\n                raise ValueError(\"Token not in vocabulary and no <unk> token in vocabulary for replacement\")\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = \" \".join(tokens).strip()\n        return out_string\n\n    def convert_to_tensor(self, symbols):\n        return torch.LongTensor(self.convert_tokens_to_ids(symbols))\n\n    @property\n    def vocab_size(self):\n        return len(self.idx2sym)\n\n    def get_vocab(self):\n        return dict(self.sym2idx, **self.added_tokens_encoder)\n\n    def _tokenize(self, line, add_eos=False, add_double_eos=False):\n        line = line.strip()\n        # convert to lower case\n        if self.lower_case:\n            line = line.lower()\n\n        # empty delimiter '' will evaluate False\n        if self.delimiter == \"\":\n            symbols = line\n        else:\n            symbols = line.split(self.delimiter)\n\n        if add_double_eos:  # lm1b\n            return [\"<S>\"] + symbols + [\"<S>\"]\n        elif add_eos:\n            return symbols + [\"<eos>\"]\n        else:\n            return symbols\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        # add spaces before punctuation symbols as should be done in transfo-xl\n\n        if \"add_space_before_punct_symbol\" in kwargs and kwargs[\"add_space_before_punct_symbol\"]:\n            text = self.punctuation_with_space_around_pattern.sub(r\" \", text)\n        elif self.punction_without_space_before_pattern.search(text):\n            # searches until the first occurence of a punctuation symbol without surrounding spaces\n            logger.warning(\n                \"You might want to consider setting `add_space_before_punct_symbol=True` as an argument to the `tokenizer.encode()` to avoid tokenizing words with punctuation symbols to the `<unk>` token\"\n            )\n\n        return text\n\n\nclass _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):\n    def __init__(\n        self,\n        vocab_file,\n        delimiter,\n        lowercase,\n        unk_token,\n        eos_token,\n        add_eos=False,\n        add_double_eos=False,\n        normalization: Optional[str] = None,\n    ):\n\n        try:\n            tokenizer = WordLevel(vocab_file, unk_token=unk_token)\n            tokenizer = Tokenizer(tokenizer)\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizer,\"\n                \"please note they are not compatible.\".format(vocab_file)\n            )\n\n        # Create the correct normalization path\n        normalizer = []\n\n        # Include unicode normalization\n        if normalization:\n            normalizer += [unicode_normalizer_from_str(normalization)]\n\n        # Include case normalization\n        if lowercase:\n            normalizer += [Lowercase()]\n\n        # Strip normalizer at the end\n        normalizer += [Strip(left=True, right=True)]\n\n        if len(normalizer) > 0:\n            tokenizer.normalizer = Sequence(normalizer) if len(normalizer) > 1 else normalizer[0]\n\n        # Setup the splitter\n        tokenizer.pre_tokenizer = CharDelimiterSplit(delimiter) if delimiter else WhitespaceSplit()\n\n        if add_double_eos:\n            tokenizer.post_processor = BertProcessing(\n                (eos_token, tokenizer.token_to_id(eos_token)), (eos_token, tokenizer.token_to_id(eos_token))\n            )\n\n        parameters = {\n            \"model\": \"TransfoXLModel\",\n            \"add_eos\": add_eos,\n            \"add_double_eos\": add_double_eos,\n            \"unk_token\": unk_token,\n            \"eos_token\": eos_token,\n            \"delimiter\": delimiter,\n            \"lowercase\": lowercase,\n        }\n\n        super().__init__(tokenizer, parameters)\n\n\nclass TransfoXLTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" Transformer-XL tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    The Transformer-XL tokenizer is a word-level tokenizer (no sub-word tokenization).\n\n    Adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES_FAST\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP_FAST\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        add_eos=False,\n        add_double_eos=False,\n        normalization=None,\n        **kwargs\n    ):\n\n        super().__init__(\n            _TransfoXLDelimiterLookupTokenizer(\n                vocab_file=vocab_file or pretrained_vocab_file,\n                delimiter=delimiter,\n                lowercase=lower_case,\n                unk_token=unk_token,\n                eos_token=eos_token,\n                add_eos=add_eos,\n                add_double_eos=add_double_eos,\n                normalization=normalization,\n            ),\n            unk_token=unk_token,\n            eos_token=eos_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n    def save_pretrained(self, save_directory):\n        logger.warning(\n            \"Please note you will not be able to load the vocabulary in\"\n            \" Python-based TransfoXLTokenizer as they don't share the same structure.\"\n        )\n\n        return super().save_pretrained(save_directory)\n\n\nclass LMOrderedIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None):\n        \"\"\"\n            data -- LongTensor -- the LongTensor is strictly ordered\n        \"\"\"\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n\n        # Work out how cleanly we can divide the dataset into bsz parts.\n        self.n_step = data.size(0) // bsz\n\n        # Trim off any extra elements that wouldn't cleanly fit (remainders).\n        data = data.narrow(0, 0, self.n_step * bsz)\n\n        # Evenly divide the data across the bsz batches.\n        self.data = data.view(bsz, -1).t().contiguous().to(device)\n\n        # Number of mini-batches\n        self.n_batch = (self.n_step + self.bptt - 1) // self.bptt\n\n    def get_batch(self, i, bptt=None):\n        if bptt is None:\n            bptt = self.bptt\n        seq_len = min(bptt, self.data.size(0) - 1 - i)\n\n        end_idx = i + seq_len\n        beg_idx = max(0, i - self.ext_len)\n\n        data = self.data[beg_idx:end_idx]\n        target = self.data[i + 1 : i + 1 + seq_len]\n\n        data_out = data.transpose(0, 1).contiguous().to(self.device)\n        target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n        return data_out, target_out, seq_len\n\n    def get_fixlen_iter(self, start=0):\n        for i in range(start, self.data.size(0) - 1, self.bptt):\n            yield self.get_batch(i)\n\n    def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):\n        max_len = self.bptt + max_deviation * std\n        i = start\n        while True:\n            bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.0\n            bptt = min(max_len, max(min_len, int(np.random.normal(bptt, std))))\n            data, 
target, seq_len = self.get_batch(i, bptt)\n            i += seq_len\n            yield data, target, seq_len\n            if i >= self.data.size(0) - 2:\n                break\n\n    def __iter__(self):\n        return self.get_fixlen_iter()\n\n\nclass LMShuffledIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n        \"\"\"\n            data -- list[LongTensor] -- there is no order among the LongTensors\n        \"\"\"\n        self.data = data\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self):\n        # index iterator\n        epoch_indices = np.random.permutation(len(self.data)) if self.shuffle else np.array(range(len(self.data)))\n\n        # sentence iterator\n        for idx in epoch_indices:\n            yield self.data[idx]\n\n    def stream_iterator(self, sent_stream):\n        # streams for each data in the batch\n        streams = [None] * self.bsz\n\n        data = torch.LongTensor(self.bptt, self.bsz)\n        target = torch.LongTensor(self.bptt, self.bsz)\n\n        n_retain = 0\n\n        while True:\n            # data   : [n_retain+bptt x bsz]\n            # target : [bptt x bsz]\n            data[n_retain:].fill_(-1)\n            target.fill_(-1)\n\n            valid_batch = True\n\n            for i in range(self.bsz):\n                n_filled = 0\n                try:\n                    while n_filled < self.bptt:\n                        if streams[i] is None or len(streams[i]) <= 1:\n                            streams[i] = next(sent_stream)\n                        # number of new tokens to fill in\n                        n_new = min(len(streams[i]) - 1, self.bptt - n_filled)\n                        # first n_retain tokens are retained from last batch\n                        data[n_retain + n_filled : n_retain + n_filled + n_new, i] = streams[i][:n_new]\n                        target[n_filled : n_filled + n_new, i] = streams[i][1 : n_new + 1]\n                        streams[i] = streams[i][n_new:]\n                        n_filled += n_new\n                except StopIteration:\n                    valid_batch = False\n                    break\n\n            if not valid_batch:\n                return\n\n            data_out = data.transpose(0, 1).contiguous().to(self.device)\n            target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n            yield data_out, target_out, self.bptt\n\n            n_retain = min(data.size(0), self.ext_len)\n            if n_retain > 0:\n                data[:n_retain] = data[-n_retain:]\n            data.resize_(n_retain + self.bptt, data.size(1))\n\n    def __iter__(self):\n        # sent_stream is an iterator\n        sent_stream = self.get_sent_stream()\n\n        for batch in self.stream_iterator(sent_stream):\n            yield batch\n\n\nclass LMMultiFileIterator(LMShuffledIterator):\n    def __init__(self, paths, vocab, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n\n        self.paths = paths\n        self.vocab = vocab\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self, path):\n        sents = self.vocab.encode_file(path, add_double_eos=True)\n        if self.shuffle:\n            np.random.shuffle(sents)\n      
  sent_stream = iter(sents)\n\n        return sent_stream\n\n    def __iter__(self):\n        if self.shuffle:\n            np.random.shuffle(self.paths)\n\n        for path in self.paths:\n            # sent_stream is an iterator\n            sent_stream = self.get_sent_stream(path)\n            for batch in self.stream_iterator(sent_stream):\n                yield batch\n\n\nclass TransfoXLCorpus(object):\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):\n        \"\"\"\n        Instantiate a pre-processed corpus.\n        \"\"\"\n        vocab = TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n        if pretrained_model_name_or_path in PRETRAINED_CORPUS_ARCHIVE_MAP:\n            corpus_file = PRETRAINED_CORPUS_ARCHIVE_MAP[pretrained_model_name_or_path]\n        else:\n            corpus_file = os.path.join(pretrained_model_name_or_path, CORPUS_NAME)\n        # redirect to the cache, if necessary\n        try:\n            resolved_corpus_file = cached_path(corpus_file, cache_dir=cache_dir)\n        except EnvironmentError:\n            logger.error(\n                \"Corpus '{}' was not found in corpus list ({}). \"\n                \"We assumed '{}' was a path or url but couldn't find files {} \"\n                \"at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(PRETRAINED_CORPUS_ARCHIVE_MAP.keys()),\n                    pretrained_model_name_or_path,\n                    corpus_file,\n                )\n            )\n            return None\n        if resolved_corpus_file == corpus_file:\n            logger.info(\"loading corpus file {}\".format(corpus_file))\n        else:\n            logger.info(\"loading corpus file {} from cache at {}\".format(corpus_file, resolved_corpus_file))\n\n        # Instantiate tokenizer.\n        corpus = cls(*inputs, **kwargs)\n        corpus_dict = torch.load(resolved_corpus_file)\n        for key, value in corpus_dict.items():\n            corpus.__dict__[key] = value\n        corpus.vocab = vocab\n        if corpus.train is not None:\n            corpus.train = torch.tensor(corpus.train, dtype=torch.long)\n        if corpus.valid is not None:\n            corpus.valid = torch.tensor(corpus.valid, dtype=torch.long)\n        if corpus.test is not None:\n            corpus.test = torch.tensor(corpus.test, dtype=torch.long)\n        return corpus\n\n    def __init__(self, *args, **kwargs):\n        self.vocab = TransfoXLTokenizer(*args, **kwargs)\n        self.dataset = None\n        self.train = None\n        self.valid = None\n        self.test = None\n\n    def build_corpus(self, path, dataset):\n        self.dataset = dataset\n\n        if self.dataset in [\"ptb\", \"wt2\", \"enwik8\", \"text8\"]:\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n            self.vocab.count_file(os.path.join(path, \"valid.txt\"))\n            self.vocab.count_file(os.path.join(path, \"test.txt\"))\n        elif self.dataset == \"wt103\":\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n        elif self.dataset == \"lm1b\":\n            train_path_pattern = os.path.join(\n                path,\n                \"1-billion-word-language-modeling-benchmark-r13output\",\n                \"training-monolingual.tokenized.shuffled\",\n                \"news.en-*\",\n            )\n            train_paths = glob.glob(train_path_pattern)\n            # the 
vocab will load from file when build_vocab() is called\n\n        self.vocab.build_vocab()\n\n        if self.dataset in [\"ptb\", \"wt2\", \"wt103\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True)\n        elif self.dataset in [\"enwik8\", \"text8\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True, add_eos=False)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True, add_eos=False)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True, add_eos=False)\n        elif self.dataset == \"lm1b\":\n            self.train = train_paths\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=False, add_double_eos=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=False, add_double_eos=True)\n\n    def get_iterator(self, split, *args, **kwargs):\n        if split == \"train\":\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(self.train, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                kwargs[\"shuffle\"] = True\n                data_iter = LMMultiFileIterator(self.train, self.vocab, *args, **kwargs)\n        elif split in [\"valid\", \"test\"]:\n            data = self.valid if split == \"valid\" else self.test\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(data, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                data_iter = LMShuffledIterator(data, *args, **kwargs)\n\n        return data_iter\n\n\ndef get_lm_corpus(datadir, dataset):\n    fn = os.path.join(datadir, \"cache.pt\")\n    fn_pickle = os.path.join(datadir, \"cache.pkl\")\n    if os.path.exists(fn):\n        logger.info(\"Loading cached dataset...\")\n        corpus = torch.load(fn_pickle)\n    elif os.path.exists(fn):\n        logger.info(\"Loading cached dataset from pickle...\")\n        with open(fn, \"rb\") as fp:\n            corpus = pickle.load(fp)\n    else:\n        logger.info(\"Producing dataset {}...\".format(dataset))\n        kwargs = {}\n        if dataset in [\"wt103\", \"wt2\"]:\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = False\n        elif dataset == \"ptb\":\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = True\n        elif dataset == \"lm1b\":\n            kwargs[\"special\"] = []\n            kwargs[\"lower_case\"] = False\n            kwargs[\"vocab_file\"] = os.path.join(datadir, \"1b_word_vocab.txt\")\n        elif dataset in [\"enwik8\", \"text8\"]:\n            pass\n\n        corpus = TransfoXLCorpus(datadir, dataset, **kwargs)\n        torch.save(corpus, fn)\n\n    return corpus\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_utils.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for python and fast tokenizers. Fast tokenizers are provided by HuggingFace's tokenizers library.\"\"\"\n\nimport copy\nimport functools\nimport itertools\nimport json\nimport logging\nimport operator\nimport os\nimport re\nimport warnings\nfrom collections import UserDict, defaultdict\nfrom contextlib import contextmanager\nfrom typing import Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union\n\nfrom tokenizers import AddedToken as AddedTokenFast\nfrom tokenizers import Encoding as EncodingFast\nfrom tokenizers.decoders import Decoder as DecoderFast\nfrom tokenizers.implementations import BaseTokenizer as BaseTokenizerFast\n\nfrom .file_utils import cached_path, hf_bucket_url, is_remote_url, is_tf_available, is_torch_available, torch_required\n\n\nif is_tf_available():\n    import tensorflow as tf\nif is_torch_available():\n    import torch\n\nlogger = logging.getLogger(__name__)\n\nSPECIAL_TOKENS_MAP_FILE = \"special_tokens_map.json\"\nADDED_TOKENS_FILE = \"added_tokens.json\"\nTOKENIZER_CONFIG_FILE = \"tokenizer_config.json\"\n\nVERY_LARGE_INTEGER = int(1e30)  # This is used to set the max input length for a model with infinite size input\nLARGE_INTEGER = int(1e20)  # This is used when we need something big but slightly smaller than VERY_LARGE_INTEGER\n\n# Define type aliases and NamedTuples\nTextInput = str\nPreTokenizedInput = List[str]\nEncodedInput = List[int]\nTextInputPair = Tuple[str, str]\nPreTokenizedInputPair = Tuple[List[str], List[str]]\nEncodedInputPair = Tuple[List[int], List[int]]\n\n\nclass CharSpan(NamedTuple):\n    \"\"\" Character span in the original string\n\n        Args:\n            start: index of the first character in the original string\n            end: index of the character following the last character in the original string\n    \"\"\"\n\n    start: int\n    end: int\n\n\nclass TokenSpan(NamedTuple):\n    \"\"\" Token span in an encoded string (list of tokens)\n\n        Args:\n            start: index of the first token in the span\n            end: index of the token following the last token in the span\n    \"\"\"\n\n    start: int\n    end: int\n\n\ndef flatten(x: Sequence):\n    \"\"\"\n    Flatten the provided (potentially nested) sequence\n\n    Args:\n        x (Sequence): Potentially nested sequence to flatten\n\n    Returns:\n        list: Flattened sequence\n    \"\"\"\n\n    return functools.reduce(operator.iconcat, x, [])\n\n\n@contextmanager\ndef truncate_and_pad(\n    tokenizer: BaseTokenizerFast,\n    max_length: int,\n    stride: int,\n    strategy: str,\n    pad_to_max_length: bool,\n    padding_side: str,\n    pad_token_id: int,\n    pad_token_type_id: int,\n    pad_token: str,\n):\n    \"\"\" This contextmanager is in charge of defining the truncation and the padding strategies for fast tokenizers\n        (provided by HuggingFace tokenizers library) and restore the 
tokenizer settings afterwards.\n\n        This contextmanager assumes the provider tokenizer has no padding / truncation strategy\n        before the managed section. If your tokenizer set a padding / truncation strategy before,\n        then it will be reset to no padding/truncation when exiting the managed section.\n\n        Args:\n            tokenizer (BaseTokenizerFast): The tokenizer which will be used\n            max_length (int): The maximum size of the sequence\n            stride (int): The stride to use when handling overflow\n            strategy (str): Overflowing logic to use\n            pad_to_max_length (bool): Boolean indicating if the output needs to be padded up to max_length\n            padding_side (str): \"left\" or \"right\" indicating the direction the output sequence will be padded\n            pad_token_id (int): The integer representation of the padding token to use\n            pad_token_type_id (int): The integer representation of the padding token type to use\n            pad_token (str): The string representation of the padding token to use\n\n    \"\"\"\n\n    # Handle all the truncation and padding stuff\n    if max_length is not None:\n        tokenizer.enable_truncation(max_length, stride=stride, strategy=strategy)\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.enable_padding(\n            max_length=max_length,\n            direction=padding_side,\n            pad_id=pad_token_id,\n            pad_type_id=pad_token_type_id,\n            pad_token=pad_token,\n        )\n    elif pad_to_max_length:\n        logger.warning(\n            \"Disabled padding because no padding token set (pad_token: {}, pad_token_id: {}).\\n\"\n            \"To remove this error, you can add a new pad token and then resize model embedding:\\n\"\n            \"\\ttokenizer.pad_token = '<PAD>'\\n\\tmodel.resize_token_embeddings(len(tokenizer))\".format(\n                pad_token, pad_token_id\n            )\n        )\n\n    yield\n\n    # TODO(morgan, anthony): once we have a simple way to serialize tokenizers maybe store and restore the state afterward\n    # to avoid destructing the padding / truncation strategy as we do now.\n\n    if max_length is not None:\n        tokenizer.no_truncation()\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.no_padding()\n\n\nclass BatchEncoding(UserDict):\n    \"\"\" BatchEncoding hold the output of the encode and batch_encode methods (tokens, attention_masks, etc).\n        This class is derived from a python Dictionary and can be used as a dictionnary.\n        In addition, this class expose utility methods to map from word/char space to token space.\n\n        Args:\n            data (:obj:`dict`): Dictionary of lists/arrays returned by the encode/batch_encode methods ('input_ids', 'attention_mask'...)\n            encoding (:obj:`EncodingFast`, :obj:`list(EncodingFast)`, `optional`, defaults to :obj:`None`):\n                If the tokenizer is a fast tokenizer which outputs additional informations like mapping from word/char space to token space\n                the `EncodingFast` instance or list of instance (for batches) hold these informations.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        data: Optional[Dict[str, Any]] = None,\n        encoding: Optional[Union[EncodingFast, Sequence[EncodingFast]]] = None,\n    ):\n        super().__init__(data)\n\n        if isinstance(encoding, EncodingFast):\n            encoding = [encoding]\n\n        
self._encodings = encoding\n\n    def __getitem__(self, item: Union[int, str]) -> EncodingFast:\n        \"\"\" If the key is a string, get the value of the dict associated to `key` ('input_ids', 'attention_mask'...)\n            If the key is an integer, get the EncodingFast for batch item with index `key`\n        \"\"\"\n        if isinstance(item, str):\n            return self.data[item]\n        elif self._encodings is not None:\n            return self._encodings[item]\n        else:\n            raise KeyError(\n                \"Indexing with integers (to access backend Encoding for a given batch index) \"\n                \"is not available when using Python based tokenizers\"\n            )\n\n    def __getattr__(self, item: str):\n        return self.data[item]\n\n    def keys(self):\n        return self.data.keys()\n\n    def values(self):\n        return self.data.values()\n\n    def items(self):\n        return self.data.items()\n\n    # After this point:\n    # Extended properties and methods only available for fast (Rust-based) tokenizers\n    # provided by HuggingFace tokenizers library.\n\n    @property\n    def encodings(self) -> Optional[List[EncodingFast]]:\n        \"\"\"\n        Return the list all encoding from the tokenization process\n\n        Returns: List[EncodingFast] or None if input was tokenized through Python (i.e. not fast) tokenizer\n        \"\"\"\n        return self._encodings\n\n    def tokens(self, batch_index: int = 0) -> List[int]:\n        if not self._encodings:\n            raise ValueError(\"tokens() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].tokens\n\n    def words(self, batch_index: int = 0) -> List[Optional[int]]:\n        if not self._encodings:\n            raise ValueError(\"words() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].words\n\n    def token_to_word(self, batch_or_token_index: int, token_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the word corresponding (i.e. comprising) to an encoded token\n            in a sequence of the batch.\n\n            Can be called as:\n                - self.token_to_word(token_index) if batch size is 1\n                - self.token_to_word(batch_index, token_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token in the sequence.\n\n        Returns:\n            word_index (:obj:`int`):\n                index of the word in the input sequence.\n\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_word() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if token_index < 0:\n            token_index = self._seq_len + token_index\n        return self._encodings[batch_index].token_to_word(token_index)\n\n    def word_to_tokens(self, batch_or_word_index: int, word_index: Optional[int] = None) -> TokenSpan:\n        \"\"\" Get the encoded token span corresponding to a word in the sequence of the batch.\n\n            Token spans are returned as a TokenSpan NamedTuple with:\n                start: index of the first token\n                end: index of the token following the last token\n\n            Can be called as:\n                - self.word_to_tokens(word_index) if batch size is 1\n                - self.word_to_tokens(batch_index, word_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprises one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            token_span (:obj:`TokenSpan`):\n                Span of tokens in the encoded sequence.\n\n                TokenSpan are NamedTuple with:\n                    start: index of the first token\n                    end: index of the token following the last token\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_tokens() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if word_index < 0:\n            word_index = self._seq_len + word_index\n        return TokenSpan(*(self._encodings[batch_index].word_to_tokens(word_index)))\n\n    def token_to_chars(self, batch_or_token_index: int, token_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span corresponding to an encoded token in a sequence of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string associated to the token\n                end: index of the character following the last character in the original string associated to the token\n\n            Can be called as:\n                - self.token_to_chars(token_index) if batch size is 1\n                - self.token_to_chars(batch_index, token_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token or tokens in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan`):\n                Span of characters in the original string.\n\n                CharSpan are NamedTuple with:\n                    start: index of the first character in the original string\n                    end: index of the character following the last character in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_chars() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        return CharSpan(*(self._encodings[batch_index].token_to_chars(token_index)))\n\n    def char_to_token(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the token in the encoded output comprising a character\n            in the original string for a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_token(char_index) if batch size is 1\n                - self.char_to_token(batch_index, char_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n\n        Returns:\n            token_index (:obj:`int`):\n                Index of the token.\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_token() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_token(char_index)\n\n    def word_to_chars(self, batch_or_word_index: int, word_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span in the original string corresponding to given word in a sequence\n            of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string\n                end: index of the character following the last character in the original string\n\n            Can be called as:\n                - self.word_to_chars(word_index) if batch size is 1\n                - self.word_to_chars(batch_index, word_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan` or :obj:`List[CharSpan]`):\n                Span(s) of the associated character or characters in the string.\n                CharSpan are NamedTuple with:\n                    start: index of the first character associated to the token in the original string\n                    end: index of the character following the last character associated to the token in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_chars() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index)))\n\n    def char_to_word(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the word in the original string corresponding to a character in the original string of\n            a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_word(char_index) if batch size is 1\n                - self.char_to_word(batch_index, char_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the character in the orginal string.\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the character in the orginal string.\n\n\n        Returns:\n            token_index (:obj:`int` or :obj:`List[int]`):\n                Index or indices of the associated encoded token(s).\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_word() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_word(char_index)\n\n    @torch_required\n    def to(self, device: str):\n        \"\"\"Send all values to device by calling v.to(device)\"\"\"\n        self.data = {k: v.to(device) for k, v in self.data.items()}\n        return self\n\n\nclass SpecialTokensMixin:\n    \"\"\" SpecialTokensMixin is derived by ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` and\n        handles specific behaviors related to special tokens. 
In particular, this class holds the\n        attributes which can be used to directly access these special tokens in a\n        model-independent manner and allows setting and updating the special tokens.\n    \"\"\"\n\n    SPECIAL_TOKENS_ATTRIBUTES = [\n        \"bos_token\",\n        \"eos_token\",\n        \"unk_token\",\n        \"sep_token\",\n        \"pad_token\",\n        \"cls_token\",\n        \"mask_token\",\n        \"additional_special_tokens\",\n    ]\n\n    def __init__(self, **kwargs):\n        self._bos_token = None\n        self._eos_token = None\n        self._unk_token = None\n        self._sep_token = None\n        self._pad_token = None\n        self._cls_token = None\n        self._mask_token = None\n        self._pad_token_type_id = 0\n        self._additional_special_tokens = []\n\n        for key, value in kwargs.items():\n            if key in self.SPECIAL_TOKENS_ATTRIBUTES:\n                if key == \"additional_special_tokens\":\n                    assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                    setattr(self, key, value)\n                elif isinstance(value, AddedTokenFast):\n                    setattr(self, key, str(value))\n                elif isinstance(value, str):\n                    setattr(self, key, value)\n                else:\n                    raise TypeError(\n                        \"special token {} has to be either str or AddedTokenFast but got: {}\".format(key, type(value))\n                    )\n\n    @property\n    def bos_token(self):\n        \"\"\" Beginning of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._bos_token is None:\n            logger.error(\"Using bos_token, but it is not set yet.\")\n        return self._bos_token\n\n    @property\n    def eos_token(self):\n        \"\"\" End of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._eos_token is None:\n            logger.error(\"Using eos_token, but it is not set yet.\")\n        return self._eos_token\n\n    @property\n    def unk_token(self):\n        \"\"\" Unknown token (string). Log an error if used while not having been set. \"\"\"\n        if self._unk_token is None:\n            logger.error(\"Using unk_token, but it is not set yet.\")\n        return self._unk_token\n\n    @property\n    def sep_token(self):\n        \"\"\" Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        if self._sep_token is None:\n            logger.error(\"Using sep_token, but it is not set yet.\")\n        return self._sep_token\n\n    @property\n    def pad_token(self):\n        \"\"\" Padding token (string). Log an error if used while not having been set. \"\"\"\n        if self._pad_token is None:\n            logger.error(\"Using pad_token, but it is not set yet.\")\n        return self._pad_token\n\n    @property\n    def cls_token(self):\n        \"\"\" Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. \"\"\"\n        if self._cls_token is None:\n            logger.error(\"Using cls_token, but it is not set yet.\")\n        return self._cls_token\n\n    @property\n    def mask_token(self):\n        \"\"\" Mask token (string). E.g. when training a model with masked-language modeling. 
Log an error if used while not having been set. \"\"\"\n        if self._mask_token is None:\n            logger.error(\"Using mask_token, but it is not set yet.\")\n        return self._mask_token\n\n    @property\n    def additional_special_tokens(self):\n        \"\"\" All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. \"\"\"\n        if self._additional_special_tokens is None:\n            logger.error(\"Using additional_special_tokens, but it is not set yet.\")\n        return self._additional_special_tokens\n\n    def _maybe_update_backend(self, value):\n        \"\"\" To be overriden by derived class if a backend tokenizer has to be updated. \"\"\"\n        pass\n\n    @bos_token.setter\n    def bos_token(self, value):\n        self._bos_token = value\n        self._maybe_update_backend([value])\n\n    @eos_token.setter\n    def eos_token(self, value):\n        self._eos_token = value\n        self._maybe_update_backend([value])\n\n    @unk_token.setter\n    def unk_token(self, value):\n        self._unk_token = value\n        self._maybe_update_backend([value])\n\n    @sep_token.setter\n    def sep_token(self, value):\n        self._sep_token = value\n        self._maybe_update_backend([value])\n\n    @pad_token.setter\n    def pad_token(self, value):\n        self._pad_token = value\n        self._maybe_update_backend([value])\n\n    @cls_token.setter\n    def cls_token(self, value):\n        self._cls_token = value\n        self._maybe_update_backend([value])\n\n    @mask_token.setter\n    def mask_token(self, value):\n        self._mask_token = value\n        self._maybe_update_backend([value])\n\n    @additional_special_tokens.setter\n    def additional_special_tokens(self, value):\n        self._additional_special_tokens = value\n        self._maybe_update_backend(value)\n\n    @property\n    def bos_token_id(self):\n        \"\"\" Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.bos_token)\n\n    @property\n    def eos_token_id(self):\n        \"\"\" Id of the end of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.eos_token)\n\n    @property\n    def unk_token_id(self):\n        \"\"\" Id of the unknown token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.unk_token)\n\n    @property\n    def sep_token_id(self):\n        \"\"\" Id of the separation token in the vocabulary. E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.sep_token)\n\n    @property\n    def pad_token_id(self):\n        \"\"\" Id of the padding token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.pad_token)\n\n    @property\n    def pad_token_type_id(self):\n        \"\"\" Id of the padding token type in the vocabulary.\"\"\"\n        return self._pad_token_type_id\n\n    @property\n    def cls_token_id(self):\n        \"\"\" Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. 
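Equivalent to ``self.convert_tokens_to_ids(self.cls_token)``. 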
\"\"\"\n        return self.convert_tokens_to_ids(self.cls_token)\n\n    @property\n    def mask_token_id(self):\n        \"\"\" Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.mask_token)\n\n    @property\n    def additional_special_tokens_ids(self):\n        \"\"\" Ids of all the additional special tokens in the vocabulary (list of integers). Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.additional_special_tokens)\n\n    @property\n    def special_tokens_map(self):\n        \"\"\" A dictionary mapping special token class attribute (cls_token, unk_token...) to their\n            values ('<unk>', '<cls>'...)\n        \"\"\"\n        set_attr = {}\n        for attr in self.SPECIAL_TOKENS_ATTRIBUTES:\n            attr_value = getattr(self, \"_\" + attr)\n            if attr_value:\n                set_attr[attr] = attr_value\n        return set_attr\n\n    @property\n    def all_special_tokens(self):\n        \"\"\" List all the special tokens ('<unk>', '<cls>'...) mapped to class attributes\n            (cls_token, unk_token...).\n        \"\"\"\n        all_toks = []\n        set_attr = self.special_tokens_map\n        for attr_value in set_attr.values():\n            all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])\n        all_toks = list(set(all_toks))\n        return all_toks\n\n    @property\n    def all_special_ids(self):\n        \"\"\" List the vocabulary indices of the special tokens ('<unk>', '<cls>'...) mapped to\n            class attributes (cls_token, unk_token...).\n        \"\"\"\n        all_toks = self.all_special_tokens\n        all_ids = self.convert_tokens_to_ids(all_toks)\n        return all_ids\n\n\nclass PreTrainedTokenizer(SpecialTokensMixin):\n    \"\"\" Base class for all tokenizers.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as 
associated values, a dictionary of specific arguments to pass to the\n            ``__init__`` method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, or if no associated max_length can be\n            found in ``max_model_input_sizes``, it will default to VERY_LARGE_INTEGER (`int(1e30)`).\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensures they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    vocab_files_names: Dict[str, str] = {}\n    pretrained_vocab_files_map: Dict[str, Dict[str, str]] = {}\n    pretrained_init_configuration: Dict[str, Dict[str, Any]] = {}\n    max_model_input_sizes: Dict[str, int] = {}\n    model_input_names: List[str] = [\"token_type_ids\", \"attention_mask\"]\n\n    padding_side: str = \"right\"\n\n    NO_PAD_TOKEN_FOR_BATCH_MSG = (\n        \"No padding token is set for this model, therefore no batch can be made with uneven \"\n        \"sequences. Set a padding token or adjust the lengths of the sequences building the \"\n        \"batch so that every sequence is of the same length.\"\n    )\n\n    UNEVEN_SEQUENCES_FOR_BATCH_MSG = (\n        \"The sequences building the batch are not of the same size, no tensor \"\n        \"can be built. 
Set `pad_to_max_length=True` to pad the smaller sequences \"\n        \"up to the larger sequence's length.\"\n    )\n\n    @property\n    def vocab_size(self) -> int:\n        \"\"\" Size of the base vocabulary (without the added tokens) \"\"\"\n        raise NotImplementedError\n\n    @property\n    def is_fast(self) -> bool:\n        return False\n\n    @property\n    def max_len(self) -> int:\n        \"\"\" Kept here for backward compatibility.\n            Now renamed to `model_max_length` to avoid ambiguity.\n        \"\"\"\n        return self.model_max_length\n\n    @property\n    def max_len_single_sentence(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=False)\n\n    @property\n    def max_len_sentences_pair(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=True)\n\n    @max_len_single_sentence.setter\n    def max_len_single_sentence(self, value) -> int:\n        \"\"\" For backward compatibility, allow trying to set 'max_len_single_sentence' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=False):\n            logger.warning(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    @max_len_sentences_pair.setter\n    def max_len_sentences_pair(self, value) -> int:\n        \"\"\" For backward compatibility, allow trying to set 'max_len_sentences_pair' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=True):\n            logger.warning(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    def get_vocab(self):\n        \"\"\" Returns the vocabulary as a dict of {token: index} pairs. `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab. \"\"\"\n        raise NotImplementedError()\n\n    def __init__(self, model_max_length=None, **kwargs):\n\n        super().__init__(**kwargs)\n\n        # For backward compatibility we fall back to setting model_max_length from max_len if provided\n        if \"max_len\" in kwargs:\n            warnings.warn(\n                \"Parameter max_len is deprecated and will be removed in a future release. \"\n                \"Use model_max_length instead.\",\n                category=FutureWarning,\n            )\n\n            model_max_length = kwargs.pop(\"max_len\")\n        self.model_max_length = model_max_length if model_max_length is not None else VERY_LARGE_INTEGER\n\n        # Padding side is right by default and overridden in subclasses. 
If specified in the kwargs, it is changed.\n        self.padding_side = kwargs.pop(\"padding_side\", self.padding_side)\n        assert self.padding_side in [\n            \"right\",\n            \"left\",\n        ], f\"Padding side should be selected between 'right' and 'left', current value: {self.padding_side}\"\n        self.model_input_names = kwargs.pop(\"model_input_names\", self.model_input_names)\n\n        # Added tokens\n        self.added_tokens_encoder = {}\n        self.unique_added_tokens_encoder = set()\n        self.added_tokens_decoder = {}\n\n        # inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``)\n        self.init_inputs = ()\n        self.init_kwargs = {}\n\n    def __len__(self):\n        \"\"\" Size of the full vocabulary with the added tokens \"\"\"\n        return self.vocab_size + len(self.added_tokens_encoder)\n\n    @classmethod\n    def from_pretrained(cls, *inputs, **kwargs):\n        r\"\"\"\n        Instantiate a :class:`~transformers1.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. 
See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')\n\n            # If the tokenizer uses a single vocabulary file, you can point directly to this file\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')\n\n            # You can link tokens to special vocabulary when instantiating\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')\n            # You should be sure '<unk>' is in the vocabulary when doing that.\n            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)\n            assert tokenizer.unk_token == '<unk>'\n\n        \"\"\"\n        return cls._from_pretrained(*inputs, **kwargs)\n\n    @classmethod\n    def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        s3_models = list(cls.max_model_input_sizes.keys())\n        vocab_files = {}\n        init_configuration = {}\n        if pretrained_model_name_or_path in s3_models:\n            # Get the vocabulary from AWS S3 bucket\n            for file_id, map_list in cls.pretrained_vocab_files_map.items():\n                vocab_files[file_id] = map_list[pretrained_model_name_or_path]\n            if (\n                cls.pretrained_init_configuration\n                and pretrained_model_name_or_path in cls.pretrained_init_configuration\n            ):\n                init_configuration = cls.pretrained_init_configuration[pretrained_model_name_or_path].copy()\n        else:\n            # Get the vocabulary from local files\n            logger.info(\n                \"Model name '{}' not found in model shortcut name list ({}). 
\"\n                \"Assuming '{}' is a path, a model identifier, or url to a directory containing tokenizer files.\".format(\n                    pretrained_model_name_or_path, \", \".join(s3_models), pretrained_model_name_or_path\n                )\n            )\n\n            if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                if len(cls.vocab_files_names) > 1:\n                    raise ValueError(\n                        f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not supported.\"\n                        \"Use a model identifier or the path to a directory instead.\"\n                    )\n                logger.warning(\n                    f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is deprecated\"\n                )\n                file_id = list(cls.vocab_files_names.keys())[0]\n                vocab_files[file_id] = pretrained_model_name_or_path\n            else:\n                # At this point pretrained_model_name_or_path is either a directory or a model identifier name\n                additional_files_names = {\n                    \"added_tokens_file\": ADDED_TOKENS_FILE,\n                    \"special_tokens_map_file\": SPECIAL_TOKENS_MAP_FILE,\n                    \"tokenizer_config_file\": TOKENIZER_CONFIG_FILE,\n                }\n                # Look for the tokenizer main vocabulary files + the additional tokens files\n                for file_id, file_name in {**cls.vocab_files_names, **additional_files_names}.items():\n                    if os.path.isdir(pretrained_model_name_or_path):\n                        full_file_name = os.path.join(pretrained_model_name_or_path, file_name)\n                        if not os.path.exists(full_file_name):\n                            logger.info(\"Didn't find file {}. We won't load it.\".format(full_file_name))\n                            full_file_name = None\n                    else:\n                        full_file_name = hf_bucket_url(\n                            pretrained_model_name_or_path, filename=file_name, use_cdn=False\n                        )\n\n                    vocab_files[file_id] = full_file_name\n\n        # Get files from url, cache, or disk depending on the case\n        try:\n            resolved_vocab_files = {}\n            for file_id, file_path in vocab_files.items():\n                if file_path is None:\n                    resolved_vocab_files[file_id] = None\n                else:\n                    resolved_vocab_files[file_id] = cached_path(\n                        file_path,\n                        cache_dir=cache_dir,\n                        force_download=force_download,\n                        proxies=proxies,\n                        resume_download=resume_download,\n                        local_files_only=local_files_only,\n                    )\n        except EnvironmentError:\n            if pretrained_model_name_or_path in s3_models:\n                msg = \"Couldn't reach server at '{}' to download vocabulary files.\"\n            else:\n                msg = (\n                    \"Model name '{}' was not found in tokenizers model name list ({}). 
\"\n                    \"We assumed '{}' was a path or url to a directory containing vocabulary files \"\n                    \"named {}, but couldn't find such vocabulary files at this path or url.\".format(\n                        pretrained_model_name_or_path,\n                        \", \".join(s3_models),\n                        pretrained_model_name_or_path,\n                        list(cls.vocab_files_names.values()),\n                    )\n                )\n\n            raise EnvironmentError(msg)\n\n        if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):\n            raise EnvironmentError(\n                \"Model name '{}' was not found in tokenizers model name list ({}). \"\n                \"We assumed '{}' was a path, a model identifier, or url to a directory containing vocabulary files \"\n                \"named {} but couldn't find such vocabulary files at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(s3_models),\n                    pretrained_model_name_or_path,\n                    list(cls.vocab_files_names.values()),\n                )\n            )\n\n        for file_id, file_path in vocab_files.items():\n            if file_path == resolved_vocab_files[file_id]:\n                logger.info(\"loading file {}\".format(file_path))\n            else:\n                logger.info(\"loading file {} from cache at {}\".format(file_path, resolved_vocab_files[file_id]))\n\n        # Prepare tokenizer initialization kwargs\n        # Did we saved some inputs and kwargs to reload ?\n        tokenizer_config_file = resolved_vocab_files.pop(\"tokenizer_config_file\", None)\n        if tokenizer_config_file is not None:\n            with open(tokenizer_config_file, encoding=\"utf-8\") as tokenizer_config_handle:\n                init_kwargs = json.load(tokenizer_config_handle)\n            saved_init_inputs = init_kwargs.pop(\"init_inputs\", ())\n            if not init_inputs:\n                init_inputs = saved_init_inputs\n        else:\n            init_kwargs = init_configuration\n\n        # Update with newly provided kwargs\n        init_kwargs.update(kwargs)\n\n        # Set max length if needed\n        if pretrained_model_name_or_path in cls.max_model_input_sizes:\n            # if we're using a pretrained model, ensure the tokenizer\n            # wont index sequences longer than the number of positional embeddings\n            model_max_length = cls.max_model_input_sizes[pretrained_model_name_or_path]\n            if model_max_length is not None and isinstance(model_max_length, (int, float)):\n                init_kwargs[\"model_max_length\"] = min(init_kwargs.get(\"model_max_length\", int(1e30)), model_max_length)\n\n        # Merge resolved_vocab_files arguments in init_kwargs.\n        added_tokens_file = resolved_vocab_files.pop(\"added_tokens_file\", None)\n        special_tokens_map_file = resolved_vocab_files.pop(\"special_tokens_map_file\", None)\n        for args_name, file_path in resolved_vocab_files.items():\n            if args_name not in init_kwargs:\n                init_kwargs[args_name] = file_path\n        if special_tokens_map_file is not None:\n            with open(special_tokens_map_file, encoding=\"utf-8\") as special_tokens_map_handle:\n                special_tokens_map = json.load(special_tokens_map_handle)\n            for key, value in special_tokens_map.items():\n                if key not in init_kwargs:\n                    
# do not overwrite values already set from the tokenizer config or user-provided kwargs\n                    init_kwargs[key] = value\n\n        # Instantiate tokenizer.\n        try:\n            tokenizer = cls(*init_inputs, **init_kwargs)\n        except OSError:\n            raise OSError(\n                \"Unable to load vocabulary from file. \"\n                \"Please check that the provided vocabulary is accessible and not corrupted.\"\n            )\n\n        # Save inputs and kwargs for saving and re-loading with ``save_pretrained``\n        tokenizer.init_inputs = init_inputs\n        tokenizer.init_kwargs = init_kwargs\n\n        # update unique_added_tokens_encoder with special tokens for correct tokenization\n        tokenizer.unique_added_tokens_encoder.update(set(tokenizer.all_special_tokens))\n\n        # Add supplementary tokens.\n        if added_tokens_file is not None:\n            with open(added_tokens_file, encoding=\"utf-8\") as added_tokens_handle:\n                added_tok_encoder = json.load(added_tokens_handle)\n            added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n            tokenizer.added_tokens_encoder.update(added_tok_encoder)\n            tokenizer.added_tokens_decoder.update(added_tok_decoder)\n            tokenizer.unique_added_tokens_encoder.update(set(tokenizer.added_tokens_encoder.keys()))\n\n        return tokenizer\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save the tokenizer vocabulary files together with:\n                - added tokens,\n                - special-tokens-to-class-attributes-mapping,\n                - tokenizer instantiation positional and keyword inputs (e.g. do_lower_case for Bert).\n\n            Warning: This won't save modifications you may have applied to the tokenizer after the instantiation\n            (e.g. modifying tokenizer.do_lower_case after creation).\n\n            This method makes sure the full tokenizer can then be re-loaded using the\n            :func:`~transformers1.PreTrainedTokenizer.from_pretrained` class method.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Saving directory ({}) should be a directory\".format(save_directory))\n            return\n\n        special_tokens_map_file = os.path.join(save_directory, SPECIAL_TOKENS_MAP_FILE)\n        added_tokens_file = os.path.join(save_directory, ADDED_TOKENS_FILE)\n        tokenizer_config_file = os.path.join(save_directory, TOKENIZER_CONFIG_FILE)\n\n        tokenizer_config = copy.deepcopy(self.init_kwargs)\n        if len(self.init_inputs) > 0:\n            tokenizer_config[\"init_inputs\"] = copy.deepcopy(self.init_inputs)\n        for file_id in self.vocab_files_names.keys():\n            tokenizer_config.pop(file_id, None)\n\n        with open(tokenizer_config_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(tokenizer_config, ensure_ascii=False))\n\n        with open(special_tokens_map_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.special_tokens_map, ensure_ascii=False))\n\n        if len(self.added_tokens_encoder) > 0:\n            with open(added_tokens_file, \"w\", encoding=\"utf-8\") as f:\n                out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)\n                f.write(out_str)\n\n        vocab_files = self.save_vocabulary(save_directory)\n\n        return vocab_files + (special_tokens_map_file, added_tokens_file)\n\n    def save_vocabulary(self, save_directory) -> Tuple[str]:\n        \"\"\" Save the tokenizer vocabulary to a directory. 
This method does *NOT* save added tokens\n            and special token mappings.\n\n            Please use :func:`~transformers1.PreTrainedTokenizer.save_pretrained` `()` to save the full\n            Tokenizer state if you want to reload it using the :func:`~transformers1.PreTrainedTokenizer.from_pretrained`\n            class method.\n        \"\"\"\n        raise NotImplementedError\n\n    def add_tokens(self, new_tokens: Union[str, List[str]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string. Each string is a token to add. Tokens are only added if they are not\n            already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.\n        \"\"\"\n        if not new_tokens:\n            return 0\n\n        if not isinstance(new_tokens, list):\n            new_tokens = [new_tokens]\n\n        tokens_to_add = []\n        for token in new_tokens:\n            assert isinstance(token, str)\n            if self.init_kwargs.get(\"do_lower_case\", False) and token not in self.all_special_tokens:\n                token = token.lower()\n            if (\n                token != self.unk_token\n                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)\n                and token not in tokens_to_add\n            ):\n                tokens_to_add.append(token)\n                logger.info(\"Adding %s to the vocabulary\", token)\n\n        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))\n        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n        self.added_tokens_encoder.update(added_tok_encoder)\n        self.unique_added_tokens_encoder = set(self.added_tokens_encoder.keys()).union(set(self.all_special_tokens))\n        self.added_tokens_decoder.update(added_tok_decoder)\n\n        return len(tokens_to_add)\n\n    def num_special_tokens_to_add(self, pair=False):\n        \"\"\"\n        Returns the number of added tokens when encoding a sequence with special tokens.\n\n        Note:\n            This encodes inputs and checks the number of added tokens, and is therefore not efficient. 
Do not put this\n            inside your training loop.\n\n        Args:\n            pair: Returns the number of added tokens in the case of a sequence pair if set to True, returns the\n                number of added tokens in the case of a single sequence if set to False.\n\n        Returns:\n            Number of tokens added to sequences\n        \"\"\"\n        token_ids_0 = []\n        token_ids_1 = []\n        return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))\n\n    def add_special_tokens(self, special_tokens_dict):\n        \"\"\"\n        Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them\n        to class attributes. If special tokens are NOT in the vocabulary, they are added\n        to it (indexed starting from the last index of the current vocabulary).\n\n        Using `add_special_tokens` will ensure your special tokens can be used in several ways:\n\n        - special tokens are carefully handled by the tokenizer (they are never split)\n        - you can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This makes it easy to develop model-agnostic training and fine-tuning scripts.\n\n        When possible, special tokens are already registered for provided pretrained models (ex: BertTokenizer cls_token is already registered to be '[CLS]' and XLM's one is also registered to be '</s>')\n\n        Args:\n            special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes:\n                [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``,\n                ``additional_special_tokens``].\n\n                Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to add a new classification token to GPT-2\n            tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n            model = GPT2Model.from_pretrained('gpt2')\n\n            special_tokens_dict = {'cls_token': '<CLS>'}\n\n            num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n\n            assert tokenizer.cls_token == '<CLS>'\n        \"\"\"\n        if not special_tokens_dict:\n            return 0\n\n        added_tokens = 0\n        for key, value in special_tokens_dict.items():\n            assert key in self.SPECIAL_TOKENS_ATTRIBUTES\n            if key == \"additional_special_tokens\":\n                assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                added_tokens += self.add_tokens(value)\n            else:\n                assert isinstance(value, str)\n                added_tokens += self.add_tokens([value])\n            logger.info(\"Assigning %s to the %s key of the tokenizer\", value, key)\n            setattr(self, key, value)\n\n        return added_tokens\n\n    def tokenize(self, text: TextInput, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Take care of added tokens.\n\n            Args:\n                text (:obj:`string`): The sequence to be encoded.\n                **kwargs (:obj: `dict`): Arguments passed to the model-specific `prepare_for_tokenization` preprocessing method.\n        \"\"\"\n        all_special_tokens = self.all_special_tokens\n        text = self.prepare_for_tokenization(text, **kwargs)\n\n        # TODO: should this be in the base class?\n        def lowercase_text(t):\n            # convert non-special tokens to lowercase\n            escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]\n            pattern = r\"(\" + r\"|\".join(escaped_special_toks) + r\")|\" + r\"(.+?)\"\n            return re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), t)\n\n        if self.init_kwargs.get(\"do_lower_case\", False):\n            text = lowercase_text(text)\n\n        def split_on_token(tok, text):\n            result = []\n            split_text = text.split(tok)\n            for i, sub_text in enumerate(split_text):\n                sub_text = sub_text.rstrip()\n                if i == 0 and not sub_text:\n                    result += [tok]\n                elif i == len(split_text) - 1:\n                    if sub_text:\n                        result += [sub_text]\n                    else:\n                        pass\n                else:\n                    if sub_text:\n                        result += [sub_text]\n                    result += [tok]\n            return result\n\n        def split_on_tokens(tok_list, text):\n            if not text.strip():\n                return []\n            if not tok_list:\n                return self._tokenize(text)\n\n            tokenized_text = []\n            text_list = [text]\n            for tok in tok_list:\n                tokenized_text = []\n                for sub_text in text_list:\n                    if sub_text not in self.unique_added_tokens_encoder:\n                        tokenized_text += split_on_token(tok, sub_text)\n                    else:\n                        tokenized_text += [sub_text]\n                text_list = tokenized_text\n\n            return list(\n                itertools.chain.from_iterable(\n                    (\n                        self._tokenize(token) if token not in self.unique_added_tokens_encoder else [token]\n                        for token in tokenized_text\n                    )\n           
     )\n            )\n\n        added_tokens = self.unique_added_tokens_encoder\n        tokenized_text = split_on_tokens(added_tokens, text)\n        return tokenized_text\n\n    def _tokenize(self, text, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Do NOT take care of added tokens.\n        \"\"\"\n        raise NotImplementedError\n\n    def convert_tokens_to_ids(self, tokens):\n        \"\"\" Converts a token string (or a sequence of tokens) in a single integer id\n            (or a sequence of ids), using the vocabulary.\n        \"\"\"\n        if tokens is None:\n            return None\n\n        if isinstance(tokens, str):\n            return self._convert_token_to_id_with_added_voc(tokens)\n\n        ids = []\n        for token in tokens:\n            ids.append(self._convert_token_to_id_with_added_voc(token))\n        return ids\n\n    def _convert_token_to_id_with_added_voc(self, token):\n        if token is None:\n            return None\n\n        if token in self.added_tokens_encoder:\n            return self.added_tokens_encoder[token]\n        return self._convert_token_to_id(token)\n\n    def _convert_token_to_id(self, token):\n        raise NotImplementedError\n\n    def encode(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        **kwargs\n    ):\n        \"\"\"\n        Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.\n\n        Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary.\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. 
The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            **kwargs: passed to the `self.tokenize()` method\n        \"\"\"\n        encoded_inputs = self.encode_plus(\n            text,\n            text_pair=text_pair,\n            max_length=max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            return_tensors=return_tensors,\n            **kwargs,\n        )\n\n        return encoded_inputs[\"input_ids\"]\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]` (the later only for not-fast tokenizers)):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_mask (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. 
If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_mask (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError.\n                This one is only available on fast tokenizers inheriting from PreTrainedTokenizerFast.\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    attention_mask: list[int] if return_attention_mask is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True``\n                    and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. 
Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. \"\n                \"In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` \"\n                \"or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        first_ids = get_input_ids(text)\n        second_ids = get_input_ids(text_pair) if text_pair is not None else None\n\n        return self.prepare_for_model(\n            first_ids,\n            pair_ids=second_ids,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            return_tensors=return_tensors,\n            return_attention_mask=return_attention_mask,\n            return_token_type_ids=return_token_type_ids,\n            return_overflowing_tokens=return_overflowing_tokens,\n            return_special_tokens_mask=return_special_tokens_mask,\n        )\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput],\n            List[TextInputPair],\n            List[PreTokenizedInput],\n            List[PreTokenizedInputPair],\n            List[EncodedInput],\n            List[EncodedInputPair],\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_masks: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_masks: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            batch_text_or_text_pairs (:obj:`List[str]`,  :obj:`List[Tuple[str, str]]`,\n                                      :obj:`List[List[str]]`,  :obj:`List[Tuple[List[str], List[str]]]`,\n                                      and for not-fast tokenizers, also:\n                                      :obj:`List[List[int]]`,  :obj:`List[Tuple[List[int], List[int]]]`):\n                Batch of sequences or pair of sequences to be encoded.\n                This can be a list of 
string/string-sequences/int-sequences or a list of pair of\n                string/string-sequences/int-sequence (see details in encode_plus)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_masks (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? 
<../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_masks (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError. This one is only available on\n                Rust-based tokenizers inheriting from PreTrainedTokenizerFast.\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[List[int]],\n                    token_type_ids: list[List[int]] if return_token_type_ids is True (default)\n                    attention_mask: list[List[int]] if return_attention_mask is True (default)\n                    overflowing_tokens: list[List[int]] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: List[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[List[int]] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. 
In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        input_ids = []\n        for ids_or_pair_ids in batch_text_or_text_pairs:\n            if isinstance(ids_or_pair_ids, (list, tuple)) and len(ids_or_pair_ids) == 2 and not is_pretokenized:\n                ids, pair_ids = ids_or_pair_ids\n            else:\n                ids, pair_ids = ids_or_pair_ids, None\n\n            first_ids = get_input_ids(ids)\n            second_ids = get_input_ids(pair_ids) if pair_ids is not None else None\n            input_ids.append((first_ids, second_ids))\n\n        if max_length is None and pad_to_max_length:\n\n            def total_sequence_length(input_pairs):\n                first_ids, second_ids = input_pairs\n                return len(first_ids) + (\n                    self.num_special_tokens_to_add()\n                    if second_ids is None\n                    else (len(second_ids) + self.num_special_tokens_to_add(pair=True))\n                )\n\n            max_length = max([total_sequence_length(ids) for ids in input_ids])\n\n        batch_outputs = {}\n        for first_ids, second_ids in input_ids:\n            # Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by\n            # the model. 
It adds special tokens, truncates sequences if overflowing while taking into account\n            # the special tokens and manages a window stride for overflowing tokens\n            outputs = self.prepare_for_model(\n                first_ids,\n                pair_ids=second_ids,\n                max_length=max_length,\n                pad_to_max_length=pad_to_max_length,\n                add_special_tokens=add_special_tokens,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_attention_mask=return_attention_masks,\n                return_token_type_ids=return_token_type_ids,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_masks,\n                return_lengths=return_lengths,\n                return_tensors=None,  # We will convert the whole batch to tensors at the end\n            )\n\n            for key, value in outputs.items():\n                if key not in batch_outputs:\n                    batch_outputs[key] = []\n                batch_outputs[key].append(value)\n\n        if return_tensors is not None:\n\n            self.convert_to_tensors_(batch_outputs, return_tensors)\n        return BatchEncoding(batch_outputs)\n\n    def convert_to_tensors_(self, batch_outputs: dict, return_tensors: str) -> None:\n        # Do the tensor conversion in batch\n        for key, value in batch_outputs.items():\n            if return_tensors == \"tf\" and is_tf_available():\n                try:\n                    batch_outputs[key] = tf.constant(value)\n                except ValueError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n            elif return_tensors == \"pt\" and is_torch_available():\n                try:\n                    batch_outputs[key] = torch.tensor(value)\n                except ValueError:\n                    raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n                except RuntimeError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise\n\n            elif return_tensors is not None:\n                logger.warning(\n                    \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                        return_tensors\n                    )\n                )\n\n    def prepare_for_model(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        max_length: Optional[int] = None,\n        add_special_tokens: bool = True,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_lengths: bool = False,\n    ) -> BatchEncoding:\n        \"\"\" Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.\n        It adds special tokens, truncates sequences if 
overflowing while taking into account the special tokens and\n        manages a moving window (with user defined stride) for overflowing tokens\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            max_length: maximum length of the returned list. Will truncate by taking into account the special tokens.\n            add_special_tokens: if set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            stride: window stride for overflowing tokens. Can be useful to remove edge effect when using sequential\n                list of inputs. The overflowing token will contains a part of the previous window of tokens.\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length: if set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the model's max length.\n                The tokenizer padding sides are handled by the following strings:\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant\n                or PyTorch torch.Tensor instead of a list of python integers.\n            return_token_type_ids: (optional) Set to False to avoid returning token_type_ids (default: set to model specifics).\n            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)\n            return_overflowing_tokens: (optional) Set to True to return overflowing token information (default False).\n            return_special_tokens_mask: (optional) Set to True to return special tokens mask information (default False).\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                    length: int if 
return_lengths is True\n                }\n\n            With the fields:\n                - ``input_ids``: list of token ids to be fed to a model\n                - ``token_type_ids``: list of token type ids to be fed to a model\n\n                - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n                - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n                - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n                    tokens and 1 specifying sequence tokens.\n                - ``length``: this is the length of ``input_ids``\n        \"\"\"\n        pair = bool(pair_ids is not None)\n        len_ids = len(ids)\n        len_pair_ids = len(pair_ids) if pair else 0\n\n        # Load from model defaults\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        encoded_inputs = {}\n\n        # Truncation: Handle max sequence length\n        total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)\n        if max_length and total_len > max_length:\n            ids, pair_ids, overflowing_tokens = self.truncate_sequences(\n                ids,\n                pair_ids=pair_ids,\n                num_tokens_to_remove=total_len - max_length,\n                truncation_strategy=truncation_strategy,\n                stride=stride,\n            )\n            if return_overflowing_tokens:\n                encoded_inputs[\"overflowing_tokens\"] = overflowing_tokens\n                encoded_inputs[\"num_truncated_tokens\"] = total_len - max_length\n\n        # Add special tokens\n        if add_special_tokens:\n            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)\n            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)\n        else:\n            sequence = ids + pair_ids if pair else ids\n            token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])\n\n        # Build output dictionnary\n        encoded_inputs[\"input_ids\"] = sequence\n        if return_token_type_ids:\n            encoded_inputs[\"token_type_ids\"] = token_type_ids\n        if return_special_tokens_mask:\n            if add_special_tokens:\n                encoded_inputs[\"special_tokens_mask\"] = self.get_special_tokens_mask(ids, pair_ids)\n            else:\n                encoded_inputs[\"special_tokens_mask\"] = [0] * len(sequence)\n\n        # Check lengths\n        assert max_length is None or len(encoded_inputs[\"input_ids\"]) <= max_length\n        if max_length is None and len(encoded_inputs[\"input_ids\"]) > self.model_max_length:\n            logger.warning(\n                \"Token indices sequence length is longer than the specified maximum sequence length \"\n                \"for this model ({} > {}). 
Running this sequence through the model will result in \"\n                \"indexing errors\".format(len(ids), self.model_max_length)\n            )\n\n        # Padding\n        needs_to_be_padded = pad_to_max_length and (\n            max_length\n            and len(encoded_inputs[\"input_ids\"]) < max_length\n            or max_length is None\n            and len(encoded_inputs[\"input_ids\"]) < self.model_max_length\n            and self.model_max_length <= LARGE_INTEGER\n        )\n\n        if pad_to_max_length and max_length is None and self.model_max_length > LARGE_INTEGER:\n            logger.warning(\n                \"Sequence can't be padded as no maximum length is specified and the model maximum length is too high.\"\n            )\n\n        if needs_to_be_padded:\n            difference = (max_length if max_length is not None else self.model_max_length) - len(\n                encoded_inputs[\"input_ids\"]\n            )\n            if self.padding_side == \"right\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"]) + [0] * difference\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = (\n                        encoded_inputs[\"token_type_ids\"] + [self.pad_token_type_id] * difference\n                    )\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = encoded_inputs[\"special_tokens_mask\"] + [1] * difference\n                encoded_inputs[\"input_ids\"] = encoded_inputs[\"input_ids\"] + [self.pad_token_id] * difference\n            elif self.padding_side == \"left\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [0] * difference + [1] * len(encoded_inputs[\"input_ids\"])\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = [self.pad_token_type_id] * difference + encoded_inputs[\n                        \"token_type_ids\"\n                    ]\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = [1] * difference + encoded_inputs[\"special_tokens_mask\"]\n                encoded_inputs[\"input_ids\"] = [self.pad_token_id] * difference + encoded_inputs[\"input_ids\"]\n            else:\n                raise ValueError(\"Invalid padding strategy:\" + str(self.padding_side))\n        else:\n            if return_attention_mask:\n                encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"])\n\n        if return_lengths:\n            encoded_inputs[\"length\"] = len(encoded_inputs[\"input_ids\"])\n\n        # Prepare model inputs as tensors if asked\n        if return_tensors == \"tf\" and is_tf_available():\n            encoded_inputs[\"input_ids\"] = tf.constant([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = tf.constant([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = tf.constant([encoded_inputs[\"attention_mask\"]])\n\n        elif return_tensors == \"pt\" and is_torch_available():\n            encoded_inputs[\"input_ids\"] = torch.tensor([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = 
torch.tensor([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = torch.tensor([encoded_inputs[\"attention_mask\"]])\n        elif return_tensors is not None:\n            logger.warning(\n                \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                    return_tensors\n                )\n            )\n\n        return BatchEncoding(encoded_inputs)\n\n    def prepare_for_tokenization(self, text: str, **kwargs) -> str:\n        \"\"\" Performs any necessary transformations before tokenization \"\"\"\n        return text\n\n    def truncate_sequences(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        num_tokens_to_remove: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        stride: int = 0,\n    ) -> Tuple[List[int], List[int], List[int]]:\n        \"\"\" Truncates a sequence pair in place to the maximum length.\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):\n                number of tokens to remove using the truncation strategy\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences).\n                    Overflowing tokens only contains overflow from the first sequence.\n                - 'only_first': Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. 
The value of this argument defines the number of additional tokens.\n        \"\"\"\n        if num_tokens_to_remove <= 0:\n            return ids, pair_ids, []\n\n        if truncation_strategy == \"longest_first\":\n            overflowing_tokens = []\n            for _ in range(num_tokens_to_remove):\n                if pair_ids is None or len(ids) > len(pair_ids):\n                    overflowing_tokens = [ids[-1]] + overflowing_tokens\n                    ids = ids[:-1]\n                else:\n                    pair_ids = pair_ids[:-1]\n            window_len = min(len(ids), stride)\n            if window_len > 0:\n                overflowing_tokens = ids[-window_len:] + overflowing_tokens\n        elif truncation_strategy == \"only_first\":\n            assert len(ids) > num_tokens_to_remove\n            window_len = min(len(ids), stride + num_tokens_to_remove)\n            overflowing_tokens = ids[-window_len:]\n            ids = ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"only_second\":\n            assert pair_ids is not None and len(pair_ids) > num_tokens_to_remove\n            window_len = min(len(pair_ids), stride + num_tokens_to_remove)\n            overflowing_tokens = pair_ids[-window_len:]\n            pair_ids = pair_ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"do_not_truncate\":\n            raise ValueError(\"Input sequence are too long for max_length. Please select a truncation strategy.\")\n        else:\n            raise ValueError(\n                \"Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']\"\n            )\n        return (ids, pair_ids, overflowing_tokens)\n\n    def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:\n        if token_ids_1 is None:\n            return len(token_ids_0) * [0]\n        return [0] * len(token_ids_0) + [1] * len(token_ids_1)\n\n    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens. This implementation does not add special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return token_ids_0\n        return token_ids_0 + token_ids_1\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0: list of ids (must not contain special tokens)\n            token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids\n                for sequence pairs\n            already_has_special_tokens: (default False) Set to True if the token list is already formated with\n                special tokens for the model\n\n        Returns:\n            A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))\n\n    def convert_ids_to_tokens(\n        self, ids: Union[int, List[int]], skip_special_tokens: bool = False\n    ) -> Union[int, List[int]]:\n        \"\"\" Converts a single index or a sequence of indices (integers) in a token \"\n            (resp.) a sequence of tokens (str), using the vocabulary and added tokens.\n\n            Args:\n                skip_special_tokens: Don't decode special tokens (self.all_special_tokens). Default: False\n        \"\"\"\n        if isinstance(ids, int):\n            if ids in self.added_tokens_decoder:\n                return self.added_tokens_decoder[ids]\n            else:\n                return self._convert_id_to_token(ids)\n        tokens = []\n        for index in ids:\n            index = int(index)\n            if skip_special_tokens and index in self.all_special_ids:\n                continue\n            if index in self.added_tokens_decoder:\n                tokens.append(self.added_tokens_decoder[index])\n            else:\n                tokens.append(self._convert_id_to_token(index))\n        return tokens\n\n    def _convert_id_to_token(self, index: int) -> str:\n        raise NotImplementedError\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\" Converts a sequence of tokens (string) in a single string.\n            The most simple way to do it is ' '.join(self.convert_ids_to_tokens(token_ids))\n            but we often want to remove sub-word tokenization artifacts at the same time.\n        \"\"\"\n        return \" \".join(self.convert_ids_to_tokens(tokens))\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        \"\"\"\n        Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary\n        with options to remove special tokens and clean up tokenization spaces.\n        Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.\n\n        Args:\n            token_ids: list of tokenized input ids. Can be obtained using the `encode` or `encode_plus` methods.\n            skip_special_tokens: if set to True, will replace special tokens.\n            clean_up_tokenization_spaces: if set to True, will clean up the tokenization spaces.\n        \"\"\"\n        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)\n\n        # To avoid mixing byte-level and unicode for byte-level BPT\n        # we need to build string separatly for added tokens and byte-level tokens\n        # cf. 
https://github.com/huggingface/transformers/issues/1133\n        sub_texts = []\n        current_sub_text = []\n        for token in filtered_tokens:\n            if skip_special_tokens and token in self.all_special_ids:\n                continue\n            if token in self.added_tokens_encoder:\n                if current_sub_text:\n                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n                    current_sub_text = []\n                sub_texts.append(token)\n            else:\n                current_sub_text.append(token)\n        if current_sub_text:\n            sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n        text = \" \".join(sub_texts)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:\n        return [self.decode(seq, **kwargs) for seq in sequences]\n\n    @staticmethod\n    def clean_up_tokenization(out_string: str) -> str:\n        \"\"\" Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.\n        \"\"\"\n        out_string = (\n            out_string.replace(\" .\", \".\")\n            .replace(\" ?\", \"?\")\n            .replace(\" !\", \"!\")\n            .replace(\" ,\", \",\")\n            .replace(\" ' \", \"'\")\n            .replace(\" n't\", \"n't\")\n            .replace(\" 'm\", \"'m\")\n            .replace(\" 's\", \"'s\")\n            .replace(\" 've\", \"'ve\")\n            .replace(\" 're\", \"'re\")\n        )\n        return out_string\n\n\nclass PreTrainedTokenizerFast(PreTrainedTokenizer):\n    \"\"\" Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).\n\n    Inherit from PreTrainedTokenizer.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as associated values, a dictionnary of specific arguments to pass to the\n            
``__init__``method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``tokenizer`` (`BaseTokenizerFast`): A Fast tokenizer from the HuggingFace tokenizer library (in low level Rust language)\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, will default to VERY_LARGE_INTEGER (`int(1e30)`).\n            no associated max_length can be found in ``max_model_input_sizes``.\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). 
Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensure they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    def __init__(self, tokenizer: BaseTokenizerFast, **kwargs):\n        if not isinstance(tokenizer, BaseTokenizerFast):\n            raise ValueError(\n                \"Tokenizer should be an instance of a Tokenizer \" \"provided by HuggingFace tokenizers library.\"\n            )\n        self._tokenizer: BaseTokenizerFast = tokenizer\n\n        # Initialize all the rest of the kwargs\n        super().__init__(**kwargs)\n\n    @property\n    def backend_tokenizer(self) -> BaseTokenizerFast:\n        return self._tokenizer\n\n    @property\n    def decoder(self) -> DecoderFast:\n        return self._tokenizer._tokenizer.decoder\n\n    @property\n    def is_fast(self) -> bool:\n        return True\n\n    @property\n    def vocab_size(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=False)\n\n    def __len__(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=True)\n\n    def _maybe_update_backend(self, value):\n        \"\"\" Update the backend fast tokenizer.\n            Override method from base class SpecialTokensMixin \"\"\"\n        self._tokenizer.add_special_tokens(value)\n\n    def _convert_encoding(\n        self,\n        encoding: EncodingFast,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n    ) -> Dict[str, Any]:\n        \"\"\" Convert the encoding representation (from low-level HuggingFace tokenizer output) to a python Dict.\n\n            Overflowing tokens are converted to additional examples (like batches) so the output values of\n            the dict are lists (overflows) of lists (tokens).\n\n            If return_tensors is not None, these lists of lists are converted to 2-D tensors\n            for input_ids, token_type_ids and attention_mask.\n            Output shape: (overflows, sequence length)\n        \"\"\"\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        if return_overflowing_tokens and encoding.overflowing is not None:\n            encodings = [encoding] + encoding.overflowing\n        else:\n            encodings = [encoding]\n\n        encoding_dict = defaultdict(list)\n        for e in encodings:\n            encoding_dict[\"input_ids\"].append(e.ids)\n\n            if return_token_type_ids:\n                encoding_dict[\"token_type_ids\"].append(e.type_ids)\n            if return_attention_mask:\n                encoding_dict[\"attention_mask\"].append(e.attention_mask)\n            if return_special_tokens_mask:\n                encoding_dict[\"special_tokens_mask\"].append(e.special_tokens_mask)\n            if return_offsets_mapping:\n                encoding_dict[\"offset_mapping\"].append(e.offsets)\n\n        if return_tensors is not 
None:\n            for key, value in encoding_dict.items():\n                if return_tensors == \"tf\" and is_tf_available():\n                    encoding_dict[key] = tf.constant(value)\n                elif return_tensors == \"pt\" and is_torch_available():\n                    encoding_dict[key] = torch.tensor(value)\n                elif return_tensors is not None:\n                    logger.warning(\n                        \"Unable to convert output to tensors format {}, \"\n                        \"PyTorch or TensorFlow is not available.\".format(return_tensors)\n                    )\n\n        return encoding_dict\n\n    def _convert_token_to_id_with_added_voc(self, token: int) -> str:\n        index = self._tokenizer.token_to_id(token)\n        if index is None:\n            return self.unk_token_id\n        return index\n\n    def _convert_id_to_token(self, index: int) -> Optional[str]:\n        return self._tokenizer.id_to_token(int(index))\n\n    def get_vocab(self):\n        return self._tokenizer.get_vocab(True)\n\n    def convert_tokens_to_string(self, tokens: List[int], skip_special_tokens: bool = False) -> str:\n        return self._tokenizer.decode(tokens, skip_special_tokens)\n\n    def add_tokens(self, new_tokens: List[Union[str, AddedTokenFast]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string or AddedTokenFast. Each string is a token to add.\n            Tokens are only added if they are not already in the vocabulary. AddedTokenFast wrap a string token to let you personnalize it's behavior (Whether this token should only match against single word, whether this token should strip all potential whitespaces on the left side, Whether this token should strip all potential whitespaces on the right side...).\n            See details for AddedToken in HuggingFace tokenizers library.\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n        \"\"\"\n        if isinstance(new_tokens, str):\n            new_tokens = [new_tokens]\n        return self._tokenizer.add_tokens(new_tokens)\n\n    def add_special_tokens(self, special_tokens_dict: dict) -> int:\n        # Map special tokens to class attributes (self.pad_token...)\n        super().add_special_tokens(special_tokens_dict)\n\n        # If the backend tokenizer the only specificities of special tokens are that\n        #    - they will never be processed by the model, and\n        #    - they will be removed while decoding.\n        # But they are not mapped to special attributes in the backend so we can just\n        # send a list.\n        tokens = []\n        for token in special_tokens_dict.values():\n            if isinstance(token, list):\n                tokens += token\n            else:\n                tokens += [token]\n        num_added_tokens = self._tokenizer.add_special_tokens(tokens)\n\n        return num_added_tokens\n\n    def num_special_tokens_to_add(self, pair: bool = False) -> int:\n        return self._tokenizer.num_special_tokens_to_add(pair)\n\n    def tokenize(\n        self, text: TextInput, pair: Optional[TextInput] = None, add_special_tokens: bool = False\n    ) -> List[str]:\n        return self._tokenizer.encode(text, pair, add_special_tokens).tokens\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        if not isinstance(batch_text_or_text_pairs, list):\n            raise ValueError(\n                \"batch_text_or_text_pairs has to be a list (got {})\".format(type(batch_text_or_text_pairs))\n            )\n\n        # Needed if we have to return a tensor\n        pad_to_max_length = pad_to_max_length or (return_tensors is not None and len(batch_text_or_text_pairs) > 1)\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\"Unable to set proper padding strategy as the tokenizer does not have a padding token\")\n\n        # Set the truncation and padding strategy and restore the initial configuration\n        with truncate_and_pad(\n            tokenizer=self._tokenizer,\n            max_length=max_length,\n            stride=stride,\n            strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            padding_side=self.padding_side,\n            pad_token_id=self.pad_token_id,\n            pad_token_type_id=self.pad_token_type_id,\n            pad_token=self._pad_token,\n        ):\n\n            # Check for the pretokenized path\n            if is_pretokenized:\n                encodings = []\n\n                # Iterate over each sample (we don't know yet if they are pairs or 
simple input\n                for i, sample in enumerate(batch_text_or_text_pairs):\n\n                    if not isinstance(sample, (list, tuple)):\n                        raise TypeError(\n                            \"batch_encode_plus(..., is_pretokenized=True) requires batch_text_or_text_pairs \"\n                            \"to be either List[List[str]] or List[Tuple[List[str], List[str]]] but sample at \"\n                            \"index {} is of type {}\".format(i, type(sample))\n                        )\n\n                    # Test if we have a pair of sentences by checking the depth of nesting\n                    is_pair = bool(len(sample) > 0 and isinstance(sample[0], (list, tuple)))\n\n                    # Take care of the first sequence - we multi-thread over the words\n                    encodings_text = EncodingFast.merge(\n                        self._tokenizer.encode_batch(sample[0] if is_pair else sample, add_special_tokens=False),\n                        growing_offsets=True,\n                    )\n\n                    # Take care of the second sequence if we have a pair\n                    if is_pair:\n                        encodings_pair = EncodingFast.merge(\n                            self._tokenizer.encode_batch([(\"\", s) for s in sample[1]], add_special_tokens=False),\n                            growing_offsets=True,\n                        )\n                    else:\n                        encodings_pair = None\n\n                    # Post-process - truncate/pad and add special tokens\n                    encoding = self._tokenizer.post_process(encodings_text, encodings_pair, add_special_tokens)\n                    encodings.append(encoding)\n\n            # Classical path with strings input\n            else:\n                # Avoid thread overhead if only one example.\n                if len(batch_text_or_text_pairs) == 1:\n                    if isinstance(batch_text_or_text_pairs[0], (tuple, list)):\n                        encodings = self._tokenizer.encode(\n                            *batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    else:\n                        encodings = self._tokenizer.encode(\n                            batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    encodings = [encodings]\n                else:\n                    encodings = self._tokenizer.encode_batch(\n                        batch_text_or_text_pairs, add_special_tokens=add_special_tokens\n                    )\n\n        # Convert encoding to dict\n        # `Tokens` has type: List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]]\n        # with nested dimensions corresponding to batch, overflows, sequence length\n        tokens = [\n            self._convert_encoding(\n                encoding=encoding,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n            )\n            for encoding in encodings\n        ]\n\n        # Sanitize the output to have dict[list] from list[dict]\n        sanitized = {}\n        for key in tokens[0].keys():\n            # To List[List[List[int]]] of 
shape (batch, overflows, sequence length)\n            stack = [e for item in tokens for e in item[key]]\n            if return_tensors == \"tf\":\n                stack = tf.stack(stack, axis=0)\n            elif return_tensors == \"pt\":\n                stack = torch.stack(stack, dim=0)\n            # elif not return_tensors and len(stack) == 1:\n            #     stack = stack[0]\n\n            sanitized[key] = stack\n\n        # If returning overflowing tokens, we need to return a mapping\n        # from the batch idx to the original sample\n        if return_overflowing_tokens:\n            overflow_to_sample_mapping = flatten([[i] * len(enc[\"input_ids\"]) for i, enc in enumerate(tokens)])\n            sanitized[\"overflow_to_sample_mapping\"] = overflow_to_sample_mapping\n\n        return BatchEncoding(sanitized, encodings)\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = False,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        is_pretokenized: bool = False,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        # Check for pretokenized path (ie [token1, token2, ..., tokenN] -> [id1, id2, ..., idN]\n        if is_pretokenized:\n            if isinstance(text, list) and len(text) > 0:\n\n                # Encode through encode_batch with sequence of only one word which will be merged after hand\n                encoding = self._tokenizer.encode_batch(text, add_special_tokens=False)\n                encoding = EncodingFast.merge(encoding, growing_offsets=True)\n\n                # Let's do the same for pairs if provided\n                if isinstance(text_pair, list):\n                    # We prepend empty string before each word so that encoding is aware content is a pair\n                    encoding_pair = self._tokenizer.encode_batch(\n                        [(\"\", p) for p in text_pair], add_special_tokens=False\n                    )\n                    encoding_pair = EncodingFast.merge(encoding_pair, growing_offsets=True)\n                elif text_pair is None:\n                    encoding_pair = None\n                else:\n                    raise TypeError(\n                        \"encode_plus(..., is_pretokenized=True) requires text and text_pair to be List[str] \"\n                        \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                    )\n\n                # Post process and if asked to do so, insert special tokens where needed\n                encoding = self._tokenizer.post_process(encoding, encoding_pair, add_special_tokens)\n\n                batched_output = BatchEncoding(\n                    self._convert_encoding(\n                        encoding,\n                        return_tensors=return_tensors,\n                        return_token_type_ids=return_token_type_ids,\n                        return_attention_mask=return_attention_mask,\n                        return_overflowing_tokens=return_overflowing_tokens,\n                       
 return_special_tokens_mask=return_special_tokens_mask,\n                        return_offsets_mapping=return_offsets_mapping,\n                    ),\n                    encoding,\n                )\n            else:\n                raise TypeError(\n                    \"encode_plus(..., is_pretokenized=True) requires text to be List[str] \"\n                    \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                )\n        else:\n            batched_input = [(text, text_pair)] if text_pair else [text]\n            batched_output = self.batch_encode_plus(\n                batched_input,\n                add_special_tokens=add_special_tokens,\n                max_length=max_length,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n                pad_to_max_length=pad_to_max_length,\n                **kwargs,\n            )\n\n        # Return tensor is None, then we can remove the leading batch axis\n        if not return_tensors:\n            batched_output = BatchEncoding(\n                {\n                    key: value[0] if len(value) > 0 and isinstance(value[0], list) else value\n                    for key, value in batched_output.items()\n                },\n                batched_output.encodings,\n            )\n\n        return batched_output\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        text = self._tokenizer.decode(token_ids, skip_special_tokens)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        if os.path.isdir(save_directory):\n            files = self._tokenizer.save(save_directory)\n        else:\n            folder, file = os.path.split(os.path.abspath(save_directory))\n            files = self._tokenizer.save(folder, name=file)\n\n        return tuple(files)\n\n\ndef trim_batch(\n    input_ids, pad_token_id, attention_mask=None,\n):\n    \"\"\"Remove columns that are populated exclusively by pad_token_id\"\"\"\n    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)\n    if attention_mask is None:\n        return input_ids[:, keep_column_mask]\n    else:\n        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for XLM.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\nimport sys\nimport unicodedata\nfrom typing import List, Optional\n\nimport sacremoses as sm\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-vocab.json\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-vocab.json\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-vocab.json\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-vocab.json\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-vocab.json\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-vocab.json\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-vocab.json\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-vocab.json\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-vocab.json\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-vocab.json\",\n    },\n    \"merges_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-merges.txt\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-merges.txt\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-merges.txt\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-merges.txt\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-merges.txt\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-merges.txt\",\n    
},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-mlm-en-2048\": 512,\n    \"xlm-mlm-ende-1024\": 512,\n    \"xlm-mlm-enfr-1024\": 512,\n    \"xlm-mlm-enro-1024\": 512,\n    \"xlm-mlm-tlm-xnli15-1024\": 512,\n    \"xlm-mlm-xnli15-1024\": 512,\n    \"xlm-clm-enfr-1024\": 512,\n    \"xlm-clm-ende-1024\": 512,\n    \"xlm-mlm-17-1280\": 512,\n    \"xlm-mlm-100-1280\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"xlm-mlm-en-2048\": {\"do_lowercase_and_remove_accent\": True},\n    \"xlm-mlm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-mlm-enro-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"ro\"},\n        \"lang2id\": {\"en\": 0, \"ro\": 1},\n    },\n    \"xlm-mlm-tlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-mlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-clm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-clm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-17-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"de\",\n            \"2\": \"en\",\n            
\"3\": \"es\",\n            \"4\": \"fr\",\n            \"5\": \"hi\",\n            \"6\": \"it\",\n            \"7\": \"ja\",\n            \"8\": \"ko\",\n            \"9\": \"nl\",\n            \"10\": \"pl\",\n            \"11\": \"pt\",\n            \"12\": \"ru\",\n            \"13\": \"sv\",\n            \"14\": \"tr\",\n            \"15\": \"vi\",\n            \"16\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"de\": 1,\n            \"en\": 2,\n            \"es\": 3,\n            \"fr\": 4,\n            \"hi\": 5,\n            \"it\": 6,\n            \"ja\": 7,\n            \"ko\": 8,\n            \"nl\": 9,\n            \"pl\": 10,\n            \"pt\": 11,\n            \"ru\": 12,\n            \"sv\": 13,\n            \"tr\": 14,\n            \"vi\": 15,\n            \"zh\": 16,\n        },\n    },\n    \"xlm-mlm-100-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"af\",\n            \"1\": \"als\",\n            \"2\": \"am\",\n            \"3\": \"an\",\n            \"4\": \"ang\",\n            \"5\": \"ar\",\n            \"6\": \"arz\",\n            \"7\": \"ast\",\n            \"8\": \"az\",\n            \"9\": \"bar\",\n            \"10\": \"be\",\n            \"11\": \"bg\",\n            \"12\": \"bn\",\n            \"13\": \"br\",\n            \"14\": \"bs\",\n            \"15\": \"ca\",\n            \"16\": \"ceb\",\n            \"17\": \"ckb\",\n            \"18\": \"cs\",\n            \"19\": \"cy\",\n            \"20\": \"da\",\n            \"21\": \"de\",\n            \"22\": \"el\",\n            \"23\": \"en\",\n            \"24\": \"eo\",\n            \"25\": \"es\",\n            \"26\": \"et\",\n            \"27\": \"eu\",\n            \"28\": \"fa\",\n            \"29\": \"fi\",\n            \"30\": \"fr\",\n            \"31\": \"fy\",\n            \"32\": \"ga\",\n            \"33\": \"gan\",\n            \"34\": \"gl\",\n            \"35\": \"gu\",\n            \"36\": \"he\",\n            \"37\": \"hi\",\n            \"38\": \"hr\",\n            \"39\": \"hu\",\n            \"40\": \"hy\",\n            \"41\": \"ia\",\n            \"42\": \"id\",\n            \"43\": \"is\",\n            \"44\": \"it\",\n            \"45\": \"ja\",\n            \"46\": \"jv\",\n            \"47\": \"ka\",\n            \"48\": \"kk\",\n            \"49\": \"kn\",\n            \"50\": \"ko\",\n            \"51\": \"ku\",\n            \"52\": \"la\",\n            \"53\": \"lb\",\n            \"54\": \"lt\",\n            \"55\": \"lv\",\n            \"56\": \"mk\",\n            \"57\": \"ml\",\n            \"58\": \"mn\",\n            \"59\": \"mr\",\n            \"60\": \"ms\",\n            \"61\": \"my\",\n            \"62\": \"nds\",\n            \"63\": \"ne\",\n            \"64\": \"nl\",\n            \"65\": \"nn\",\n            \"66\": \"no\",\n            \"67\": \"oc\",\n            \"68\": \"pl\",\n            \"69\": \"pt\",\n            \"70\": \"ro\",\n            \"71\": \"ru\",\n            \"72\": \"scn\",\n            \"73\": \"sco\",\n            \"74\": \"sh\",\n            \"75\": \"si\",\n            \"76\": \"simple\",\n            \"77\": \"sk\",\n            \"78\": \"sl\",\n            \"79\": \"sq\",\n            \"80\": \"sr\",\n            \"81\": \"sv\",\n            \"82\": \"sw\",\n            \"83\": \"ta\",\n            \"84\": \"te\",\n            \"85\": \"th\",\n            \"86\": \"tl\",\n            \"87\": \"tr\",\n            \"88\": \"tt\",\n      
      \"89\": \"uk\",\n            \"90\": \"ur\",\n            \"91\": \"uz\",\n            \"92\": \"vi\",\n            \"93\": \"war\",\n            \"94\": \"wuu\",\n            \"95\": \"yi\",\n            \"96\": \"zh\",\n            \"97\": \"zh_classical\",\n            \"98\": \"zh_min_nan\",\n            \"99\": \"zh_yue\",\n        },\n        \"lang2id\": {\n            \"af\": 0,\n            \"als\": 1,\n            \"am\": 2,\n            \"an\": 3,\n            \"ang\": 4,\n            \"ar\": 5,\n            \"arz\": 6,\n            \"ast\": 7,\n            \"az\": 8,\n            \"bar\": 9,\n            \"be\": 10,\n            \"bg\": 11,\n            \"bn\": 12,\n            \"br\": 13,\n            \"bs\": 14,\n            \"ca\": 15,\n            \"ceb\": 16,\n            \"ckb\": 17,\n            \"cs\": 18,\n            \"cy\": 19,\n            \"da\": 20,\n            \"de\": 21,\n            \"el\": 22,\n            \"en\": 23,\n            \"eo\": 24,\n            \"es\": 25,\n            \"et\": 26,\n            \"eu\": 27,\n            \"fa\": 28,\n            \"fi\": 29,\n            \"fr\": 30,\n            \"fy\": 31,\n            \"ga\": 32,\n            \"gan\": 33,\n            \"gl\": 34,\n            \"gu\": 35,\n            \"he\": 36,\n            \"hi\": 37,\n            \"hr\": 38,\n            \"hu\": 39,\n            \"hy\": 40,\n            \"ia\": 41,\n            \"id\": 42,\n            \"is\": 43,\n            \"it\": 44,\n            \"ja\": 45,\n            \"jv\": 46,\n            \"ka\": 47,\n            \"kk\": 48,\n            \"kn\": 49,\n            \"ko\": 50,\n            \"ku\": 51,\n            \"la\": 52,\n            \"lb\": 53,\n            \"lt\": 54,\n            \"lv\": 55,\n            \"mk\": 56,\n            \"ml\": 57,\n            \"mn\": 58,\n            \"mr\": 59,\n            \"ms\": 60,\n            \"my\": 61,\n            \"nds\": 62,\n            \"ne\": 63,\n            \"nl\": 64,\n            \"nn\": 65,\n            \"no\": 66,\n            \"oc\": 67,\n            \"pl\": 68,\n            \"pt\": 69,\n            \"ro\": 70,\n            \"ru\": 71,\n            \"scn\": 72,\n            \"sco\": 73,\n            \"sh\": 74,\n            \"si\": 75,\n            \"simple\": 76,\n            \"sk\": 77,\n            \"sl\": 78,\n            \"sq\": 79,\n            \"sr\": 80,\n            \"sv\": 81,\n            \"sw\": 82,\n            \"ta\": 83,\n            \"te\": 84,\n            \"th\": 85,\n            \"tl\": 86,\n            \"tr\": 87,\n            \"tt\": 88,\n            \"uk\": 89,\n            \"ur\": 90,\n            \"uz\": 91,\n            \"vi\": 92,\n            \"war\": 93,\n            \"wuu\": 94,\n            \"yi\": 95,\n            \"zh\": 96,\n            \"zh_classical\": 97,\n            \"zh_min_nan\": 98,\n            \"zh_yue\": 99,\n        },\n    },\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef lowercase_and_remove_accent(text):\n    \"\"\"\n    Lowercase and strips accents from a piece of text based on\n    https://github.com/facebookresearch/XLM/blob/master/tools/lowercase_and_remove_accent.py\n    \"\"\"\n    text = \" \".join(text)\n    text = text.lower()\n    text = 
unicodedata.normalize(\"NFD\", text)\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat == \"Mn\":\n            continue\n        output.append(char)\n    return \"\".join(output).lower().split(\" \")\n\n\ndef replace_unicode_punct(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/replace-unicode-punctuation.perl\n    \"\"\"\n    text = text.replace(\"，\", \",\")\n    text = re.sub(r\"。\\s*\", \". \", text)\n    text = text.replace(\"、\", \",\")\n    text = text.replace(\"”\", '\"')\n    text = text.replace(\"“\", '\"')\n    text = text.replace(\"∶\", \":\")\n    text = text.replace(\"：\", \":\")\n    text = text.replace(\"？\", \"?\")\n    text = text.replace(\"《\", '\"')\n    text = text.replace(\"》\", '\"')\n    text = text.replace(\"）\", \")\")\n    text = text.replace(\"！\", \"!\")\n    text = text.replace(\"（\", \"(\")\n    text = text.replace(\"；\", \";\")\n    text = text.replace(\"１\", \"1\")\n    text = text.replace(\"」\", '\"')\n    text = text.replace(\"「\", '\"')\n    text = text.replace(\"０\", \"0\")\n    text = text.replace(\"３\", \"3\")\n    text = text.replace(\"２\", \"2\")\n    text = text.replace(\"５\", \"5\")\n    text = text.replace(\"６\", \"6\")\n    text = text.replace(\"９\", \"9\")\n    text = text.replace(\"７\", \"7\")\n    text = text.replace(\"８\", \"8\")\n    text = text.replace(\"４\", \"4\")\n    text = re.sub(r\"．\\s*\", \". \", text)\n    text = text.replace(\"～\", \"~\")\n    text = text.replace(\"’\", \"'\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"━\", \"-\")\n    text = text.replace(\"〈\", \"<\")\n    text = text.replace(\"〉\", \">\")\n    text = text.replace(\"【\", \"[\")\n    text = text.replace(\"】\", \"]\")\n    text = text.replace(\"％\", \"%\")\n    return text\n\n\ndef remove_non_printing_char(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/remove-non-printing-char.perl\n    \"\"\"\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat.startswith(\"C\"):\n            continue\n        output.append(char)\n    return \"\".join(output)\n\n\ndef romanian_preprocessing(text):\n    \"\"\"Sennrich's WMT16 scripts for Romanian preprocessing, used by model `xlm-mlm-enro-1024`\"\"\"\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/normalise-romanian.py\n    text = text.replace(\"\\u015e\", \"\\u0218\").replace(\"\\u015f\", \"\\u0219\")\n    text = text.replace(\"\\u0162\", \"\\u021a\").replace(\"\\u0163\", \"\\u021b\")\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/remove-diacritics.py\n    text = text.replace(\"\\u0218\", \"S\").replace(\"\\u0219\", \"s\")  # s-comma\n    text = text.replace(\"\\u021a\", \"T\").replace(\"\\u021b\", \"t\")  # t-comma\n    text = text.replace(\"\\u0102\", \"A\").replace(\"\\u0103\", \"a\")\n    text = text.replace(\"\\u00C2\", \"A\").replace(\"\\u00E2\", \"a\")\n    text = text.replace(\"\\u00CE\", \"I\").replace(\"\\u00EE\", \"i\")\n    return text\n\n\nclass XLMTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer for XLM\n\n    - Moses preprocessing & tokenization for most supported languages\n    - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP)\n    - (optionally) lower case & normalize all inputs text\n    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add 
additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `lang2id` attribute maps the languages supported by the model with their ids if provided (automatically set for pretrained vocabularies)\n    - `id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            Vocabulary file.\n        merges_file (:obj:`string`):\n            Merges file.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<special1>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<special0>\",\"<special1>\",\"<special2>\",\"<special3>\",\"<special4>\",\"<special5>\",\"<special6>\",\"<special7>\",\"<special8>\",\"<special9>\"]`):\n            List of additional special tokens.\n        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping languages string identifiers to their IDs.\n        id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping language IDs to their string identifiers.\n        do_lowercase_and_remove_accent (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase and remove accents when tokenizing.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<unk>\",\n        bos_token=\"<s>\",\n        sep_token=\"</s>\",\n        pad_token=\"<pad>\",\n        cls_token=\"</s>\",\n        mask_token=\"<special1>\",\n        additional_special_tokens=[\n            \"<special0>\",\n            \"<special1>\",\n            \"<special2>\",\n            \"<special3>\",\n            \"<special4>\",\n            \"<special5>\",\n            \"<special6>\",\n            \"<special7>\",\n            \"<special8>\",\n            \"<special9>\",\n        ],\n        lang2id=None,\n        id2lang=None,\n        do_lowercase_and_remove_accent=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            bos_token=bos_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        # cache of sm.MosesPunctNormalizer instance\n        self.cache_moses_punct_normalizer = dict()\n        # cache of sm.MosesTokenizer instance\n        self.cache_moses_tokenizer = dict()\n        self.lang_with_custom_tokenizer = set([\"zh\", \"th\", \"ja\"])\n        # True for current supported model (v1.2.0), False for XLM-17 & 100\n        self.do_lowercase_and_remove_accent = do_lowercase_and_remove_accent\n        self.lang2id = lang2id\n        self.id2lang = id2lang\n        if lang2id is not None and id2lang is not None:\n            assert len(lang2id) == len(id2lang)\n\n        self.ja_word_tokenizer = None\n        self.zh_word_tokenizer = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[:-1]\n        merges = [tuple(merge.split()[:2]) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    def moses_punct_norm(self, text, lang):\n        if lang not in self.cache_moses_punct_normalizer:\n            punct_normalizer = sm.MosesPunctNormalizer(lang=lang)\n            self.cache_moses_punct_normalizer[lang] = punct_normalizer\n        else:\n            punct_normalizer 
= self.cache_moses_punct_normalizer[lang]\n        return punct_normalizer.normalize(text)\n\n    def moses_tokenize(self, text, lang):\n        if lang not in self.cache_moses_tokenizer:\n            moses_tokenizer = sm.MosesTokenizer(lang=lang)\n            self.cache_moses_tokenizer[lang] = moses_tokenizer\n        else:\n            moses_tokenizer = self.cache_moses_tokenizer[lang]\n        return moses_tokenizer.tokenize(text, return_str=False, escape=False)\n\n    def moses_pipeline(self, text, lang):\n        text = replace_unicode_punct(text)\n        text = self.moses_punct_norm(text, lang)\n        text = remove_non_printing_char(text)\n        return text\n\n    def ja_tokenize(self, text):\n        if self.ja_word_tokenizer is None:\n            try:\n                import Mykytea\n\n                self.ja_word_tokenizer = Mykytea.Mykytea(\n                    \"-model %s/local/share/kytea/model.bin\" % os.path.expanduser(\"~\")\n                )\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install KyTea (https://github.com/neubig/kytea) and it's python wrapper (https://github.com/chezou/Mykytea-python) with the following steps\"\n                )\n                logger.error(\"1. git clone git@github.com:neubig/kytea.git && cd kytea\")\n                logger.error(\"2. autoreconf -i\")\n                logger.error(\"3. ./configure --prefix=$HOME/local\")\n                logger.error(\"4. make && make install\")\n                logger.error(\"5. pip install kytea\")\n                raise\n        return list(self.ja_word_tokenizer.getWS(text))\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text, lang=\"en\", bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code. For Chinese, Japanese and Thai, we use a language specific tokenizerself. 
Otherwise, we use Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n        - [pythainlp](https://github.com/PyThaiNLP/pythainlp): Thai tokenizer\n            - Install with `pip install pythainlp`\n        - [kytea](https://github.com/chezou/Mykytea-python): Japanese tokenizer, wrapper of [KyTea](https://github.com/neubig/kytea)\n            - Install with the following steps:\n            ```\n            git clone git@github.com:neubig/kytea.git && cd kytea\n            autoreconf -i\n            ./configure --prefix=$HOME/local\n            make && make install\n            pip install kytea\n            ```\n        - [jieba](https://github.com/fxsjy/jieba): Chinese tokenizer (*)\n            - Install with `pip install jieba`\n\n        (*) The original XLM used [Stanford Segmenter](https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip).\n        However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead, and it will be deprecated.\n        Jieba is a lot faster and pip-installable. Note there is some mismatch with the Stanford Segmenter. It should be fine\n        if you fine-tune the model with Chinese supervisionself. If you want the same exact behaviour, use the original XLM\n        [preprocessing script](https://github.com/facebookresearch/XLM/tree/master/tools) to tokenize the sentence externally,\n        and set `bypass_tokenizer=True` to bypass the tokenizer.\n\n        Args:\n            - lang: ISO language code (default = 'en') (string). Languages should belong of the model supported languages. However, we don't enforce it.\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n        if bypass_tokenizer:\n            text = text.split()\n        elif lang not in self.lang_with_custom_tokenizer:\n            text = self.moses_pipeline(text, lang=lang)\n            # TODO: make sure we are using `xlm-mlm-enro-1024`, since XLM-100 doesn't have this step\n            if lang == \"ro\":\n                text = romanian_preprocessing(text)\n            text = self.moses_tokenize(text, lang=lang)\n        elif lang == \"th\":\n            text = self.moses_pipeline(text, lang=lang)\n            try:\n                if \"pythainlp\" not in sys.modules:\n                    from pythainlp.tokenize import word_tokenize as th_word_tokenize\n                else:\n                    th_word_tokenize = sys.modules[\"pythainlp\"].word_tokenize\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install PyThaiNLP (https://github.com/PyThaiNLP/pythainlp) with the following steps\"\n                )\n                logger.error(\"1. 
pip install pythainlp\")\n                raise\n            text = th_word_tokenize(text)\n        elif lang == \"zh\":\n            try:\n                if \"jieba\" not in sys.modules:\n                    import jieba\n                else:\n                    jieba = sys.modules[\"jieba\"]\n            except (AttributeError, ImportError):\n                logger.error(\"Make sure you install Jieba (https://github.com/fxsjy/jieba) with the following steps\")\n                logger.error(\"1. pip install jieba\")\n                raise\n            text = \" \".join(jieba.cut(text))\n            text = self.moses_pipeline(text, lang=lang)\n            text = text.split()\n        elif lang == \"ja\":\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.ja_tokenize(text)\n        else:\n            raise ValueError(\"It should not reach here\")\n\n        if self.do_lowercase_and_remove_accent and not bypass_tokenizer:\n            text = lowercase_and_remove_accent(text)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n\n        \"\"\"\n        bos = [self.bos_token_id]\n        sep = [self.sep_token_id]\n\n        if token_ids_1 is None:\n            return bos + token_ids_0 + sep\n        return bos + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0,))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLM sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != 
token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for XLM-RoBERTa model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-sentencepiece.bpe.model\",\n        \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-roberta-base\": 512,\n    \"xlm-roberta-large\": 512,\n    \"xlm-roberta-large-finetuned-conll02-dutch\": 512,\n    \"xlm-roberta-large-finetuned-conll02-spanish\": 512,\n    \"xlm-roberta-large-finetuned-conll03-english\": 512,\n    \"xlm-roberta-large-finetuned-conll03-german\": 512,\n}\n\n\nclass XLMRobertaTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. 
note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n\n        # Original fairseq vocab and spm vocab must be \"aligned\":\n        # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9\n        # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----\n        # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' 
| '▁' | 's'   | '▁de' | '-'\n        # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'\n\n        # Mimic fairseq token-to-id alignment for the first 4 token\n        self.fairseq_tokens_to_ids = {\"<s>\": 0, \"<pad>\": 1, \"</s>\": 2, \"<unk>\": 3}\n\n        # The first \"real\" token \",\" has position 4 in the original fairseq vocab and position 3 in the spm vocab\n        self.fairseq_offset = 1\n\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + self.fairseq_offset\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM-R sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        XLM-R does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        spm_id = self.sp_model.PieceToId(token)\n\n        # Need to return unknown token if the SP model returned 0\n        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/tokenization_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for XLNet model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model\",\n        \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlnet-base-cased\": None,\n    \"xlnet-large-cased\": None,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n# Segments (not really needed)\nSEG_ID_A = 0\nSEG_ID_B = 1\nSEG_ID_CLS = 2\nSEG_ID_SEP = 3\nSEG_ID_PAD = 4\n\n\nclass XLNetTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"<sep>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"<cls>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<eop>\", \"<eod>\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    padding_side = \"left\"\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        sep_token=\"<sep>\",\n        pad_token=\"<pad>\",\n        cls_token=\"<cls>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<eop>\", \"<eod>\"],\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        self._pad_token_type_id = 3\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = 
self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. \"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An XLNet sequence has the following format:\n\n        - single sequence: ``X <sep> <cls>``\n        - pair of sequences: ``A <sep> B <sep> <cls>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return token_ids_0 + sep + cls\n        return token_ids_0 + sep + token_ids_1 + sep + cls\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1, 1]\n        return ([0] * len(token_ids_0)) + [1, 1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLNet sequence pair mask has the following format:\n        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2\n        | first sequence    | second sequence     | CLS segment ID\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls_segment_id = [2]\n\n        if token_ids_1 is None:\n            return len(token_ids_0 + sep) * [0] + cls_segment_id\n        return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/trainer.py",
    "content": "import json\nimport logging\nimport math\nimport os\nimport random\nimport re\nimport shutil\nfrom contextlib import contextmanager\nfrom pathlib import Path\nfrom typing import Callable, Dict, List, Optional, Tuple\nimport time\nimport numpy as np\nimport torch\nfrom packaging import version\nfrom torch import nn\nfrom torch.utils.data.dataloader import DataLoader\nfrom torch.utils.data.dataset import Dataset\nfrom torch.utils.data.distributed import DistributedSampler\nfrom torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler\nfrom tqdm.auto import tqdm, trange\n\nfrom .data.data_collator import DataCollator, DefaultDataCollator\nfrom transformers.modeling_utils import PreTrainedModel\nfrom .optimization import AdamW\nfrom transformers import get_polynomial_decay_schedule_with_warmup#需要新版才有\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput\nfrom .training_args import TrainingArguments, is_tpu_available\n\n\ntry:\n    from apex import amp\n\n    _has_apex = True\nexcept ImportError:\n    _has_apex = False\n\n\ndef is_apex_available():\n    return _has_apex\n\n\nif is_tpu_available():\n    import torch_xla.core.xla_model as xm\n    import torch_xla.debug.metrics as met\n    import torch_xla.distributed.parallel_loader as pl\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\n\n    _has_tensorboard = True\nexcept ImportError:\n    try:\n        from tensorboardX import SummaryWriter\n\n        _has_tensorboard = True\n    except ImportError:\n        _has_tensorboard = False\n\n\ndef is_tensorboard_available():\n    return _has_tensorboard\n\n\ntry:\n    import wandb\n\n    wandb.ensure_configured()\n    if wandb.api.api_key is None:\n        _has_wandb = False\n        wandb.termwarn(\"W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.\")\n    else:\n        _has_wandb = False if os.getenv(\"WANDB_DISABLED\") else True\nexcept ImportError:\n    _has_wandb = False\n\n\ndef is_wandb_available():\n    return _has_wandb\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef set_seed(seed: int):\n    random.seed(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    torch.cuda.manual_seed_all(seed)\n    # ^^ safe to call this function even if cuda is not available\n\n\n@contextmanager\ndef torch_distributed_zero_first(local_rank: int):\n    \"\"\"\n    Decorator to make all processes in distributed training wait for each local_master to do something.\n    \"\"\"\n    if local_rank not in [-1, 0]:\n        torch.distributed.barrier()\n    yield\n    if local_rank == 0:\n        torch.distributed.barrier()\n\n\nclass SequentialDistributedSampler(Sampler):\n    \"\"\"\n    Distributed Sampler that subsamples indicies sequentially,\n    making it easier to collate all results at the end.\n\n    Even though we only use this sampler for eval and predict (no training),\n    which means that the model params won't have to be synced (i.e. 
will not hang\n    for synchronization even if varied number of forward passes), we still add extra\n    samples to the sampler to make it evenly divisible (like in `DistributedSampler`)\n    to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.\n    \"\"\"\n\n    def __init__(self, dataset, num_replicas=None, rank=None):\n        if num_replicas is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            num_replicas = torch.distributed.get_world_size()\n        if rank is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            rank = torch.distributed.get_rank()\n        self.dataset = dataset\n        self.num_replicas = num_replicas\n        self.rank = rank\n        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))\n        self.total_size = self.num_samples * self.num_replicas\n\n    def __iter__(self):\n        indices = list(range(len(self.dataset)))\n\n        # add extra samples to make it evenly divisible\n        indices += indices[: (self.total_size - len(indices))]\n        assert len(indices) == self.total_size\n\n        # subsample\n        indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]\n        assert len(indices) == self.num_samples\n\n        return iter(indices)\n\n    def __len__(self):\n        return self.num_samples\n\n\ndef get_tpu_sampler(dataset: Dataset):\n    if xm.xrt_world_size() <= 1:\n        return RandomSampler(dataset)\n    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())\n\n\nclass Trainer:\n    \"\"\"\n    Trainer is a simple but feature-complete training and eval loop for PyTorch,\n    optimized for Transformers.\n    \"\"\"\n\n    model: PreTrainedModel\n    args: TrainingArguments\n    train_dataset: Optional[Dataset]\n    eval_dataset: Optional[Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n    tb_writer: Optional[\"SummaryWriter\"] = None\n    optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None\n    global_step: Optional[int] = None\n    epoch: Optional[float] = None\n\n    def __init__(\n        self,\n        model: PreTrainedModel,\n        args: TrainingArguments,\n        train_dataLoader: Optional[DataLoader] = None,\n        eval_dataLoader: Optional[DataLoader] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n        tb_writer: Optional[\"SummaryWriter\"] = None,\n        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,\n    ):\n        \"\"\"\n        Trainer is a simple but feature-complete training and eval loop for PyTorch,\n        optimized for Transformers.\n\n        Args:\n            prediction_loss_only:\n                (Optional) in evaluation and prediction, only return the loss\n        \"\"\"\n        self.model = model.to(args.device)\n        self.args = args\n\n        self.train_dataLoader = train_dataLoader\n        self.eval_dataLoader = eval_dataLoader\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.optimizers = optimizers\n        if tb_writer is not None:\n            self.tb_writer = tb_writer\n        
elif is_tensorboard_available() and self.is_world_master():\n            self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)\n        if not is_tensorboard_available():\n            logger.warning(\n                \"You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.\"\n            )\n        if is_wandb_available():\n            self._setup_wandb()\n        else:\n            logger.info(\n                \"You are instantiating a Trainer but W&B is not installed. To use wandb logging, \"\n                \"run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.\"\n            )\n        set_seed(self.args.seed)\n        # Create output directory if needed\n        if self.is_world_master():\n            os.makedirs(self.args.output_dir, exist_ok=True)\n        if is_tpu_available():\n            # Set an xla_device flag on the model's config.\n            # We'll find a more elegant and not need to do this in the future.\n            self.model.config.xla_device = True\n\n    def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:\n        # We use the same batch_size as for eval.\n        if is_tpu_available():\n            sampler = SequentialDistributedSampler(\n                test_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()\n            )\n        elif self.args.local_rank != -1:\n            sampler = SequentialDistributedSampler(test_dataset)\n        else:\n            sampler = SequentialSampler(test_dataset)\n\n        data_loader = DataLoader(\n            test_dataset,\n            sampler=sampler,\n            batch_size=self.args.eval_batch_size,\n\n        )\n\n        return data_loader\n\n    def get_optimizers(\n        self, num_training_steps: int\n    ) -> Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]:\n        \"\"\"\n        Setup the optimizer and the learning rate scheduler.\n\n        We provide a reasonable default that works well.\n        If you want to use something else, you can pass a tuple in the Trainer's init,\n        or override this method in a subclass.\n        \"\"\"\n        if self.optimizers is not None:\n            return self.optimizers\n        # Prepare optimizer and schedule (linear warmup and decay)\n        no_decay = [\"bias\", \"LayerNorm.weight\"]\n        optimizer_grouped_parameters = [\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],\n                \"weight_decay\": self.args.weight_decay,\n            },\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],\n                \"weight_decay\": 0.0,\n            },\n        ]\n\n        optimizer = AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon)\n        scheduler = get_polynomial_decay_schedule_with_warmup(\n            optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=num_training_steps,lr_end=self.args.lr_end\n        )\n        return optimizer, scheduler\n\n    def _setup_wandb(self):\n        \"\"\"\n        Setup the optional Weights & Biases (`wandb`) integration.\n\n        One can override this method to customize the setup if needed.  
Find more information at https://docs.wandb.com/huggingface\n        You can also override the following environment variables:\n\n        Environment:\n            WANDB_WATCH:\n                (Optional, [\"gradients\", \"all\", \"false\"]) \"gradients\" by default, set to \"false\" to disable gradient logging\n                or \"all\" to log gradients and parameters\n            WANDB_PROJECT:\n                (Optional): str - \"huggingface\" by default, set this to a custom string to store results in a different project\n            WANDB_DISABLED:\n                (Optional): boolean - defaults to false, set to \"true\" to disable wandb entirely\n        \"\"\"\n        logger.info('Automatic Weights & Biases logging enabled, to disable set os.environ[\"WANDB_DISABLED\"] = \"true\"')\n        wandb.init(project=os.getenv(\"WANDB_PROJECT\", \"huggingface\"), config=vars(self.args))\n        # keep track of model topology and gradients\n        if os.getenv(\"WANDB_WATCH\") != \"false\":\n            wandb.watch(\n                self.model, log=os.getenv(\"WANDB_WATCH\", \"gradients\"), log_freq=max(100, self.args.logging_steps)\n            )\n\n    def num_examples(self, dataloader: DataLoader) -> int:\n        \"\"\"\n        Helper to get num of examples from a DataLoader, by accessing its Dataset.\n        \"\"\"\n        return len(dataloader.dataset)\n\n    def train(self, model_path: Optional[str] = None):\n        \"\"\"\n        Main training entry point.\n\n        Args:\n            model_path:\n                (Optional) Local path to model if model to train has been instantiated from a local path\n                If present, we will try reloading the optimizer/scheduler states from there.\n        \"\"\"\n        train_dataloader = self.train_dataLoader\n        if self.args.max_steps > 0:\n            t_total = self.args.max_steps\n            num_train_epochs = (\n                self.args.max_steps // (len(train_dataloader) // self.args.gradient_accumulation_steps) + 1\n            )\n        else:\n            t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)\n            num_train_epochs = self.args.num_train_epochs\n\n        optimizer, scheduler = self.get_optimizers(num_training_steps=t_total)\n\n        # Check if saved optimizer or scheduler states exist\n        if (\n            model_path is not None\n            and os.path.isfile(os.path.join(model_path, \"optimizer.pt\"))\n            and os.path.isfile(os.path.join(model_path, \"scheduler.pt\"))\n        ):\n            # Load in optimizer and scheduler states\n            optimizer.load_state_dict(\n                torch.load(os.path.join(model_path, \"optimizer.pt\"), map_location=self.args.device)\n            )\n            scheduler.load_state_dict(torch.load(os.path.join(model_path, \"scheduler.pt\")))\n\n        model = self.model\n        if self.args.fp16:\n            if not is_apex_available():\n                raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n            model, optimizer = amp.initialize(model, optimizer, opt_level=self.args.fp16_opt_level)\n\n        # multi-gpu training (should be after apex fp16 initialization)\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n\n        # Distributed training (should be after apex fp16 initialization)\n        if self.args.local_rank != -1:\n            model = 
torch.nn.parallel.DistributedDataParallel(\n                model,\n                device_ids=[self.args.local_rank],\n                output_device=self.args.local_rank,\n                find_unused_parameters=True,\n            )\n\n        if self.tb_writer is not None:\n            self.tb_writer.add_text(\"args\", self.args.to_json_string())\n            self.tb_writer.add_hparams(self.args.to_sanitized_dict(), metric_dict={})\n\n        # Train!\n        if is_tpu_available():\n            total_train_batch_size = self.args.train_batch_size * xm.xrt_world_size()\n        else:\n            total_train_batch_size = (\n                self.args.train_batch_size\n                * self.args.gradient_accumulation_steps\n                * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)\n            )\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_examples(train_dataloader))\n        logger.info(\"  Num Epochs = %d\", num_train_epochs)\n        logger.info(\"  Instantaneous batch size per device = %d\", self.args.per_device_train_batch_size)\n        logger.info(\"  Total train batch size (w. parallel, distributed & accumulation) = %d\", total_train_batch_size)\n        logger.info(\"  Gradient Accumulation steps = %d\", self.args.gradient_accumulation_steps)\n        logger.info(\"  Total optimization steps = %d\", t_total)\n\n        self.global_step = 0\n        self.epoch = 0\n        epochs_trained = 0\n        steps_trained_in_current_epoch = 0\n        # Check if continuing training from a checkpoint\n        if model_path is not None:\n            # set global_step to global_step of last saved checkpoint from model path\n            try:\n                self.global_step = int(model_path.split(\"-\")[-1].split(\"/\")[0])\n                epochs_trained = self.global_step // (len(train_dataloader) // self.args.gradient_accumulation_steps)\n                steps_trained_in_current_epoch = self.global_step % (\n                    len(train_dataloader) // self.args.gradient_accumulation_steps\n                )\n\n                logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n                logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n                logger.info(\"  Continuing training from global step %d\", self.global_step)\n                logger.info(\"  Will skip the first %d steps in the first epoch\", steps_trained_in_current_epoch)\n            except ValueError:\n                self.global_step = 0\n                logger.info(\"  Starting fine-tuning.\")\n\n        tr_loss = 0.0\n        logging_loss = 0.0\n        tqdmLoss=0#进度条的loss用滑动平均显示\n        beta_exp=1\n        model.zero_grad()\n        train_iterator = trange(\n            epochs_trained, int(num_train_epochs), desc=\"Epoch\", disable=True\n        )\n        for epoch in train_iterator:\n            last=time.time()\n            if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):\n                train_dataloader.sampler.set_epoch(epoch)\n\n            if is_tpu_available():\n                parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(\n                    self.args.device\n                )\n                epoch_iterator = tqdm(parallel_loader, desc=\"Iteration\", disable=not self.is_local_master())\n            else:\n                epoch_iterator = 
tqdm(train_dataloader, desc=\"Iteration\", disable=True,ncols=70)#固定下长度，不然要换行\n\n            for step, inputs in enumerate(epoch_iterator):\n\n                # Skip past any already trained steps if resuming training\n                if steps_trained_in_current_epoch > 0:\n                    steps_trained_in_current_epoch -= 1\n                    continue\n                now_loss=self._training_step(model, inputs, optimizer)\n                tr_loss += now_loss\n                #丰富进度条\n                tqdmLoss=tqdmLoss*0.99+(1-0.99)*now_loss#滑动平均下\n                beta_exp*=0.99#校正\n\n                epoch_iterator.set_description_str(f\"epoch：{epoch+1}\")\n                epoch_iterator.set_postfix_str(f\"loss：{round(tqdmLoss/(1-beta_exp),4)}\")\n                if (step + 1) % self.args.gradient_accumulation_steps == 0 or (\n                    # last step in epoch but step is always smaller than gradient_accumulation_steps\n                    len(epoch_iterator) <= self.args.gradient_accumulation_steps\n                    and (step + 1) == len(epoch_iterator)\n                ):\n                    if self.args.fp16:\n                        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), self.args.max_grad_norm)\n                    else:\n                        torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)\n\n                    if is_tpu_available():\n                        xm.optimizer_step(optimizer)\n                    else:\n                        optimizer.step()\n\n                    scheduler.step()\n                    model.zero_grad()\n                    self.global_step += 1\n                    self.epoch = epoch + (step + 1) / len(epoch_iterator)\n\n                    if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (\n                        self.global_step == 1 and self.args.logging_first_step\n                    ):\n                        logs: Dict[str, float] = {}\n                        logs[\"loss\"] = (tr_loss - logging_loss) / self.args.logging_steps\n                        # backward compatibility for pytorch schedulers\n                        logs[\"learning_rate\"] = (\n                            scheduler.get_last_lr()[0]\n                            if version.parse(torch.__version__) >= version.parse(\"1.4\")\n                            else scheduler.get_lr()[0]\n                        )\n                        logging_loss = tr_loss\n                        print()#log前要换行，不然和进度条挤在一起\n                        self._log(logs)\n                        print()\n                        if self.args.evaluate_during_training:\n                            self.evaluate()\n\n                    if self.args.save_steps > 0 and self.global_step % self.args.save_steps==0:\n                        # In all cases (even distributed/parallel), self.model is always a reference\n                        # to the model we want to save.\n                        if hasattr(model, \"module\"):\n                            assert model.module is self.model\n                        else:\n                            assert model is self.model\n                        # Save model checkpoint\n                        output_dir = os.path.join(self.args.output_dir, f\"{PREFIX_CHECKPOINT_DIR}-{self.global_step}-epoch-{int(self.epoch)}\")\n\n                        self.save_model(output_dir)\n\n                        if self.is_world_master():\n                            
self._rotate_checkpoints()\n\n                        if is_tpu_available():\n                            xm.rendezvous(\"saving_optimizer_states\")\n                            xm.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            xm.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n                        elif self.is_world_master():\n                            torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n\n                if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                    epoch_iterator.close()\n                    break\n            print(f\"预训练第{epoch}轮耗时：\",time.time()-last)\n            if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                train_iterator.close()\n                break\n            if self.args.tpu_metrics_debug:\n                # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n                xm.master_print(met.metrics_report())\n        if self.tb_writer:\n            self.tb_writer.close()\n\n        logger.info(\"\\n\\nTraining completed. Do not forget to share your model on huggingface.co/models =)\\n\\n\")\n        return TrainOutput(self.global_step, tr_loss / self.global_step)\n\n    def _log(self, logs: Dict[str, float], iterator: Optional[tqdm] = None) -> None:\n        if self.epoch is not None:\n            logs[\"epoch\"] = self.epoch\n        if self.tb_writer:\n            for k, v in logs.items():\n                self.tb_writer.add_scalar(k, v, self.global_step)\n        if is_wandb_available():\n            wandb.log(logs, step=self.global_step)\n        output = json.dumps({**logs, **{\"step\": self.global_step}})\n        if iterator is not None:\n            iterator.write(output)\n        else:\n            print(output)\n\n    def _training_step(\n        self, model: nn.Module, inputs: Dict[str, torch.Tensor], optimizer: torch.optim.Optimizer\n    ) -> float:\n        model.train()\n        for k, v in inputs.items():\n            inputs[k] = v.to(self.args.device)\n\n        outputs = model(**inputs)\n        loss = outputs[0]  # model outputs are always tuple in transformers1 (see doc)\n\n        if self.args.n_gpu > 1:\n            loss = loss.mean()  # mean() to average on multi-gpu parallel training\n        if self.args.gradient_accumulation_steps > 1:\n            loss = loss / self.args.gradient_accumulation_steps\n\n        if self.args.fp16:\n            with amp.scale_loss(loss, optimizer) as scaled_loss:\n                scaled_loss.backward()\n        else:\n            loss.backward()\n\n        return loss.item()\n\n    def is_local_master(self) -> bool:\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=True)\n        else:\n            return self.args.local_rank in [-1, 0]\n\n    def is_world_master(self) -> bool:\n        \"\"\"\n        This will be True only in one process, even in distributed mode,\n        even when training on multiple machines.\n        \"\"\"\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=False)\n        else:\n            return self.args.local_rank == -1 or torch.distributed.get_rank() == 0\n\n    def save_model(self, output_dir: Optional[str] = None):\n        \"\"\"\n        Saving best-practices: if you use 
default names for the model,\n        you can reload it using from_pretrained().\n\n        Will only save from the world_master process (unless in TPUs).\n        \"\"\"\n\n        if is_tpu_available():\n            self._save_tpu(output_dir)\n        elif self.is_world_master():\n            self._save(output_dir)\n\n    def _save_tpu(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n        if xm.is_master_ordinal():\n            os.makedirs(output_dir, exist_ok=True)\n            torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n\n        xm.rendezvous(\"saving_checkpoint\")\n        self.model.save_pretrained(output_dir)\n\n    def _save(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        os.makedirs(output_dir, exist_ok=True)\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n        self.model.save_pretrained(output_dir)\n\n        # Good practice: save your training arguments together with the trained model\n        torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n    def _sorted_checkpoints(self, checkpoint_prefix=PREFIX_CHECKPOINT_DIR, use_mtime=False) -> List[str]:\n        ordering_and_checkpoint_path = []\n\n        glob_checkpoints = [str(x) for x in Path(self.args.output_dir).glob(f\"{checkpoint_prefix}-*\")]\n\n        for path in glob_checkpoints:\n            if use_mtime:\n                ordering_and_checkpoint_path.append((os.path.getmtime(path), path))\n            else:\n                regex_match = re.match(f\".*{checkpoint_prefix}-([0-9]+)\", path)\n                if regex_match and regex_match.groups():\n                    ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))\n\n        checkpoints_sorted = sorted(ordering_and_checkpoint_path)\n        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]\n        return checkpoints_sorted\n\n    def _rotate_checkpoints(self, use_mtime=False) -> None:\n        if self.args.save_total_limit is None or self.args.save_total_limit <= 0:\n            return\n\n        # Check if we should delete older checkpoint(s)\n        checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime)\n        if len(checkpoints_sorted) <= self.args.save_total_limit:\n            return\n\n        number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - self.args.save_total_limit)\n        checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]\n        for checkpoint in checkpoints_to_be_deleted:\n            curEpoch = checkpoint.split('-')[-1]\n            print(checkpoint,curEpoch)\n            if int(curEpoch) % 50 == 0:\n                continue\n            logger.info(\"Deleting older checkpoint [{}] 
due to args.save_total_limit\".format(checkpoint))\n            shutil.rmtree(checkpoint)\n\n    def evaluate(\n        self, eval_dataset: Optional[Dataset] = None, prediction_loss_only: Optional[bool] = None,\n    ) -> Dict[str, float]:\n        \"\"\"\n        Run evaluation and return metrics.\n\n        The calling script will be responsible for providing a method to compute metrics, as they are\n        task-dependent.\n\n        Args:\n            eval_dataset: (Optional) Pass a dataset if you wish to override\n            the one on the instance.\n        Returns:\n            A dict containing:\n                - the eval loss\n                - the potential metrics computed from the predictions\n        \"\"\"\n        eval_dataloader = self.eval_dataLoader\n\n        output = self._prediction_loop(eval_dataloader, description=\"Evaluation\")\n\n        self._log(output.metrics)\n\n        if self.args.tpu_metrics_debug:\n            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n            xm.master_print(met.metrics_report())\n\n        return output.metrics\n\n    def predict(self, test_dataset: Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        \"\"\"\n        test_dataloader = self.get_test_dataloader(test_dataset)\n\n        return self._prediction_loop(test_dataloader, description=\"Prediction\")\n\n    def _prediction_loop(\n        self, dataloader: DataLoader, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n\n        Works both with or without labels.\n        \"\"\"\n\n        prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else self.prediction_loss_only\n\n        model = self.model\n        # multi-gpu eval\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n        else:\n            model = self.model\n        # Note: in torch.distributed mode, there's no point in wrapping the model\n        # inside a DistributedDataParallel as we'll be under `no_grad` anyways.\n\n        batch_size = dataloader.batch_size\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Num examples = %d\", self.num_examples(dataloader))\n        logger.info(\"  Batch size = %d\", batch_size)\n        eval_losses: List[float] = []\n        preds: torch.Tensor = None\n        label_ids: torch.Tensor = None\n        model.eval()\n\n        if is_tpu_available():\n            dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)\n\n        for inputs in tqdm(dataloader, desc=description):\n            has_labels = any(inputs.get(k) is not None for k in [\"labels\", \"lm_labels\", \"masked_lm_labels\"])\n\n            for k, v in inputs.items():\n                inputs[k] = v.to(self.args.device)\n\n            with torch.no_grad():\n                outputs = model(**inputs)\n                if has_labels:\n                    step_eval_loss, logits = outputs[:2]\n                    eval_losses += [step_eval_loss.mean().item()]\n                else:\n                    logits = outputs[0]\n\n            if not 
prediction_loss_only:\n                if preds is None:\n                    preds = logits.detach()\n                else:\n                    preds = torch.cat((preds, logits.detach()), dim=0)\n                if inputs.get(\"labels\") is not None:\n                    if label_ids is None:\n                        label_ids = inputs[\"labels\"].detach()\n                    else:\n                        label_ids = torch.cat((label_ids, inputs[\"labels\"].detach()), dim=0)\n\n        if self.args.local_rank != -1:\n            # In distributed mode, concatenate all results from all nodes:\n            if preds is not None:\n                preds = self.distributed_concat(preds, num_total_examples=self.num_examples(dataloader))\n            if label_ids is not None:\n                label_ids = self.distributed_concat(label_ids, num_total_examples=self.num_examples(dataloader))\n        elif is_tpu_available():\n            # tpu-comment: Get all predictions and labels from all worker shards of eval dataset\n            if preds is not None:\n                preds = xm.mesh_reduce(\"eval_preds\", preds, torch.cat)\n            if label_ids is not None:\n                label_ids = xm.mesh_reduce(\"eval_label_ids\", label_ids, torch.cat)\n\n        # Finally, turn the aggregated tensors into numpy arrays.\n        if preds is not None:\n            preds = preds.cpu().numpy()\n        if label_ids is not None:\n            label_ids = label_ids.cpu().numpy()\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n        if len(eval_losses) > 0:\n            metrics[\"eval_loss\"] = np.mean(eval_losses)\n\n        # Prefix all keys with eval_\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def distributed_concat(self, tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:\n        assert self.args.local_rank != -1\n\n        output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]\n        torch.distributed.all_gather(output_tensors, tensor)\n\n        concat = torch.cat(output_tensors, dim=0)\n\n        # truncate the dummy elements added by SequentialDistributedSampler\n        output = concat[:num_total_examples]\n        return output\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/trainer_tf.py",
    "content": "\"\"\"Tensorflow trainer class.\"\"\"\n\nimport logging\nimport math\nimport os\nfrom typing import Callable, Dict, Optional\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import TFPreTrainedModel, shape_list\nfrom .optimization_tf import GradientAccumulator, create_optimizer\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFTrainer:\n    model: TFPreTrainedModel\n    args: TFTrainingArguments\n    # something similar to a PT Dataset.\n    # This is just temporary before to have\n    # a framework-agnostic approach for datasets.\n    train_dataset: Optional[tf.data.Dataset]\n    eval_dataset: Optional[tf.data.Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n\n    def __init__(\n        self,\n        model: TFPreTrainedModel,\n        args: TFTrainingArguments,\n        train_dataset: Optional[tf.data.Dataset] = None,\n        eval_dataset: Optional[tf.data.Dataset] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n    ):\n        self.model = model\n        self.args = args\n        self.train_dataset = train_dataset\n        self.eval_dataset = eval_dataset\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.gradient_accumulator = GradientAccumulator()\n\n        self._setup_training()\n\n    def _setup_training(self) -> None:\n        \"\"\"\n        Setup the different steps to train a model:\n          - check if all the data are given\n          - create the proper strategy\n          - create the features\n          - prepare the model settings\n        \"\"\"\n        self._prepare_dataset()\n\n        with self.args.strategy.scope():\n            self._create_optimizer()\n            _ = self.optimizer.iterations\n            self._set_loss_and_metric()\n            self._create_checkpoint_manager()\n            self._create_summary_writer()\n\n    def _set_loss_and_metric(self) -> None:\n        \"\"\"\n        Create the training loss and metric with their name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        try:\n            self.loss = tf.keras.losses.get(\n                {\n                    \"class_name\": self.args.loss_name,\n                    \"config\": {\"from_logits\": True, \"reduction\": tf.keras.losses.Reduction.NONE},\n                }\n            )\n        except TypeError:\n            self.loss = tf.keras.losses.get(\n                {\"class_name\": self.args.loss_name, \"config\": {\"reduction\": tf.keras.losses.Reduction.NONE}}\n            )\n\n    def _create_summary_writer(self) -> None:\n        \"\"\"\n        Create a summary writer to be able to read the logs in Tensorboard.\n        \"\"\"\n        self.writer = tf.summary.create_file_writer(self.args.logging_dir)\n\n    def _prepare_dataset(self) -> None:\n        \"\"\"\n        Prepare the training, validation and test data.\n        \"\"\"\n        if self.train_dataset is not None:\n            self.num_train_examples = self.train_dataset.reduce(tf.constant(0), lambda x, _: x + 1).numpy()\n\n            if self.args.max_steps > 0:\n                self.train_steps = self.args.max_steps\n            else:\n                self.train_steps: int = math.ceil(self.num_train_examples / self.args.train_batch_size)\n\n            self.train_dataset = (\n                self.train_dataset.cache()\n                .shuffle(self.num_train_examples)\n                .batch(self.args.train_batch_size)\n                .prefetch(tf.data.experimental.AUTOTUNE)\n            )\n\n            if self.args.max_steps > 0:\n                self.train_dataset = self.train_dataset.repeat(-1)\n\n            self.train_dataset = self.args.strategy.experimental_distribute_dataset(self.train_dataset)\n        else:\n            self.train_steps = 0\n\n        if self.eval_dataset is not None:\n            self.eval_dataset = (\n                self.eval_dataset.batch(self.args.eval_batch_size).cache().prefetch(tf.data.experimental.AUTOTUNE)\n            )\n            self.eval_dataset = self.args.strategy.experimental_distribute_dataset(self.eval_dataset)\n\n    def _create_optimizer(self) -> None:\n        \"\"\"\n        Create the training optimizer with its name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        if self.args.optimizer_name == \"adamw\":\n            self.optimizer = create_optimizer(\n                self.args.learning_rate, self.train_steps, self.args.warmup_steps, self.args.end_lr\n            )\n        else:\n            try:\n                self.optimizer = tf.keras.optimizers.get(\n                    {\n                        \"class_name\": self.args.optimizer_name,\n                        \"config\": {\"learning_rate\": self.args.learning_rate, \"epsilon\": self.args.adam_epsilon},\n                    }\n                )\n            except TypeError:\n                # This is for the case where the optimizer is not Adam-like such as SGD\n                self.optimizer = tf.keras.optimizers.get(\n                    {\"class_name\": self.args.optimizer_name, \"config\": {\"learning_rate\": self.args.learning_rate}}\n                )\n        logger.info(\"Created an/a {} optimizer\".format(self.args.optimizer_name))\n\n    def _create_checkpoint_manager(self, max_to_keep: int = 5, load_model: bool = True) -> None:\n        \"\"\"\n        Create a checkpoint manager in order to be able to make the training\n        fault-tolerant.\n        Args:\n          max_to_keep: the maximum number of checkpoints to keep in the checkpoint path.\n          load_model: if we want to start the training from the latest checkpoint.\n        \"\"\"\n        ckpt = tf.train.Checkpoint(optimizer=self.optimizer, model=self.model)\n\n        self.model.ckpt_manager = tf.train.CheckpointManager(ckpt, PREFIX_CHECKPOINT_DIR, max_to_keep=max_to_keep)\n\n        if load_model:\n            ckpt.restore(self.model.ckpt_manager.latest_checkpoint).expect_partial()\n\n    @tf.function\n    def _evaluate_steps(self, per_replica_features, per_replica_labels):\n        \"\"\"\n        One step evaluation across replica.\n        Args:\n          per_replica_features: the batched features.\n          per_replica_labels: the batched labels.\n        Returns:\n          The loss corresponding to the given batch.\n        \"\"\"\n        per_replica_loss, per_replica_logits = self.args.strategy.experimental_run_v2(\n            self._run_model, args=(per_replica_features, per_replica_labels, False)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss, per_replica_logits\n\n    def _prediction_loop(\n        self, dataset: tf.data.Dataset, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Batch size = %d\", self.args.eval_batch_size)\n\n        label_ids: np.ndarray = None\n        preds: np.ndarray = None\n\n        step: int = 1\n\n        for features, labels in dataset:\n            step = tf.convert_to_tensor(step, dtype=tf.int64)\n            loss, logits = self._evaluate_steps(features, labels)\n            loss = tf.reduce_mean(loss)\n\n            if not prediction_loss_only:\n                if self.args.n_gpu > 1:\n                    for val in logits.values:\n                        if preds is None:\n                            preds = val.numpy()\n                        
else:\n                            preds = np.append(preds, val.numpy(), axis=0)\n\n                    for val in labels.values:\n                        if label_ids is None:\n                            label_ids = val.numpy()\n                        else:\n                            label_ids = np.append(label_ids, val.numpy(), axis=0)\n                else:\n                    if preds is None:\n                        preds = logits.numpy()\n                    else:\n                        preds = np.append(preds, logits.numpy(), axis=0)\n\n                    if label_ids is None:\n                        label_ids = labels.numpy()\n                    else:\n                        label_ids = np.append(label_ids, labels.numpy(), axis=0)\n\n            step += 1\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n\n        metrics[\"eval_loss\"] = loss.numpy()\n\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def evaluate(\n        self, eval_dataset: Optional[tf.data.Dataset] = None, prediction_loss_only: Optional[bool] = None\n    ) -> Dict[str, float]:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n        \"\"\"\n        if eval_dataset is None:\n            eval_dataset = self.eval_dataset\n\n        output = self._prediction_loop(eval_dataset, description=\"Evaluation\")\n\n        return output.metrics\n\n    def train(self) -> None:\n        \"\"\"\n        Train method to train the model.\n        \"\"\"\n        if self.args.debug:\n            tf.summary.trace_on(graph=True, profiler=True)\n\n        self.gradient_accumulator.reset()\n\n        iterations = self.optimizer.iterations\n\n        if iterations.numpy() > 0:\n            logger.info(\"Start the training from the last checkpoint\")\n            start_epoch = (iterations.numpy() // self.train_steps) + 1\n        else:\n            start_epoch = 1\n\n        tf.summary.experimental.set_step(iterations)\n\n        epochs = 1 if self.args.max_steps > 0 else self.args.num_train_epochs\n\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_train_examples)\n        logger.info(\"  Num Epochs = %d\", epochs)\n        logger.info(\"  Total optimization steps = %d\", self.train_steps)\n\n        for epoch in range(start_epoch, int(epochs + 1)):\n            for training_loss in self._training_steps():\n                step = iterations.numpy()\n\n                if self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.scalar(\"loss\", training_loss, step=step)\n\n                if step == 1 and self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.trace_export(name=\"training\", step=step, profiler_outdir=self.args.logging_dir)\n\n                if self.args.evaluate_during_training and step % self.args.eval_steps == 0:\n                    logs = {}\n                    results = self.evaluate()\n\n                    for key, value in results.items():\n                        eval_key = \"eval_{}\".format(key)\n                   
     logs[eval_key] = value\n\n                    if callable(self.optimizer.learning_rate):\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate(step).numpy()\n                    else:\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate.numpy()\n\n                    logger.info(\"Epoch {} Step {} Validation Metrics {}\".format(epoch, step, logs))\n\n                    with self.writer.as_default():\n                        for k, v in logs.items():\n                            tf.summary.scalar(k, v, step=step)\n\n                if step % self.args.logging_steps == 0:\n                    logger.info(\"Epoch {} Step {} Train Loss {:.4f}\".format(epoch, step, training_loss.numpy()))\n\n                if step % self.args.save_steps == 0:\n                    ckpt_save_path = self.model.ckpt_manager.save()\n                    logger.info(\"Saving checkpoint for step {} at {}\".format(step, ckpt_save_path))\n\n                if step % self.train_steps == 0:\n                    break\n\n    def _training_steps(self):\n        \"\"\"\n        Returns a generator over training steps (i.e. parameters update).\n        \"\"\"\n        for i, loss in enumerate(self._accumulate_next_gradients()):\n            if i % self.args.gradient_accumulation_steps == 0:\n                self._apply_gradients()\n                yield loss\n\n    @tf.function\n    def _apply_gradients(self):\n        \"\"\"Applies the gradients (cross-replica).\"\"\"\n        self.args.strategy.experimental_run_v2(self._step)\n\n    def _step(self):\n        \"\"\"Applies gradients and resets accumulation.\"\"\"\n        gradient_scale = self.gradient_accumulator.step * self.args.strategy.num_replicas_in_sync\n        gradients = [\n            gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients\n        ]\n        gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]\n\n        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))\n        self.gradient_accumulator.reset()\n\n    def _accumulate_next_gradients(self):\n        \"\"\"Accumulates the gradients from the next element in dataset.\"\"\"\n        iterator = iter(self.train_dataset)\n\n        @tf.function\n        def _accumulate_next():\n            per_replica_features, per_replica_labels = next(iterator)\n\n            return self._accumulate_gradients(per_replica_features, per_replica_labels)\n\n        while True:\n            try:\n                yield _accumulate_next()\n            except tf.errors.OutOfRangeError:\n                break\n\n    def _accumulate_gradients(self, per_replica_features, per_replica_labels):\n        \"\"\"Accumulates the gradients across all the replica.\"\"\"\n        per_replica_loss = self.args.strategy.experimental_run_v2(\n            self._forward, args=(per_replica_features, per_replica_labels)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss\n\n    def _forward(self, features, labels):\n        \"\"\"Forwards a training example and accumulates the gradients.\"\"\"\n        per_example_loss, _ = self._run_model(features, labels, True)\n        gradients = 
tf.gradients(per_example_loss, self.model.trainable_variables)\n        gradients = [\n            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)\n        ]\n\n        self.gradient_accumulator(gradients)\n\n        return per_example_loss\n\n    def _run_model(self, features, labels, training):\n        \"\"\"\n        Computes the loss of the given features and labels pair.\n        Args:\n          features: the batched features.\n          labels: the batched labels.\n          training: run the model in training mode or not\n        \"\"\"\n        if self.args.mode == \"text-classification\" or self.args.mode == \"token-classification\":\n            logits = self.model(features, training=training)[0]\n        else:\n            logits = self.model(features, training=training)\n\n        if self.args.mode == \"token-classification\":\n            active_loss = tf.reshape(labels, (-1,)) != -1\n            reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)\n            labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active_loss)\n            loss = self.loss(labels, reduced_logits)\n        elif self.args.mode == \"question-answering\":\n            start_loss = self.loss(labels[\"start_position\"], logits[0])\n            end_loss = self.loss(labels[\"end_position\"], logits[1])\n            loss = (start_loss + end_loss) / 2.0\n        else:\n            loss = self.loss(labels, logits)\n\n        loss += sum(self.model.losses) * (1.0 / self.args.n_gpu)\n\n        return loss, logits\n\n    def predict(self, test_dataset: tf.data.Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        Args:\n          test_dataset: something similar to a PT Dataset. This is just\n            temporary before to have a framework-agnostic approach for datasets.\n        \"\"\"\n        test_dataset = test_dataset.batch(self.args.eval_batch_size)\n        test_dataset = self.args.strategy.experimental_distribute_dataset(test_dataset)\n\n        return self._prediction_loop(test_dataset, description=\"Prediction\")\n\n    def save_model(self) -> None:\n        \"\"\"\n        Save the pretrained model and create a Tensorflow saved model.\n        \"\"\"\n        logger.info(\"Saving model in {}\".format(self.args.output_dir))\n\n        path = os.path.join(self.args.output_dir, \"saved_model\")\n\n        logger.info(\"Saving model in {}\".format(path))\n        os.makedirs(path, exist_ok=True)\n        self.model.save_pretrained(self.args.output_dir)\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/trainer_utils.py",
    "content": "from typing import Dict, NamedTuple, Optional\n\nimport numpy as np\n\n\nclass EvalPrediction(NamedTuple):\n    \"\"\"\n    Evaluation output (always contains labels), to be used\n    to compute metrics.\n    \"\"\"\n\n    predictions: np.ndarray\n    label_ids: np.ndarray\n\n\nclass PredictionOutput(NamedTuple):\n    predictions: np.ndarray\n    label_ids: Optional[np.ndarray]\n    metrics: Optional[Dict[str, float]]\n\n\nclass TrainOutput(NamedTuple):\n    global_step: int\n    training_loss: float\n\n\nPREFIX_CHECKPOINT_DIR = \"checkpoint\"\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/training_args.py",
    "content": "import dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Any, Dict, Optional, Tuple\n\nfrom .file_utils import cached_property, is_torch_available, torch_required\n\n\nif is_torch_available():\n    import torch\n\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass TrainingArguments:\n    \"\"\"\n    TrainingArguments is the subset of the arguments we use in our example scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    output_dir: str = field(\n        metadata={\"help\": \"The output directory where the model predictions and checkpoints will be written.\"}\n    )\n    overwrite_output_dir: bool = field(\n        default=False,\n        metadata={\n            \"help\": (\n                \"Overwrite the content of the output directory.\"\n                \"Use this to continue training if output_dir points to a checkpoint directory.\"\n            )\n        },\n    )\n\n    do_train: bool = field(default=False, metadata={\"help\": \"Whether to run training.\"})\n    do_eval: bool = field(default=False, metadata={\"help\": \"Whether to run eval on the dev set.\"})\n    do_predict: bool = field(default=False, metadata={\"help\": \"Whether to run predictions on the test set.\"})\n    evaluate_during_training: bool = field(\n        default=False, metadata={\"help\": \"Run evaluation during training at each logging step.\"},\n    )\n\n    per_device_train_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for training.\"}\n    )\n    per_device_eval_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for evaluation.\"}\n    )\n\n    per_gpu_train_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_train_batch_size` is preferred. 
\"\n            \"Batch size per GPU/TPU core/CPU for training.\"\n        },\n    )\n    per_gpu_eval_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_eval_batch_size` is preferred.\"\n            \"Batch size per GPU/TPU core/CPU for evaluation.\"\n        },\n    )\n\n    gradient_accumulation_steps: int = field(\n        default=1,\n        metadata={\"help\": \"Number of updates steps to accumulate before performing a backward/update pass.\"},\n    )\n\n    learning_rate: float = field(default=5e-5, metadata={\"help\": \"The initial learning rate for Adam.\"})\n    lr_end: float = field(default=1e-5, metadata={\"help\": \"学习率最后衰减到多少.\"})\n    weight_decay: float = field(default=0.0, metadata={\"help\": \"Weight decay if we apply some.\"})\n    adam_epsilon: float = field(default=1e-8, metadata={\"help\": \"Epsilon for Adam optimizer.\"})\n    max_grad_norm: float = field(default=1.0, metadata={\"help\": \"Max gradient norm.\"})\n\n    num_train_epochs: float = field(default=3.0, metadata={\"help\": \"Total number of training epochs to perform.\"})\n    max_steps: int = field(\n        default=-1,\n        metadata={\"help\": \"If > 0: set total number of training steps to perform. Override num_train_epochs.\"},\n    )\n    warmup_steps: int = field(default=0, metadata={\"help\": \"Linear warmup over warmup_steps.\"})\n\n    logging_dir: Optional[str] = field(default=None, metadata={\"help\": \"Tensorboard log dir.\"})\n    logging_first_step: bool = field(default=False, metadata={\"help\": \"Log and eval the first global_step\"})\n    logging_steps: int = field(default=500, metadata={\"help\": \"Log every X updates steps.\"})\n    save_steps: int = field(default=500, metadata={\"help\": \"Save checkpoint every X updates steps.\"})\n    save_total_limit: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": (\n                \"Limit the total amount of checkpoints.\"\n                \"Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints\"\n            )\n        },\n    )\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Do not use CUDA even when it is available\"})\n    seed: int = field(default=42, metadata={\"help\": \"random seed for initialization\"})\n\n    fp16: bool = field(\n        default=False,\n        metadata={\"help\": \"Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit\"},\n    )\n    fp16_opt_level: str = field(\n        default=\"O1\",\n        metadata={\n            \"help\": (\n                \"For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3'].\"\n                \"See details at https://nvidia.github.io/apex/amp.html\"\n            )\n        },\n    )\n    local_rank: int = field(default=-1, metadata={\"help\": \"For distributed training: local_rank\"})\n\n    tpu_num_cores: Optional[int] = field(\n        default=None, metadata={\"help\": \"TPU: Number of TPU cores (automatically passed by launcher script)\"}\n    )\n    tpu_metrics_debug: bool = field(default=False, metadata={\"help\": \"TPU: Whether to print debug metrics\"})\n\n    @property\n    def train_batch_size(self) -> int:\n        if self.per_gpu_train_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future \"\n                \"version. 
Using `--per_device_train_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @property\n    def eval_batch_size(self) -> int:\n        if self.per_gpu_eval_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future \"\n                \"version. Using `--per_device_eval_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        elif self.local_rank == -1:\n            # if n_gpu is > 1 we'll use nn.DataParallel.\n            # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        else:\n            # Here, we'll use torch.distributed.\n            # Initializes the distributed backend which will take care of sychronizing nodes/GPUs\n            torch.distributed.init_process_group(backend=\"nccl\")\n            device = torch.device(\"cuda\", self.local_rank)\n            n_gpu = 1\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    def to_sanitized_dict(self) -> Dict[str, Any]:\n        \"\"\"\n        Sanitized serialization to use with TensorBoard’s hparams\n        \"\"\"\n        d = dataclasses.asdict(self)\n        valid_types = [bool, int, float, str]\n        if is_torch_available():\n            valid_types.append(torch.Tensor)\n        return {k: v if type(v) in valid_types else str(v) for k, v in d.items()}\n"
  },
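  {
    "path": "code/nezha-base-count3/pretrain/transformers1/training_args_example.py",
    "content": "\"\"\"Hypothetical usage sketch (NOT a file from the original repo).\n\nShows how the bundled TrainingArguments dataclass can be instantiated directly;\nthe example scripts normally fill the same fields from the command line via\nHfArgumentParser. Assumes the transformers1 package is importable (e.g. the\npretrain directory is on PYTHONPATH).\"\"\"\nfrom transformers1.training_args import TrainingArguments\n\nargs = TrainingArguments(\n    output_dir=\"./output\",          # the only required field\n    do_train=True,\n    per_device_train_batch_size=32,\n    learning_rate=5e-5,\n    lr_end=1e-5,                    # final value the learning rate decays to\n    num_train_epochs=3.0,\n    warmup_steps=1000,\n    seed=42,\n)\n\n# Serialize the full configuration for logging / reproducibility.\nprint(args.to_json_string())\n"
  },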
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/training_args_tf.py",
    "content": "import logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom .file_utils import cached_property, is_tf_available, tf_required\nfrom .training_args import TrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\nif is_tf_available():\n    import tensorflow as tf\n\n\n@dataclass\nclass TFTrainingArguments(TrainingArguments):\n    optimizer_name: str = field(\n        default=\"adam\",\n        metadata={\n            \"help\": 'Name of a Tensorflow optimizer among \"adadelta, adagrad, adam, adamax, ftrl, nadam, rmsprop, sgd, adamw\"'\n        },\n    )\n    mode: str = field(\n        default=\"text-classification\",\n        metadata={\"help\": 'Type of task, one of \"text-classification\", \"token-classification\", \"question-answering\"'},\n    )\n    loss_name: str = field(\n        default=\"SparseCategoricalCrossentropy\",\n        metadata={\n            \"help\": \"Name of a Tensorflow loss. For the list see: https://www.tensorflow.org/api_docs/python/tf/keras/losses\"\n        },\n    )\n    tpu_name: str = field(\n        default=None, metadata={\"help\": \"Name of TPU\"},\n    )\n    end_lr: float = field(\n        default=0, metadata={\"help\": \"End learning rate for optimizer\"},\n    )\n    eval_steps: int = field(default=1000, metadata={\"help\": \"Run an evaluation every X steps.\"})\n    debug: bool = field(\n        default=False, metadata={\"help\": \"Activate the trace to record computation graphs and profiling information\"}\n    )\n\n    @cached_property\n    @tf_required\n    def _setup_strategy(self) -> Tuple[\"tf.distribute.Strategy\", int]:\n        logger.info(\"Tensorflow: setting up strategy\")\n        gpus = tf.config.list_physical_devices(\"GPU\")\n\n        if self.no_cuda:\n            strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n        else:\n            try:\n                if self.tpu_name:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(self.tpu_name)\n                else:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()\n            except ValueError:\n                tpu = None\n\n            if tpu:\n                tf.config.experimental_connect_to_cluster(tpu)\n                tf.tpu.experimental.initialize_tpu_system(tpu)\n\n                strategy = tf.distribute.experimental.TPUStrategy(tpu)\n            elif len(gpus) == 0:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n            elif len(gpus) == 1:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/gpu:0\")\n            elif len(gpus) > 1:\n                # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n                strategy = tf.distribute.MirroredStrategy()\n            else:\n                raise ValueError(\"Cannot find the proper strategy please check your environment properties.\")\n\n        return strategy\n\n    @property\n    @tf_required\n    def strategy(self) -> \"tf.distribute.Strategy\":\n        return self._setup_strategy\n\n    @property\n    @tf_required\n    def n_gpu(self) -> int:\n        return self._setup_strategy.num_replicas_in_sync\n"
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/try.py",
    "content": "from transformers import TFAlbertForMaskedLM, TFAlbertModel, TFAlbertForSequenceClassification, AlbertForMaskedLM\nimport os\n\ncheckpoint = \"albert-base-v1\"\n\nmodel = AlbertForMaskedLM.from_pretrained(checkpoint)\n\nif not os.path.exists(\"~/saved/\" + checkpoint):\n    os.makedirs(\"~/saved/\" + checkpoint)\n    \n\nmodel.save_pretrained(\"~/saved/\" + checkpoint)\nmodel = TFAlbertForMaskedLM.from_pretrained('~/saved/' + checkpoint, from_pt=True)\nmodel.save_pretrained(\"~/saved/\" + checkpoint)\nmodel = TFAlbertModel.from_pretrained('~/saved/' + checkpoint)\nmodel = TFAlbertForMaskedLM.from_pretrained('~/saved/' + checkpoint)\nmodel = TFAlbertForSequenceClassification.from_pretrained('~/saved/' + checkpoint)\n\n\nprint(\"nice model\") "
  },
  {
    "path": "code/nezha-base-count3/pretrain/transformers1/utils_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\ndef prepare_encoder_decoder_model_kwargs(**kwargs):\n    \"\"\" Prepare the encoder and decoder's keyword arguments.\n\n    Keyword arguments come in 3 flavors:\n    - encoder-specific (prefixed by `encoder_`)\n    - decoder-specific (prefixed by `decoder_`)\n    - those that apply to the model as whole.\n\n    We let the specific kwargs override the common ones in case of\n    conflict.\n    \"\"\"\n\n    kwargs_common = {\n        argument: value\n        for argument, value in kwargs.items()\n        if not argument.startswith(\"encoder_\") and not argument.startswith(\"decoder_\")\n    }\n    if \"input_ids\" in kwargs_common:\n        kwargs[\"encoder_input_ids\"] = kwargs_common.pop(\"input_ids\")\n\n    decoder_kwargs = kwargs_common.copy()\n    encoder_kwargs = kwargs_common.copy()\n    encoder_kwargs.update(\n        {argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")}\n    )\n    decoder_kwargs.update(\n        {argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")}\n    )\n    decoder_kwargs[\"encoder_attention_mask\"] = encoder_kwargs.get(\"attention_mask\", None)\n    return encoder_kwargs, decoder_kwargs\n"
  },
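  {
    "path": "code/nezha-base-count3/pretrain/transformers1/utils_encoder_decoder_example.py",
    "content": "\"\"\"Hypothetical usage sketch (NOT a file from the original repo).\n\nDemonstrates how prepare_encoder_decoder_model_kwargs routes keyword arguments:\ncommon kwargs go to both sides, encoder_/decoder_-prefixed kwargs have the\nprefix stripped and override the common ones, and the encoder's attention_mask\nis forwarded to the decoder as encoder_attention_mask. Assumes the script is\nrun from this directory so the sibling module imports directly.\"\"\"\nfrom utils_encoder_decoder import prepare_encoder_decoder_model_kwargs\n\nencoder_kwargs, decoder_kwargs = prepare_encoder_decoder_model_kwargs(\n    input_ids=[[1, 2, 3]],       # common: renamed to encoder_input_ids internally\n    attention_mask=[[1, 1, 1]],  # common: copied to both encoder and decoder kwargs\n    decoder_input_ids=[[4, 5]],  # decoder-specific: prefix stripped, overrides common input_ids\n)\n\nprint(sorted(encoder_kwargs))  # ['attention_mask', 'input_ids']\nprint(sorted(decoder_kwargs))  # ['attention_mask', 'encoder_attention_mask', 'input_ids']\n"
  },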
  {
    "path": "code/nezha-base-count5/finetuning/.ipynb_checkpoints/PyTorch_Bert-Squad_OnnxRuntime_GPU-checkpoint.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright (c) Microsoft Corporation. All rights reserved.  \\n\",\n    \"Licensed under the MIT License.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Inference PyTorch Bert Model with ONNX Runtime on GPU\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this tutorial, you'll learn how to load a Bert model from PyTorch, convert it to ONNX, and inference it for high performance using ONNX Runtime and NVIDIA GPU. In the following sections, we are going to use the Bert model trained with Stanford Question Answering Dataset (SQuAD) dataset as an example. Bert SQuAD model is used in question answering scenarios, where the answer to every question is a segment of text from the corresponding reading passage, or the question might be unanswerable.\\n\",\n    \"\\n\",\n    \"This notebook is for GPU inference. For CPU inference, please look at another notebook [Inference PyTorch Bert Model with ONNX Runtime on CPU](PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 0. Prerequisites ##\\n\",\n    \"It requires your machine to have a GPU, and a python environment with [PyTorch](https://pytorch.org/) installed before running this notebook.\\n\",\n    \"\\n\",\n    \"#### GPU Environment Setup using AnaConda\\n\",\n    \"\\n\",\n    \"First, we install [AnaConda](https://www.anaconda.com/distribution/) in a target machine and open an AnaConda prompt window when it is done. Then run the following commands to create a conda environment. This notebook is tested with PyTorch 1.5.0 and OnnxRuntime 1.3.0.\\n\",\n    \"\\n\",\n    \"```console\\n\",\n    \"conda create -n gpu_env python=3.7\\n\",\n    \"conda activate gpu_env\\n\",\n    \"conda install pytorch torchvision cudatoolkit=10.1 -c pytorch\\n\",\n    \"conda install -c anaconda ipykernel\\n\",\n    \"conda install -c conda-forge ipywidgets\\n\",\n    \"python -m ipykernel install --user --name=gpu_env_py37\\n\",\n    \"jupyter notebook\\n\",\n    \"```\\n\",\n    \"Finally, launch Jupyter Notebook and you can choose gpu_env_py37 as kernel to run this notebook.\\n\",\n    \"\\n\",\n    \"Onnxruntime-gpu need specified version of CUDA and cuDNN. You can find the corresponding version in [requirements](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements). 
If the version is different from above cudatoolkit version, you have to install them separately, and add their bin directories to PATH environment variable (See [CUDA and cuDNN Path](#CUDA-and-cuDNN-Path) below).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\u001b[33mWARNING: Skipping onnxruntime-gpu as it is not installed.\\u001b[0m\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import sys\\n\",\n    \"!{sys.executable} -m pip uninstall --quiet --yes onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet onnxruntime-gpu\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade transformers\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxconverter_common\\n\",\n    \"!{sys.executable} -m pip install --quiet --upgrade onnxruntime-tools\\n\",\n    \"!{sys.executable} -m pip install --quiet wget netron pandas\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 1. Load Pretrained Bert model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We begin by downloading the SQuAD data file and store them in the specified location. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"cache_dir = \\\"./squad\\\"\\n\",\n    \"if not os.path.exists(cache_dir):\\n\",\n    \"    os.makedirs(cache_dir)\\n\",\n    \"\\n\",\n    \"predict_file_url = \\\"https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\\\"\\n\",\n    \"predict_file = os.path.join(cache_dir, \\\"dev-v1.1.json\\\")\\n\",\n    \"if not os.path.exists(predict_file):\\n\",\n    \"    import wget\\n\",\n    \"    print(\\\"Start downloading predict file.\\\")\\n\",\n    \"    wget.download(predict_file_url, predict_file)\\n\",\n    \"    print(\\\"Predict file downloaded.\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's first define some constant variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Whether allow overwriting existing ONNX model and download the latest script from GitHub\\n\",\n    \"enable_overwrite = True\\n\",\n    \"\\n\",\n    \"# Total samples to inference, so that we can get average latency\\n\",\n    \"total_samples = 1000\\n\",\n    \"\\n\",\n    \"# ONNX opset version\\n\",\n    \"opset_version=11\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Specify some model configuration variables.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# For fine-tuned large model, the model name is \\\"bert-large-uncased-whole-word-masking-finetuned-squad\\\". Here we use bert-base for demo.\\n\",\n    \"model_name_or_path = \\\"bert-base-cased\\\"\\n\",\n    \"max_seq_length = 128\\n\",\n    \"doc_stride = 128\\n\",\n    \"max_query_length = 64\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Start to load model from pretrained. This step could take a few minutes. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100%|██████████| 48/48 [00:04<00:00, 11.28it/s]\\n\",\n      \"convert squad examples to features: 100%|██████████| 1000/1000 [00:09<00:00, 102.15it/s]\\n\",\n      \"add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 161306.98it/s]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# The following code is adapted from HuggingFace transformers\\n\",\n    \"# https://github.com/huggingface/transformers/blob/master/examples/run_squad.py\\n\",\n    \"\\n\",\n    \"from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"\\n\",\n    \"# Load pretrained model and tokenizer\\n\",\n    \"config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)\\n\",\n    \"config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)\\n\",\n    \"tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)\\n\",\n    \"model = model_class.from_pretrained(model_name_or_path,\\n\",\n    \"                                    from_tf=False,\\n\",\n    \"                                    config=config,\\n\",\n    \"                                    cache_dir=cache_dir)\\n\",\n    \"# load some examples\\n\",\n    \"from transformers.data.processors.squad import SquadV1Processor\\n\",\n    \"\\n\",\n    \"processor = SquadV1Processor()\\n\",\n    \"examples = processor.get_dev_examples(None, filename=predict_file)\\n\",\n    \"\\n\",\n    \"from transformers import squad_convert_examples_to_features\\n\",\n    \"features, dataset = squad_convert_examples_to_features( \\n\",\n    \"            examples=examples[:total_samples], # convert enough examples for this notebook\\n\",\n    \"            tokenizer=tokenizer,\\n\",\n    \"            max_seq_length=max_seq_length,\\n\",\n    \"            doc_stride=doc_stride,\\n\",\n    \"            max_query_length=max_query_length,\\n\",\n    \"            is_training=False,\\n\",\n    \"            return_dataset='pt'\\n\",\n    \"        )\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2. 
Export the loaded model ##\\n\",\n    \"Once the model is loaded, we can export the loaded PyTorch model to ONNX.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Model exported at  ./onnx/bert-base-cased-squad_opset11.onnx\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"output_dir = \\\"./onnx\\\"\\n\",\n    \"if not os.path.exists(output_dir):\\n\",\n    \"    os.makedirs(output_dir)   \\n\",\n    \"export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))\\n\",\n    \"\\n\",\n    \"import torch\\n\",\n    \"use_gpu = torch.cuda.is_available()\\n\",\n    \"device = torch.device(\\\"cuda\\\" if use_gpu else \\\"cpu\\\")\\n\",\n    \"\\n\",\n    \"# Get the first example data to run the model and export it to ONNX\\n\",\n    \"data = dataset[0]\\n\",\n    \"inputs = {\\n\",\n    \"    'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"# Set model to inference mode, which is required before exporting the model because some operators behave differently in \\n\",\n    \"# inference and training mode.\\n\",\n    \"model.eval()\\n\",\n    \"model.to(device)\\n\",\n    \"\\n\",\n    \"if enable_overwrite or not os.path.exists(export_model_path):\\n\",\n    \"    with torch.no_grad():\\n\",\n    \"        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\\n\",\n    \"        torch.onnx.export(model,                                            # model being run\\n\",\n    \"                          args=tuple(inputs.values()),                      # model input (or a tuple for multiple inputs)\\n\",\n    \"                          f=export_model_path,                              # where to save the model (can be a file or file-like object)\\n\",\n    \"                          opset_version=opset_version,                      # the ONNX version to export the model to\\n\",\n    \"                          do_constant_folding=True,                         # whether to execute constant folding for optimization\\n\",\n    \"                          input_names=['input_ids',                         # the model's input names\\n\",\n    \"                                       'input_mask', \\n\",\n    \"                                       'segment_ids'],\\n\",\n    \"                          output_names=['start', 'end'],                    # the model's output names\\n\",\n    \"                          dynamic_axes={'input_ids': symbolic_names,        # variable length axes\\n\",\n    \"                                        'input_mask' : symbolic_names,\\n\",\n    \"                                        'segment_ids' : symbolic_names,\\n\",\n    \"                                        'start' : symbolic_names,\\n\",\n    \"                                        'end' : symbolic_names})\\n\",\n    \"        print(\\\"Model exported at \\\", export_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3. 
PyTorch Inference ##\\n\",\n    \"Use PyTorch to evaluate an example input for comparison purpose.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"PyTorch cuda Inference time = 16.57 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.\\n\",\n    \"latency = []\\n\",\n    \"with torch.no_grad():\\n\",\n    \"    for i in range(total_samples):\\n\",\n    \"        data = dataset[i]\\n\",\n    \"        inputs = {\\n\",\n    \"            'input_ids':      data[0].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'attention_mask': data[1].to(device).reshape(1, max_seq_length),\\n\",\n    \"            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)\\n\",\n    \"        }\\n\",\n    \"        start = time.time()\\n\",\n    \"        outputs = model(**inputs)\\n\",\n    \"        latency.append(time.time() - start)\\n\",\n    \"print(\\\"PyTorch {} Inference time = {} ms\\\".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 4. Inference ONNX Model with ONNX Runtime ##\\n\",\n    \"\\n\",\n    \"### CUDA and cuDNN Path\\n\",\n    \"onnxruntime-gpu has dependency on [CUDA](https://developer.nvidia.com/cuda-downloads) and [cuDNN](https://developer.nvidia.com/cudnn):\\n\",\n    \"\\n\",\n    \"* [onnxruntime-gpu v1.3.0](https://github.com/microsoft/onnxruntime/tree/rel-1.3.0#system-requirements) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"* [onnxruntime-gpu v1.2.0](https://github.com/microsoft/onnxruntime/releases/tag/v1.2.0) requires CUDA Runtime 10.1 and CUDNN 7.6.5.\\n\",\n    \"\\n\",\n    \"During installing PyTorch 1.5, we installed cudatoolkit 10.1.243 in this conda environment. That shall be good for onnxruntime-gpu 1.3.0 in Jupyter Notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Change to True when onnxruntime (like onnxruntime-gpu 1.0.0 ~ 1.1.2) cannot be imported.\\n\",\n    \"add_cuda_path = False\\n\",\n    \"\\n\",\n    \"if add_cuda_path:\\n\",\n    \"    # Add path of CUDA 10.0 and CUDNN 7.6 for onnxruntime-gpu 1.0.0 ~ 1.1.2\\n\",\n    \"    cuda_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    cudnn_dir = 'D:/NVidia/CUDA/v10.1/bin'\\n\",\n    \"    if not (os.path.exists(cuda_dir) and os.path.exists(cudnn_dir)):\\n\",\n    \"        raise ValueError(\\\"Please specify correct path for CUDA and cuDNN. Otherwise onnxruntime cannot be imported.\\\")\\n\",\n    \"    else:\\n\",\n    \"        if cuda_dir == cudnn_dir:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + os.environ[\\\"PATH\\\"]\\n\",\n    \"        else:\\n\",\n    \"            os.environ[\\\"PATH\\\"] = cuda_dir + ';' + cudnn_dir + ';' + os.environ[\\\"PATH\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### OpenMP Environment Variable\\n\",\n    \"\\n\",\n    \"OpenMP environment variables are optional for GPU inference of standard Bert model. It has little performance impact on Bert model since most nodes are executed in GPU. 
\\n\",\n    \"\\n\",\n    \"You can find the best setting based on [Performance Test Tool](#Performance-Test-Tool) result in later part of this notebook.\\n\",\n    \"\\n\",\n    \"**Attention: Setting environment variables shall be done before importing onnxruntime**. Otherwise, they might not take effect.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Optional. You can change them according to Performance Test Tool result.\\n\",\n    \"#os.environ[\\\"OMP_NUM_THREADS\\\"] = '1'\\n\",\n    \"#os.environ[\\\"OMP_WAIT_POLICY\\\"] = 'PASSIVE'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we are ready to inference the model with ONNX Runtime.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"OnnxRuntime gpu Inference time = 4.43 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import psutil\\n\",\n    \"import onnxruntime\\n\",\n    \"import numpy\\n\",\n    \"\\n\",\n    \"assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\\n\",\n    \"device_name = 'gpu'\\n\",\n    \"\\n\",\n    \"sess_options = onnxruntime.SessionOptions()\\n\",\n    \"\\n\",\n    \"# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.\\n\",\n    \"# Note that this will increase session creation time so enable it for debugging only.\\n\",\n    \"sess_options.optimized_model_filepath = os.path.join(output_dir, \\\"optimized_model_{}.onnx\\\".format(device_name))\\n\",\n    \"\\n\",\n    \"# Please change the value according to best setting in Performance Test Tool result.\\n\",\n    \"sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)\\n\",\n    \"\\n\",\n    \"session = onnxruntime.InferenceSession(export_model_path, sess_options)\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # TODO: use IO Binding (see https://github.com/microsoft/onnxruntime/pull/4206) to improve performance.\\n\",\n    \"    ort_inputs = {\\n\",\n    \"        'input_ids':  data[0].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),\\n\",\n    \"        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    ort_outputs = session.run(None, ort_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"    \\n\",\n    \"print(\\\"OnnxRuntime {} Inference time = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare the output of PyTorch and ONNX Runtime. We can see some results are not close. It is because ONNX Runtime uses some approximation in CUDA optimization. 
Based on our evaluation on SQuAD data set, F1 score is on par for models before and after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Verifying correctness *****\\n\",\n      \"PyTorch and ONNX Runtime output 0 are close: True\\n\",\n      \"maximum_diff=9.499490261077881e-07 average_diff=1.4225952327251434e-07\\n\",\n      \"PyTorch and ONNX Runtime output 1 are close: True\\n\",\n      \"maximum_diff=6.92903995513916e-07 average_diff=1.2441887520253658e-07\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Verifying correctness *****\\\")\\n\",\n    \"for i in range(2):    \\n\",\n    \"    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))\\n\",\n    \"    diff = ort_outputs[i] - outputs[i].cpu().numpy()\\n\",\n    \"    max_diff = numpy.max(numpy.abs(diff))\\n\",\n    \"    avg_diff = numpy.average(numpy.abs(diff))\\n\",\n    \"    print(f'maximum_diff={max_diff} average_diff={avg_diff}')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Inference with Actual Sequence Length\\n\",\n    \"Note that ONNX model is exported using dynamic length axis. It is recommended to use actual sequence input without padding instead of fixed length input for best performance. Let's see how it can be applied to this model.\\n\",\n    \"\\n\",\n    \"From an example input below, we can see zero padding at the end of each sequence.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'input_ids': tensor([[  101,  1293,  1242,  2557,  1127,  1226,  1104,  1103,  3613, 16429,\\n\",\n       \"           5235,   136,   102,  3613, 16429,  5988,   170,   107,  1353,  1671,\\n\",\n       \"           1992,  1342,   107,  5235,   117,  1107,  1134,  1473,  3683,  3538,\\n\",\n       \"           1125,   170,  1476,   118,  1248,  2595,  4086,  1714,  1104,  2965,\\n\",\n       \"          15897,  1104,  3613, 16429,   119,  1473,  3683,  3538,  3222,  1149,\\n\",\n       \"           2551,  1168, 23759,  1116,  1121,  1506,  1103, 10280,  2231,  1111,\\n\",\n       \"           1103,  1714, 16355,   119,   102,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\\n\",\n       \"              0,     0,     0,     0,     0,     0,     0,     0]],\\n\",\n       \"        device='cuda:0'),\\n\",\n       \" 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),\\n\",\n       \" 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\\n\",\n       \"          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\n\",\n       \"          0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}\"\n      ]\n     },\n     \"execution_count\": 12,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# An example input (we can see padding). From attention_mask, we can deduce the actual length.\\n\",\n    \"inputs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The original sequence length is 128. After removing paddings, the sequence length is reduced. Input with smaller sequence length need less computation, thus we can see there is improvement on inference latency. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Average length 101\\n\",\n      \"OnnxRuntime gpu Inference time with actual sequence length = 4.23 ms\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import statistics\\n\",\n    \"\\n\",\n    \"latency = []\\n\",\n    \"lengths = []\\n\",\n    \"for i in range(total_samples):\\n\",\n    \"    data = dataset[i]\\n\",\n    \"    # Instead of using fixed length (128), we can use actual sequence length (less than 128), which helps to get better performance.\\n\",\n    \"    actual_sequence_length = sum(data[1].numpy())\\n\",\n    \"    lengths.append(actual_sequence_length)\\n\",\n    \"    opt_inputs = {\\n\",\n    \"        'input_ids':  data[0].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'input_mask': data[1].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length),\\n\",\n    \"        'segment_ids': data[2].numpy()[:actual_sequence_length].reshape(1, actual_sequence_length)\\n\",\n    \"    }\\n\",\n    \"    start = time.time()\\n\",\n    \"    opt_outputs = session.run(None, opt_inputs)\\n\",\n    \"    latency.append(time.time() - start)\\n\",\n    \"print(\\\"Average length\\\", statistics.mean(lengths))\\n\",\n    \"print(\\\"OnnxRuntime {} Inference time with actual sequence length = {} ms\\\".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's compare the output and see whether the results are close.\\n\",\n    \"\\n\",\n    \"**Note**: Need end-to-end evaluation on performance and accuracy if you use this strategy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"***** Comparing results with/without paddings *****\\n\",\n      \"Output 0 are close: True\\n\",\n      
\"Output 1 are close: True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(\\\"***** Comparing results with/without paddings *****\\\")\\n\",\n    \"for i in range(2):\\n\",\n    \"    print('Output {} are close:'.format(i), numpy.allclose(opt_outputs[i], ort_outputs[i][:,:len(opt_outputs[i][0])], rtol=1e-03, atol=1e-03))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 5. Offline Optimization and Test Tools\\n\",\n    \"\\n\",\n    \"It is recommended to try [OnnxRuntime Transformer Model Optimization Tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) on the exported ONNX models. It could help verify whether the model can be fully optimized, and get performance test results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Transformer Optimizer\\n\",\n    \"\\n\",\n    \"Although OnnxRuntime could optimize Bert model exported by PyTorch. Sometime, model cannot be fully optimized due to different reasons:\\n\",\n    \"* A new subgraph pattern is generated by new version of export tool, and the pattern is not covered by older version of OnnxRuntime. \\n\",\n    \"* The exported model uses dynamic axis and this makes it harder for shape inference of the graph. That blocks some optimization to be applied.\\n\",\n    \"* Some optimization is better to be done offline. Like change input tensor type from int64 to int32 to avoid extra Cast nodes, or convert model to float16 to achieve better performance in V100 or T4 GPU.\\n\",\n    \"\\n\",\n    \"We have python script **optimizer.py**, which is more flexible in graph pattern matching and model conversion (like float32 to float16). You can also use it to verify whether a Bert model is fully optimized.\\n\",\n    \"\\n\",\n    \"In this example, we can see that it introduces optimization that is not provided by onnxruntime: SkipLayerNormalization and bias fusion, which is not fused in OnnxRuntime due to shape inference as mentioned.\\n\",\n    \"\\n\",\n    \"It will also tell whether the model is fully optimized or not. If not, that means you might need change the script to fuse some new pattern of subgraph.\\n\",\n    \"\\n\",\n    \"Example Usage:\\n\",\n    \"```\\n\",\n    \"from onnxruntime_tools import optimizer\\n\",\n    \"optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)\\n\",\n    \"optimized_model.save_model_to_file(optimized_model_path)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You can also use optimizer_cli like the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Float32 Model\\n\",\n    \"Let us optimize the ONNX model using the script. The first example will output model with float32 to store weights. 
This is the choice for most GPUs without Tensor Core.\\n\",\n    \"\\n\",\n    \"If your GPU (like V100 or T4) has Tensor Core, jump to [Float16 Model](#6.-Model-Optimization-with-Float16) section since that will give you better performance than Float32 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp32.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp32_model_path = './onnx/bert-base-cased-squad_opt_{}_fp32.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp32_model_path\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Optimized Graph\\n\",\n    \"We can open the optimized model using [Netron](https://github.com/lutzroeder/netron) to visualize.\\n\",\n    \"\\n\",\n    \"The graph is like the following:\\n\",\n    \"<img src='images/optimized_bert_gpu.png'>\\n\",\n    \"\\n\",\n    \"Sometime, optimized graph is slightly different. 
For example, FastGelu is replaced by BiasGelu for CPU inference; When the option --input_int32 is used, Cast nodes for inputs are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import netron\\n\",\n    \"\\n\",\n    \"# change it to True if want to view the optimized model in browser\\n\",\n    \"enable_netron = False\\n\",\n    \"if enable_netron:\\n\",\n    \"    # If you encounter error \\\"access a socket in a way forbidden by its access permissions\\\", install Netron as standalone application instead.\\n\",\n    \"    netron.start(optimized_fp32_model_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Performance Test Tool\\n\",\n    \"\\n\",\n    \"The following will create 1000 random inputs of batch_size 1 and sequence length 128, then measure the average latency and throughput numbers.\\n\",\n    \"\\n\",\n    \"Note that the test uses fixed sequence length. If you use [dynamic sequence length](#Inference-with-Actual-Sequence-Length), actual performance depends on the distribution of sequence length.\\n\",\n    \"\\n\",\n    \"**Attention**: Latency numbers from Jupyter Notebook are not accurate. See [Attional Info](#7.-Additional-Info) for more info.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.92 ms, Throughput = 203.24 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.90 ms, Throughput = 203.88 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 5.07 ms, Throughput = 197.16 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.82 ms, Throughput = 207.33 QPS\\n\",\n      \"skip duplicated test: 
model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.93 ms, Throughput = 202.92 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.91 ms, Throughput = 203.55 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp32.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.88 ms, Throughput = 204.90 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's load the summary file and take a look. 
Note that blank value in OMP_NUM_THREADS or OMP_WAIT_POLICY means the environment variable does not exist.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232134.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.82</td>\\n\",\n       \"      <td>4.53</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>5.15</td>\\n\",\n       \"      <td>7.25</td>\\n\",\n       \"      <td>8.75</td>\\n\",\n       \"      <td>207.33</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.88</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.58</td>\\n\",\n       \"      <td>6.47</td>\\n\",\n       \"      <td>7.13</td>\\n\",\n       \"      <td>8.68</td>\\n\",\n       \"      <td>204.90</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.90</td>\\n\",\n       \"      <td>4.54</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>6.16</td>\\n\",\n       \"      <td>7.64</td>\\n\",\n       \"      <td>8.82</td>\\n\",\n       \"      <td>203.88</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>4.91</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.70</td>\\n\",\n       \"      <td>7.43</td>\\n\",\n       \"      <td>8.78</td>\\n\",\n      
 \"      <td>203.55</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4.92</td>\\n\",\n       \"      <td>4.57</td>\\n\",\n       \"      <td>4.60</td>\\n\",\n       \"      <td>6.50</td>\\n\",\n       \"      <td>7.82</td>\\n\",\n       \"      <td>8.90</td>\\n\",\n       \"      <td>203.24</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>4.93</td>\\n\",\n       \"      <td>4.55</td>\\n\",\n       \"      <td>4.59</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.57</td>\\n\",\n       \"      <td>8.80</td>\\n\",\n       \"      <td>202.92</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>5.07</td>\\n\",\n       \"      <td>4.56</td>\\n\",\n       \"      <td>4.61</td>\\n\",\n       \"      <td>7.19</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>9.01</td>\\n\",\n       \"      <td>197.16</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         4.82         4.53         4.57         5.15         7.25   \\n\",\n       \"1         4.88         4.54         4.58         6.47         7.13   \\n\",\n       \"2         4.90         4.54         4.57         6.16         7.64   \\n\",\n       \"3         4.91         4.55         4.59         6.70         7.43   \\n\",\n       \"4         4.92         4.57         4.60         6.50         7.82   \\n\",\n       \"5         4.93         4.55         4.59         6.66         7.57   \\n\",\n       \"6         5.07         4.56         4.61         7.19         8.11   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         8.75           207.33                     1              12   \\n\",\n       \"1         8.68           204.90                    12              12   \\n\",\n       \"2         8.82           203.88                     1              12   \\n\",\n       \"3         8.78           203.55                    12              12   \\n\",\n       \"4         8.90           203.24                     0                   \\n\",\n       \"5         8.80           202.92                    12               1   \\n\",\n       \"6         9.01           197.16                    12               1   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1         PASSIVE       None    True  \\n\",\n       \"2         PASSIVE       None    True  \\n\",\n       \"3     
     ACTIVE       None    True  \\n\",\n       \"4                       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6          ACTIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"From above result, we can see that latency is very close for different settings. The default setting (intra_op_num_threads=0, OMP_NUM_THREADS and OMP_WAIT_POLICY does not exist) performs the best. \\n\",\n    \"\\n\",\n    \"### Model Results Comparison Tool\\n\",\n    \"\\n\",\n    \"When a BERT model is optimized, some approximation is used in calculation. If your BERT model has three inputs, a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare the inference outputs of the original and optimized models. If outputs are all close, it is safe to use the optimized model.\\n\",\n    \"\\n\",\n    \"For GPU inference, the absolute or relative difference is larger than those numbers of CPU inference. Note that slight difference in output will not impact final result. We did end-to-end evaluation using SQuAD data set using a fine-tuned squad model, and F1 score is almost the same before/after optimization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"100% passed for 100 random inputs given thresholds (rtol=0.01, atol=0.01).\\r\\n\",\n      \"maximum absolute difference=1.9222497940063477e-06\\r\\n\",\n      \"maximum relative difference=0.05027933046221733\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!python -m onnxruntime_tools.transformers.compare_bert_results --baseline_model $export_model_path --optimized_model $optimized_fp32_model_path --batch_size 1 --sequence_length 128 --samples 100 --rtol 0.01 --atol 0.01 $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 6. Model Optimization with Float16\\n\",\n    \"\\n\",\n    \"The optimizer.py script have an option **--float16** to convert model to use float16 to store weights. After the conversion, it could be faster to run in GPU with tensor cores like V100 or T4.\\n\",\n    \"\\n\",\n    \"Let's run tools to measure the performance on V100. 
The results show significant performance improvement: latency is about 3.4 ms for float32 model, and 1.8 ms for float16 model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"optimize_by_onnxruntime: Save optimized model by onnxruntime to ./onnx/bert-base-cased-squad_opset11_o1_cpu.onnx\\n\",\n      \"               apply: Fused LayerNormalization count: 25\\n\",\n      \"               apply: Fused Gelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization count: 25\\n\",\n      \"               apply: Fused Attention count: 12\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed\\n\",\n      \"               apply: Fused EmbedLayerNormalization(with mask) count: 1\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed\\n\",\n      \"         prune_graph: Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed\\n\",\n      \"               apply: Fused BiasGelu count: 12\\n\",\n      \"               apply: Fused SkipLayerNormalization(add bias) count: 24\\n\",\n      \"            optimize: opset verion: 11\\n\",\n      \"  save_model_to_file: Output model to ./onnx/bert-base-cased-squad_opt_gpu_fp16.onnx\\n\",\n      \"get_fused_operator_statistics: Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'Gelu': 0, 'FastGelu': 0, 'BiasGelu': 12, 'LayerNormalization': 0, 'SkipLayerNormalization': 24}\\n\",\n      \"                main: The model has been fully optimized.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"optimized_fp16_model_path = './onnx/bert-base-cased-squad_opt_{}_fp16.onnx'.format('gpu' if use_gpu else 'cpu')\\n\",\n    \"!python -m onnxruntime_tools.optimizer_cli --input $export_model_path --output $optimized_fp16_model_path --float16\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=None, omp_wait_policy=None, intra_op_num_threads=None, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=0,OMP_NUM_THREADS=,OMP_WAIT_POLICY=,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.90 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.12 ms, Throughput = 320.00 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.02 ms, Throughput = 
331.39 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 332.53 QPS\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"skip duplicated test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=1,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 328.67 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.01 ms, Throughput = 331.72 QPS\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=12,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=PASSIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.04 ms, Throughput = 329.32 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 --sequence_length 128 --samples 1000 --test_times 1 --inclusive --all $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float32 model perf results from ./onnx/perf_results_GPU_B1_S128_20200617-232234.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      <th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      
<th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>intra_op_num_threads</th>\\n\",\n       \"      <th>OMP_NUM_THREADS</th>\\n\",\n       \"      <th>OMP_WAIT_POLICY</th>\\n\",\n       \"      <th>contiguous</th>\\n\",\n       \"      <th>warmup</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>5.08</td>\\n\",\n       \"      <td>7.16</td>\\n\",\n       \"      <td>332.53</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.88</td>\\n\",\n       \"      <td>4.52</td>\\n\",\n       \"      <td>7.05</td>\\n\",\n       \"      <td>331.90</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td></td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>3.01</td>\\n\",\n       \"      <td>2.78</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>5.01</td>\\n\",\n       \"      <td>7.02</td>\\n\",\n       \"      <td>331.72</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3.02</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.85</td>\\n\",\n       \"      <td>6.34</td>\\n\",\n       \"      <td>7.04</td>\\n\",\n       \"      <td>331.39</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>ACTIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.80</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.93</td>\\n\",\n       \"      <td>5.56</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>329.32</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>3.04</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.92</td>\\n\",\n       \"      <td>6.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>328.67</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>6</th>\\n\",\n       \"      <td>3.12</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.82</td>\\n\",\n       \"      <td>2.96</td>\\n\",\n       \"      <td>6.66</td>\\n\",\n       \"      <td>7.20</td>\\n\",\n       \"      <td>320.00</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>12</td>\\n\",\n       \"      <td>PASSIVE</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.01         2.79         2.81         2.86         5.08   \\n\",\n       \"1         3.01         2.80         2.81         2.88         4.52   \\n\",\n       \"2         3.01         2.78         2.80         2.92         5.01   \\n\",\n       \"3         3.02         2.79         2.80         2.85         6.34   \\n\",\n       \"4         3.04         2.80         2.82         2.93         5.56   \\n\",\n       \"5         3.04         2.79         2.81         2.92         6.37   \\n\",\n       \"6         3.12         2.79         2.82         2.96         6.66   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  intra_op_num_threads OMP_NUM_THREADS  \\\\\\n\",\n       \"0         7.16           332.53                     1              12   \\n\",\n       \"1         7.05           331.90                     0                   \\n\",\n       \"2         7.02           331.72                    12              12   \\n\",\n       \"3         7.04           331.39                    12               1   \\n\",\n       \"4         7.08           329.32                    12              12   \\n\",\n       \"5         7.08           328.67                    12               1   \\n\",\n       \"6         7.20           320.00                     1              12   \\n\",\n       \"\\n\",\n       \"  OMP_WAIT_POLICY contiguous  warmup  \\n\",\n       \"0          ACTIVE       None    True  \\n\",\n       \"1                       None    True  \\n\",\n       \"2          ACTIVE       None    True  \\n\",\n       \"3          ACTIVE       None    True  \\n\",\n       \"4         PASSIVE       None    True  \\n\",\n       \"5         PASSIVE       None    True  \\n\",\n       \"6         PASSIVE       None    True  \"\n      ]\n     },\n     \"execution_count\": 22,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_GPU_B1_S128_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float32 model perf results from\\\", latest_result_file)\\n\",\n    \"# Remove some columns that have same values for all rows.\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu']\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Throughput Tuning\\n\",\n    \"\\n\",\n    \"Some application need best throughput under some constraint on latency. 
This can be done by testing performance of different batch sizes. The tool could help on this.\\n\",\n    \"\\n\",\n    \"Here is an example that check the performance of multiple batch sizes (1, 2, 4, 8, 16, 32 and 64) using default settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"test setting TestSetting(batch_size=32, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=32 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=32,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 16.17 ms, Throughput = 1979.41 QPS\\n\",\n      \"test setting TestSetting(batch_size=1, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=1 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=1,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.00 ms, Throughput = 333.83 QPS\\n\",\n      \"test setting TestSetting(batch_size=2, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=2 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=2,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 3.59 ms, Throughput = 557.32 QPS\\n\",\n      \"test setting TestSetting(batch_size=64, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=64 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=64,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 29.26 ms, Throughput = 2187.15 QPS\\n\",\n      \"test setting TestSetting(batch_size=4, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      
\"Generating 1000 samples for batch_size=4 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=4,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 4.32 ms, Throughput = 926.92 QPS\\n\",\n      \"test setting TestSetting(batch_size=8, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=8 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=8,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 6.32 ms, Throughput = 1266.63 QPS\\n\",\n      \"test setting TestSetting(batch_size=16, sequence_length=128, test_cases=1000, test_times=1, contiguous=None, use_gpu=True, warmup=True, omp_num_threads=12, omp_wait_policy='ACTIVE', intra_op_num_threads=1, seed=3, verbose=False, inclusive=False, extra_latency=True)\\n\",\n      \"Generating 1000 samples for batch_size=16 sequence_length=128\\n\",\n      \"Running test: model=bert-base-cased-squad_opt_gpu_fp16.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=1,OMP_NUM_THREADS=12,OMP_WAIT_POLICY=ACTIVE,batch_size=16,sequence_length=128,test_cases=1000,test_times=1,contiguous=None,use_gpu=True,warmup=True\\n\",\n      \"Average latency = 9.60 ms, Throughput = 1666.05 QPS\\n\",\n      \"Test summary is saved to onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"GPU_OPTION = '--use_gpu' if use_gpu else ''\\n\",\n    \"THREAD_SETTING = '--intra_op_num_threads 1 --omp_num_threads {} --omp_wait_policy ACTIVE'.format(psutil.cpu_count(logical=True))\\n\",\n    \"!python -m onnxruntime_tools.transformers.bert_perf_test --model $optimized_fp16_model_path --batch_size 1 2 4 8 16 32 64 --sequence_length 128 --samples 1000 --test_times 1 --inclusive $THREAD_SETTING $GPU_OPTION\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Float16 model summary from ./onnx/perf_results_GPU_B1-2-4-8-16-32-64_S128_20200617-232401.txt\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Latency(ms)</th>\\n\",\n       \"      <th>Latency_P50</th>\\n\",\n       \"      
<th>Latency_P75</th>\\n\",\n       \"      <th>Latency_P90</th>\\n\",\n       \"      <th>Latency_P95</th>\\n\",\n       \"      <th>Latency_P99</th>\\n\",\n       \"      <th>Throughput(QPS)</th>\\n\",\n       \"      <th>batch_size</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>3.00</td>\\n\",\n       \"      <td>2.79</td>\\n\",\n       \"      <td>2.81</td>\\n\",\n       \"      <td>2.86</td>\\n\",\n       \"      <td>4.37</td>\\n\",\n       \"      <td>7.08</td>\\n\",\n       \"      <td>333.83</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>3.59</td>\\n\",\n       \"      <td>3.33</td>\\n\",\n       \"      <td>3.35</td>\\n\",\n       \"      <td>3.42</td>\\n\",\n       \"      <td>6.60</td>\\n\",\n       \"      <td>7.54</td>\\n\",\n       \"      <td>557.32</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>4.32</td>\\n\",\n       \"      <td>3.98</td>\\n\",\n       \"      <td>4.01</td>\\n\",\n       \"      <td>4.64</td>\\n\",\n       \"      <td>7.23</td>\\n\",\n       \"      <td>8.11</td>\\n\",\n       \"      <td>926.92</td>\\n\",\n       \"      <td>4</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>6.32</td>\\n\",\n       \"      <td>5.94</td>\\n\",\n       \"      <td>5.97</td>\\n\",\n       \"      <td>7.61</td>\\n\",\n       \"      <td>8.96</td>\\n\",\n       \"      <td>10.12</td>\\n\",\n       \"      <td>1266.63</td>\\n\",\n       \"      <td>8</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>9.60</td>\\n\",\n       \"      <td>9.22</td>\\n\",\n       \"      <td>9.25</td>\\n\",\n       \"      <td>11.32</td>\\n\",\n       \"      <td>12.33</td>\\n\",\n       \"      <td>13.34</td>\\n\",\n       \"      <td>1666.05</td>\\n\",\n       \"      <td>16</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>16.17</td>\\n\",\n       \"      <td>15.80</td>\\n\",\n       \"      <td>15.90</td>\\n\",\n       \"      <td>17.38</td>\\n\",\n       \"      <td>18.80</td>\\n\",\n       \"      <td>19.93</td>\\n\",\n       \"      <td>1979.41</td>\\n\",\n       \"      <td>32</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>29.26</td>\\n\",\n       \"      <td>28.89</td>\\n\",\n       \"      <td>29.01</td>\\n\",\n       \"      <td>30.63</td>\\n\",\n       \"      <td>32.53</td>\\n\",\n       \"      <td>33.28</td>\\n\",\n       \"      <td>2187.15</td>\\n\",\n       \"      <td>64</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Latency(ms)  Latency_P50  Latency_P75  Latency_P90  Latency_P95  \\\\\\n\",\n       \"0         3.00         2.79         2.81         2.86         4.37   \\n\",\n       \"1         3.59         3.33         3.35         3.42         6.60   \\n\",\n       \"2         4.32         3.98         4.01         4.64         7.23   \\n\",\n       \"3         6.32         5.94         5.97         7.61         8.96   \\n\",\n       \"4         9.60         9.22         9.25        11.32        
12.33   \\n\",\n       \"5        16.17        15.80        15.90        17.38        18.80   \\n\",\n       \"6        29.26        28.89        29.01        30.63        32.53   \\n\",\n       \"\\n\",\n       \"   Latency_P99  Throughput(QPS)  batch_size  \\n\",\n       \"0         7.08           333.83           1  \\n\",\n       \"1         7.54           557.32           2  \\n\",\n       \"2         8.11           926.92           4  \\n\",\n       \"3        10.12          1266.63           8  \\n\",\n       \"4        13.34          1666.05          16  \\n\",\n       \"5        19.93          1979.41          32  \\n\",\n       \"6        33.28          2187.15          64  \"\n      ]\n     },\n     \"execution_count\": 26,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import os\\n\",\n    \"import glob     \\n\",\n    \"import pandas\\n\",\n    \"latest_result_file = max(glob.glob(\\\"./onnx/perf_results_*.txt\\\"), key=os.path.getmtime)\\n\",\n    \"result_data = pandas.read_table(latest_result_file, converters={'OMP_NUM_THREADS': str, 'OMP_WAIT_POLICY':str})\\n\",\n    \"print(\\\"Float16 model summary from\\\", latest_result_file)\\n\",\n    \"columns_to_remove = ['model', 'graph_optimization_level', 'test_cases', 'test_times', 'use_gpu', 'warmup', 'sequence_length']\\n\",\n    \"columns_to_remove.extend(['intra_op_num_threads', 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', 'contiguous'])\\n\",\n    \"result_data.drop(columns_to_remove, axis=1, inplace=True)\\n\",\n    \"result_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 7. Additional Info\\n\",\n    \"\\n\",\n    \"Note that running Jupyter Notebook has significant impact on performance result. You can close Jupyter Notebook and other applications, then run the performance test in a console to get more accurate performance numbers.\\n\",\n    \"\\n\",\n    \"We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it measure inference speed of OnnxRuntime.\\n\",\n    \"\\n\",\n    \"[OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) could get slightly better performance than python API. If you use C API in inference, you can use OnnxRuntime_Perf_Test.exe built from source to measure performance instead.\\n\",\n    \"\\n\",\n    \"Here is the machine configuration that generated the above results. 
You might get slower or faster result according to your hardware.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{\\r\\n\",\n      \"  \\\"gpu\\\": {\\r\\n\",\n      \"    \\\"driver_version\\\": \\\"440.64.00\\\",\\r\\n\",\n      \"    \\\"devices\\\": [\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 14110883840,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      },\\r\\n\",\n      \"      {\\r\\n\",\n      \"        \\\"memory_total\\\": 16945512448,\\r\\n\",\n      \"        \\\"memory_available\\\": 16932601856,\\r\\n\",\n      \"        \\\"name\\\": \\\"Tesla V100-PCIE-16GB\\\"\\r\\n\",\n      \"      }\\r\\n\",\n      \"    ]\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"cpu\\\": {\\r\\n\",\n      \"    \\\"brand\\\": \\\"Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz\\\",\\r\\n\",\n      \"    \\\"cores\\\": 12,\\r\\n\",\n      \"    \\\"logical_cores\\\": 12,\\r\\n\",\n      \"    \\\"hz\\\": \\\"2.5940 GHz\\\",\\r\\n\",\n      \"    \\\"l2_cache\\\": \\\"256 KB\\\",\\r\\n\",\n      \"    \\\"l3_cache\\\": \\\"35840 KB\\\",\\r\\n\",\n      \"    \\\"processor\\\": \\\"x86_64\\\"\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"memory\\\": {\\r\\n\",\n      \"    \\\"total\\\": 236645588992,\\r\\n\",\n      \"    \\\"available\\\": 222567559168\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"python\\\": \\\"3.7.7.final.0 (64 bit)\\\",\\r\\n\",\n      \"  \\\"os\\\": \\\"Linux-4.15.0-1089-azure-x86_64-with-debian-stretch-sid\\\",\\r\\n\",\n      \"  \\\"onnxruntime\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.3.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"pytorch\\\": {\\r\\n\",\n      \"    \\\"version\\\": \\\"1.5.0\\\",\\r\\n\",\n      \"    \\\"support_gpu\\\": true\\r\\n\",\n      \"  },\\r\\n\",\n      \"  \\\"tensorflow\\\": null\\r\\n\",\n      \"}\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!{sys.executable} -m onnxruntime_tools.transformers.machine_info --silent\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"PyCharm (ccks_ner-master)\",\n   \"language\": \"python\",\n   \"name\": \"pycharm-de4c0941\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.8.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "code/nezha-base-count5/finetuning/Config.py",
    "content": "from transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig, ElectraModel, ElectraConfig, ElectraTokenizer, \\\n    RobertaTokenizer, RobertaModel, RobertaConfig\nfrom NEZHA.modeling_nezha import NeZhaModel\nfrom NEZHA.configuration_nezha import NeZhaConfig\n\n\nMODELS = {\n    'BertForClass':  BertModel,\n    'BertForClass_MultiDropout':  BertModel,\n   'BertLastTwoCls':  BertModel,\n    'BertLastCls':BertModel,\n   'BertLastTwoClsPooler':  BertModel,\n    'BertLastTwoEmbeddings': BertModel,\n    'BertLastTwoEmbeddingsPooler': BertModel,\n    'BertLastFourCls': BertModel,\n    'BertLastFourClsPooler':  BertModel,\n    'BertLastFourEmbeddings':  BertModel,\n   'BertLastFourEmbeddingsPooler':  BertModel,\n   'BertDynCls':  BertModel,\n    'BertDynEmbeddings': BertModel,\n    'BertRNN': BertModel,\n    'BertCNN': XLNetModel,\n    'BertRCNN':  BertModel,\n    'XLNet': XLNetModel,\n    'Electra': ElectraModel,\n    'NEZHA': NeZhaModel\n    }\n\nTOKENIZERS = {\n    'BertForClass': BertTokenizer,\n    'BertForClass_MultiDropout': BertTokenizer,\n    'BertLastTwoCls': BertTokenizer,\n    'BertLastCls': BertTokenizer,\n    'BertLastTwoClsPooler': BertTokenizer,\n    'BertLastTwoEmbeddings': BertTokenizer,\n    'BertLastTwoEmbeddingsPooler': BertTokenizer,\n    'BertLastFourCls': BertTokenizer,\n    'BertLastFourClsPooler': BertTokenizer,\n    'BertLastFourEmbeddings': BertTokenizer,\n    'BertLastFourEmbeddingsPooler': BertTokenizer,\n    'BertDynCls': BertTokenizer,\n    'BertDynEmbeddings': BertTokenizer,\n    'BertRNN': BertTokenizer,\n    'BertCNN': BertTokenizer,\n    'BertRCNN': BertTokenizer,\n    'XLNet': XLNetTokenizer,\n    'Electra': ElectraTokenizer,\n    'NEZHA': BertTokenizer\n    }\n\nCONFIGS = {\n    'BertForClass': BertConfig,\n    'BertForClass_MultiDropout': BertConfig,\n    'BertLastTwoCls': BertConfig,\n    'BertLastCls': BertConfig,\n    'BertLastTwoClsPooler': BertConfig,\n    'BertLastTwoEmbeddings': BertConfig,\n    'BertLastTwoEmbeddingsPooler': BertConfig,\n    'BertLastFourCls': BertConfig,\n    'BertLastFourClsPooler': BertConfig,\n    'BertLastFourEmbeddings': BertConfig,\n    'BertLastFourEmbeddingsPooler': BertConfig,\n    'BertDynCls': BertConfig,\n    'BertDynEmbeddings': BertConfig,\n    'BertRNN': BertConfig,\n    'BertCNN': BertConfig,\n    'BertRCNN': BertConfig,\n    'XLNet': XLNetConfig,\n    'Electra': ElectraConfig,\n    'NEZHA': NeZhaConfig\n\n    }"
  },
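The three lookup tables in `code/nezha-base-count5/finetuning/Config.py` share the same keys, so the fine-tuning code can select a backbone, tokenizer, and config class from a single model-name string. The sketch below is illustrative only and is not a file from this repository; `pretrain_dir` is a placeholder for a local pretrained checkpoint directory.

```python
# Hypothetical usage of the MODELS / TOKENIZERS / CONFIGS lookup tables.
from Config import MODELS, TOKENIZERS, CONFIGS

model_name = 'NEZHA'                     # any key present in all three tables
pretrain_dir = './pretrain_model/nezha'  # placeholder checkpoint directory

config = CONFIGS[model_name].from_pretrained(pretrain_dir)
tokenizer = TOKENIZERS[model_name].from_pretrained(pretrain_dir)
backbone = MODELS[model_name].from_pretrained(pretrain_dir, config=config)
```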
  {
    "path": "code/nezha-base-count5/finetuning/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
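The defaults in `NeZhaConfig.__init__` (hidden_size=4096, num_attention_heads=64, intermediate_size=16384) follow the ALBERT-xxlarge configuration its docstring was adapted from, so a BERT-base-sized NeZha encoder must be configured explicitly. A minimal sketch, assuming a placeholder vocabulary size:

```python
# Illustrative only: a BERT-base-sized NeZha configuration.
from NEZHA.configuration_nezha import NeZhaConfig

config = NeZhaConfig(
    vocab_size=21128,            # placeholder; use the size of the actual vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    max_relative_position=64,    # clipping range for NeZha's relative positions
    use_relative_position=True,  # NeZha uses functional relative position encoding
)
```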
  {
    "path": "code/nezha-base-count5/finetuning/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.utils.checkpoint\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom transformers.activations import ACT2FN\nfrom transformers.file_utils import (\n    ModelOutput,\n    add_code_sample_docstrings,\n    add_start_docstrings,\n    add_start_docstrings_to_model_forward,\n    replace_return_docstrings,\n)\nfrom transformers.modeling_outputs import (\n    BaseModelOutputWithPastAndCrossAttentions,\n    BaseModelOutputWithPoolingAndCrossAttentions,\n    CausalLMOutputWithCrossAttentions,\n    MaskedLMOutput,\n    MultipleChoiceModelOutput,\n    NextSentencePredictorOutput,\n    QuestionAnsweringModelOutput,\n    SequenceClassifierOutput,\n    TokenClassifierOutput,\n)\nfrom transformers.modeling_utils import (\n    PreTrainedModel,\n    apply_chunking_to_forward,\n    find_pruneable_heads_and_indices,\n    prune_linear_layer,\n)\n\nfrom transformers.models.bert.configuration_bert import BertConfig\n\nimport logging\nlogger = logging.getLogger(__name__)\n\n_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n_CONFIG_FOR_DOC = \"BertConfig\"\n_TOKENIZER_FOR_DOC = \"BertTokenizer\"\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, 
scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=64):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    
my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.position_embedding_type = getattr(config, \"position_embedding_type\", \"absolute\")\n        if self.position_embedding_type == \"relative_key\" or self.position_embedding_type == \"relative_key_query\":\n            self.max_position_embeddings = config.max_position_embeddings\n            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)\n\n        self.is_decoder = config.is_decoder\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        is_cross_attention = encoder_hidden_states is not None\n\n        if is_cross_attention and past_key_value is not None:\n            # reuse k,v, cross_attentions\n            key_layer = past_key_value[0]\n            value_layer = past_key_value[1]\n            attention_mask = encoder_attention_mask\n        elif is_cross_attention:\n            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))\n            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))\n            attention_mask = encoder_attention_mask\n        elif past_key_value is not None:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)\n            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)\n        else:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n\n        if self.is_decoder:\n            # if 
cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.\n            # Further calls to cross_attention layer can then reuse all cross-attention\n            # key/value_states (first \"if\" case)\n            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of\n            # all previous decoder key/value_states. Further calls to uni-directional self-attention\n            # can concat previous decoder key/value_states to current projected key/value_states (third \"elif\" case)\n            # if encoder bi-directional self-attention `past_key_value` is always `None`\n            past_key_value = (key_layer, value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_kv.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in NeZhaModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_kv)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)\n\n        if 
self.is_decoder:\n            outputs = outputs + (past_key_value,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        heads, index = find_pruneable_heads_and_indices(\n            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads\n        )\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        self_outputs = self.self(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            encoder_hidden_states,\n            encoder_attention_mask,\n            past_key_value,\n            output_attentions,\n            relations_kv=relations_kv\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = 
self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        self.add_cross_attention = config.add_cross_attention\n        if self.add_cross_attention:\n            assert self.is_decoder, f\"{self} should be used as a decoder model if cross attention is added\"\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2\n        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None\n        self_attention_outputs = self.attention(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            output_attentions=output_attentions,\n            past_key_value=self_attn_past_key_value,\n            relations_kv=relations_kv\n        )\n        attention_output = self_attention_outputs[0]\n\n        # if decoder, the last output is tuple of self-attn cache\n        if self.is_decoder:\n            outputs = self_attention_outputs[1:-1]\n            present_key_value = self_attention_outputs[-1]\n        else:\n            outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        cross_attn_present_key_value = None\n        if self.is_decoder and encoder_hidden_states is not None:\n            assert hasattr(\n                self, \"crossattention\"\n            ), f\"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`\"\n\n            # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple\n            cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None\n            cross_attention_outputs = self.crossattention(\n                attention_output,\n                attention_mask,\n                head_mask,\n                encoder_hidden_states,\n                encoder_attention_mask,\n                cross_attn_past_key_value,\n                output_attentions,\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights\n\n            # add cross-attn cache to positions 3,4 of present_key_value tuple\n            cross_attn_present_key_value = cross_attention_outputs[-1]\n            present_key_value = present_key_value + cross_attn_present_key_value\n\n        layer_output = apply_chunking_to_forward(\n            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output\n        )\n        outputs = (layer_output,) + outputs\n\n        # if decoder, return the attn key/values as the last output\n        if self.is_decoder:\n            outputs = 
outputs + (present_key_value,)\n\n        return outputs\n\n    def feed_forward_chunk(self, attention_output):\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        return layer_output\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=int(config.hidden_size / config.num_attention_heads),\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=False,\n        output_hidden_states=False,\n        return_dict=False,\n    ):\n        to_seq_length=hidden_states.shape[1]\n        relations_kv = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        all_hidden_states = () if output_hidden_states else None\n        all_self_attentions = () if output_attentions else None\n        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None\n\n        next_decoder_cache = () if use_cache else None\n        for i, layer_module in enumerate(self.layer):\n            if output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_head_mask = head_mask[i] if head_mask is not None else None\n            past_key_value = past_key_values[i] if past_key_values is not None else None\n\n            if getattr(self.config, \"gradient_checkpointing\", False) and self.training:\n\n                if use_cache:\n                    logger.warn(\n                        \"`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. 
Setting \"\n                        \"`use_cache=False`...\"\n                    )\n                    use_cache = False\n\n                def create_custom_forward(module):\n                    def custom_forward(*inputs):\n                        return module(*inputs, past_key_value, output_attentions)\n\n                    return custom_forward\n\n                layer_outputs = torch.utils.checkpoint.checkpoint(\n                    create_custom_forward(layer_module),\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                )\n            else:\n                layer_outputs = layer_module(\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                    past_key_value,\n                    output_attentions,relations_kv=relations_kv\n                )\n\n            hidden_states = layer_outputs[0]\n            if use_cache:\n                next_decoder_cache += (layer_outputs[-1],)\n            if output_attentions:\n                all_self_attentions = all_self_attentions + (layer_outputs[1],)\n                if self.config.add_cross_attention:\n                    all_cross_attentions = all_cross_attentions + (layer_outputs[2],)\n\n        if output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        if not return_dict:\n            return tuple(\n                v\n                for v in [\n                    hidden_states,\n                    next_decoder_cache,\n                    all_hidden_states,\n                    all_self_attentions,\n                    all_cross_attentions,\n                ]\n                if v is not None\n            )\n        return BaseModelOutputWithPastAndCrossAttentions(\n            last_hidden_state=hidden_states,\n            past_key_values=next_decoder_cache,\n            hidden_states=all_hidden_states,\n            attentions=all_self_attentions,\n            cross_attentions=all_cross_attentions,\n        )\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n  
  def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\n@dataclass\nclass BertForPreTrainingOutput(ModelOutput):\n    \"\"\"\n    Output type of :class:`~transformers.BertForPreTraining`.\n\n    Args:\n        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction\n            (classification) loss.\n        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language 
modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation\n            before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,\n            sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: Optional[torch.FloatTensor] = None\n    prediction_logits: torch.FloatTensor = None\n    seq_relationship_logits: torch.FloatTensor = None\n    hidden_states: Optional[Tuple[torch.FloatTensor]] = None\n    attentions: Optional[Tuple[torch.FloatTensor]] = None\n\n\nBERT_START_DOCSTRING = r\"\"\"\n\n    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic\n    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,\n    pruning heads etc.)\n\n    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__\n    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to\n    general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the\n            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model\n            weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`~transformers.BertTokenizer`. See\n            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for\n            details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):\n            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Segment token indices to indicate first and second portions of the inputs. 
Indices are selected in ``[0,\n            1]``:\n\n            - 0 corresponds to a `sentence A` token,\n            - 1 corresponds to a `sentence B` token.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,\n            config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):\n            Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:\n\n            - 1 indicates the head is **not masked**,\n            - 0 indicates the head is **masked**.\n\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated\n            vectors than the model's internal embedding lookup matrix.\n        output_attentions (:obj:`bool`, `optional`):\n            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned\n            tensors for more detail.\n        output_hidden_states (:obj:`bool`, `optional`):\n            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for\n            more detail.\n        return_dict (:obj:`bool`, `optional`):\n            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of\n    cross-attention is added between the self-attention layers, following the architecture described in `Attention is\n    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,\n    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration\n    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`\n    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an\n    input to the forward pass.\n    \"\"\"\n\n    def __init__(self, config, add_pooling_layer=True):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n\n        self.pooler = BertPooler(config) if add_pooling_layer else None\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=BaseModelOutputWithPoolingAndCrossAttentions,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n        \"\"\"\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            batch_size, seq_length = input_shape\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size, seq_length = input_shape\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x 
seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n\n            token_type_ids=token_type_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next\n    sentence prediction (classification)` head.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        next_sentence_label=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):\n            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair\n            (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):\n            Used to hide legacy arguments that have been deprecated.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForPreTraining\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.prediction_logits\n            >>> seq_relationship_logits = outputs.seq_relationship_logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        total_loss = None\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n\n        if not return_dict:\n            output = (prediction_scores, seq_relationship_score) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return BertForPreTrainingOutput(\n            loss=total_loss,\n            prediction_logits=prediction_scores,\n            seq_relationship_logits=seq_relationship_score,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. 
\"\"\", BERT_START_DOCSTRING\n)\nclass BertLMHeadModel(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [ r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if not config.is_decoder:\n            logger.warning(\"If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\")\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in\n            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are\n            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')\n            >>> config = BertConfig.from_pretrained(\"bert-base-cased\")\n            >>> config.is_decoder = True\n            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        if labels is not None:\n            use_cache = False\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        lm_loss = None\n        if labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            labels = labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((lm_loss,) + output) if lm_loss is not None else output\n\n        return CausalLMOutputWithCrossAttentions(\n            loss=lm_loss,\n            logits=prediction_scores,\n            past_key_values=outputs.past_key_values,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            cross_attentions=outputs.cross_attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # cut decoder_input_ids if past is used\n        if past is not None:\n            input_ids = input_ids[:, -1:]\n\n        return {\"input_ids\": input_ids, \"attention_mask\": 
attention_mask, \"past_key_values\": past}\n\n    def _reorder_cache(self, past, beam_idx):\n        reordered_past = ()\n        for layer_past in past:\n            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)\n        return reordered_past\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `NeZhaForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MaskedLMOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        \"\"\"\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        masked_lm_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output\n\n        return MaskedLMOutput(\n            loss=masked_lm_loss,\n            logits=prediction_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        #  add a dummy token\n        assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n        attention_mask = torch.cat([attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1)\n        dummy_token = torch.full(\n            (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n        )\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n        **kwargs\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n            (see ``input_ids`` docstring). 
Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForNextSentencePrediction\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n            >>> prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n            >>> next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n            >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n\n            >>> outputs = model(**encoding, labels=torch.LongTensor([1]))\n            >>> logits = outputs.logits\n            >>> assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        if \"next_sentence_label\" in kwargs:\n            warnings.warn(\n                \"The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.\",\n                FutureWarning,\n            )\n            labels = kwargs.pop(\"next_sentence_label\")\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_scores = self.cls(pooled_output)\n\n        next_sentence_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1))\n\n        if not return_dict:\n            output = (seq_relationship_scores,) + outputs[2:]\n            return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output\n\n        return NextSentencePredictorOutput(\n            loss=next_sentence_loss,\n            logits=seq_relationship_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled\n    output) e.g. 
for GLUE tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=SequenceClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,\n            config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        loss = None\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequenceClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a\n    softmax) e.g. 
for RocStories/SWAG tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, num_choices, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MultipleChoiceModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,\n            num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See\n            :obj:`input_ids` above)\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        inputs_embeds = (\n            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))\n            if inputs_embeds is not None\n            else None\n        )\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n\n        if not return_dict:\n            output = (reshaped_logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MultipleChoiceModelOutput(\n            loss=loss,\n            logits=reshaped_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for\n    Named-Entity-Recognition (NER) tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=TokenClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -\n            1]``.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n       
 self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=QuestionAnsweringModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            \n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        total_loss = None\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n        if not return_dict:\n            output = (start_logits, end_logits) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return QuestionAnsweringModelOutput(\n  
          loss=total_loss,\n            start_logits=start_logits,\n            end_logits=end_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/finetuning/model.py",
    "content": "import torch\nimport random\nimport os\nfrom torch import nn, optim\nimport torch.nn.functional as F\nfrom transformers.activations import get_activation\n\nfrom Config import *\n\n\nclass BertForClass(nn.Module):\n    def __init__(self, config):\n        super(BertForClass, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = self.classifier(concat_out)\n        return logit\n\nclass BertForClass_MultiDropout(nn.Module):\n    def __init__(self, config):\n        super(BertForClass_MultiDropout, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.multi_drop = 5\n        self.multi_dropouts = nn.ModuleList([nn.Dropout(config.dropout) for _ in range(self.multi_drop)])\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n        for j, dropout in enumerate(self.multi_dropouts):\n            if j == 0:\n                logit = self.classifier(dropout(concat_out)) / self.multi_drop\n            else:\n                logit += self.classifier(dropout(concat_out)) / self.multi_drop\n\n        return logit\n\nclass BertLastTwoCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                          
       output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        logit = self.classifier(pooler_output)\n\n        return logit\n\n\nclass BertLastCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        if self.isDropout:\n            output = self.dropout(pooler_output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastTwoEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 
'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastTwoEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastTwoEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 3, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourCls(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        output = 
self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                 attention_mask=input_masks)\n        sequence_output = output[0]\n        pooler_output = output[1]\n        hidden_states = output[2]\n\n        output = torch.cat(\n            (hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass BertLastFourClsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourClsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        output = torch.cat(\n            (pooler_output, hidden_states[-1][:, 0], hidden_states[-2][:, 0], hidden_states[-3][:, 0], hidden_states[-4][:, 0]), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertLastFourEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 4, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\n\nclass 
BertLastFourEmbeddingsPooler(nn.Module):\n    def __init__(self, config):\n        super(BertLastFourEmbeddingsPooler, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 5, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        hidden_states1 = torch.mean(hidden_states[-1], dim=1)\n        hidden_states2 = torch.mean(hidden_states[-2], dim=1)\n        hidden_states3 = torch.mean(hidden_states[-3], dim=1)\n        hidden_states4 = torch.mean(hidden_states[-4], dim=1)\n        output = torch.cat(\n            (pooler_output, hidden_states1, hidden_states2, hidden_states3, hidden_states4), dim=1)\n        if self.isDropout:\n            output = self.dropout(output)\n        logit = self.classifier(output)\n\n        return logit\n\nclass BertDynCls(nn.Module):\n    def __init__(self, config):\n        super(BertDynCls, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = hidden_states[-(i + 1)][0]\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = hid_avg\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, 
self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\nclass BertDynEmbeddings(nn.Module):\n    def __init__(self, config):\n        super(BertDynEmbeddings, self).__init__()\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dynWeight = nn.Linear(self.bert_config.hidden_size, 1)\n        self.dence = nn.Linear(self.bert_config.hidden_size, 512)\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(512, config.num_class)\n\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        batch_size = pooler_output.shape[0]\n\n        hid_avg_list = None\n        weight_list = None\n        for i, hidden in enumerate(hidden_states):\n            hid_avg = torch.mean(hidden_states[-(i + 1)], dim=1)\n            weight = self.dynWeight(hid_avg).repeat(1, self.bert_config.hidden_size)\n            if hid_avg_list is None:\n                hid_avg_list = hid_avg\n            else:\n                hid_avg_list = torch.cat((hid_avg_list, hid_avg), dim=1)\n\n            if weight_list is None:\n                weight_list = hid_avg\n            else:\n                weight_list = torch.cat((weight_list, weight), dim=1)\n\n        concat_out = weight_list.mul_(hid_avg_list)\n        concat_out = concat_out.reshape(batch_size, -1, self.bert_config.hidden_size)\n        concat_out = torch.sum(concat_out, dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n\n        concat_out = self.dence(concat_out)\n        logit = self.classifier(concat_out)\n\n        return logit\n\n\nclass BertRNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertRNN, self).__init__()\n        self.rnn_type = \"gru\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.1\n        self.n_classes = config.num_class\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.num_directions = 1 if not self.bidirectional else 2\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               hidden_size=self.hidden_dim,\n                               
num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        self.dropout = nn.Dropout(self.drop_out)\n        self.fc_rnn = nn.Linear(self.hidden_dim * self.num_directions, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(sequence_output)\n        else:\n            output, (hidden, cell) = self.rnn(sequence_output)\n\n        # output = [ batch size, sent len, hidden_dim * bidirectional]\n        batch_size, max_seq_len, hidden_dim = output.shape\n        hidden = torch.transpose(hidden, 1, 0)\n        hidden = torch.mean(torch.reshape(hidden, [batch_size, -1, hidden_dim]), dim=1)\n        output = torch.sum(output, dim=1)\n        fc_input = self.dropout(output + hidden)\n\n        # output = torch.mean(output, dim=1)\n        # fc_input = self.dropout(output)\n        out = self.fc_rnn(fc_input)\n\n        return out\n\n\nclass BertCNN(nn.Module):\n\n    def __init__(self, config):\n        super(BertCNN, self).__init__()\n        self.num_filters = 100\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n        self.hidden_size = self.bert_config.to_dict()['hidden_size']\n        self.filter_sizes = {3, 4, 5}\n        self.drop_out = 0.5\n\n        self.convs = nn.ModuleList(\n            [nn.Conv2d(1, self.num_filters, (k, self.hidden_size)) for k in self.filter_sizes])\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.dropout = nn.Dropout(self.drop_out)\n\n        self.fc_cnn = nn.Linear(self.num_filters * len(self.filter_sizes), config.num_class)\n\n    def conv_and_pool(self, x, conv):\n        x = F.relu(conv(x)).squeeze(3)\n        x = F.max_pool1d(x, x.size(2)).squeeze(2)\n        return x\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                    
    attention_mask=input_masks)\n\n        sequence_output = self.dropout(sequence_output)\n        out = sequence_output.unsqueeze(1)\n        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)\n        out = self.dropout(out)\n        out = self.fc_cnn(out)\n        return out\n\n\nclass BertRCNN(nn.Module):\n    def __init__(self, config):\n        super(BertRCNN, self).__init__()\n        self.rnn_type = \"lstm\"\n        self.bidirectional = True\n        self.hidden_dim = 256\n        self.n_layers = 2\n        self.batch_first = True\n        self.drop_out = 0.5\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json,\n                                                                 output_hidden_states=True)\n\n        if self.rnn_type == 'lstm':\n            self.rnn = nn.LSTM(self.bert_config.to_dict()['hidden_size'],\n                               self.hidden_dim,\n                               num_layers=self.n_layers,\n                               bidirectional=self.bidirectional,\n                               batch_first=self.batch_first,\n                               dropout=self.drop_out)\n        elif self.rnn_type == 'gru':\n            self.rnn = nn.GRU(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n        else:\n            self.rnn = nn.RNN(self.bert_config.to_dict()['hidden_size'],\n                              hidden_size=self.hidden_dim,\n                              num_layers=self.n_layers,\n                              bidirectional=self.bidirectional,\n                              batch_first=self.batch_first,\n                              dropout=self.drop_out)\n\n        # self.maxpool = nn.MaxPool1d()\n\n\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n        self.fc = nn.Linear(self.hidden_dim * self.n_layers, config.num_class)\n        self.dropout = nn.Dropout(self.drop_out)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n\n        sequence_output, pooler_output, hidden_states = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n\n        sentence_len = sequence_output.shape[1]\n        pooler_output = pooler_output.unsqueeze(dim=1).repeat(1, sentence_len, 1)\n        bert_sentence = sequence_output + pooler_output\n\n        self.rnn.flatten_parameters()\n        if self.rnn_type in ['rnn', 'gru']:\n            output, hidden = self.rnn(bert_sentence)\n        else:\n            output, (hidden, cell) = self.rnn(bert_sentence)\n\n        batch_size, max_seq_len, hidden_dim = output.shape\n        out = torch.transpose(output.relu(), 1, 2)\n\n        out = F.max_pool1d(out, max_seq_len).squeeze()\n        out = self.fc(out)\n\n        return out\n\n\nclass XLNet(nn.Module):\n\n    def __init__(self, config):\n        super(XLNet, self).__init__()\n        self.xlnet = XLNetModel.from_pretrained(config.model_path)\n\n        self.isDropout = True if 0 < config.dropout < 1 else 
False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.fc = nn.Linear(self.xlnet.d_model, config.num_class)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output = self.xlnet(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n        sequence_output = torch.sum(sequence_output[0], dim=1)\n        if self.isDropout:\n            sequence_output = self.dropout(sequence_output)\n        out = self.fc(sequence_output)\n        return out\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\nclass Electra(nn.Module):\n\n    def __init__(self, config):\n        super(Electra, self).__init__()\n        self.electra = ElectraModel.from_pretrained(config.model_path)\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.electra_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        self.electra_config.num_labels = config.num_class\n        self.fc = ElectraClassificationHead(self.electra_config)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        discriminator_hidden_states = self.electra(input_ids=input_ids, token_type_ids=segment_ids,\n                                     attention_mask=input_masks)\n\n        sequence_output = discriminator_hidden_states[0]\n        out = self.fc(sequence_output)\n        return out\n\nclass NEZHA(nn.Module):\n    def __init__(self, config):\n        super(NEZHA, self).__init__()\n        self.n_classes = config.num_class\n\n        config_json = 'bert_config.json' if os.path.exists(config.model_path + 'bert_config.json') else 'config.json'\n        self.bert_config = CONFIGS[config.model].from_pretrained(config.model_path + config_json)\n        #self.bert_model = MODELS[config.model](config=self.bert_config)\n        self.bert_model = MODELS[config.model].from_pretrained(config.model_path, config=self.bert_config)\n\n        # NEZHA init\n        #torch_init_model(self.bert_model, os.path.join(config.model_path, 'pytorch_model.bin'))\n        self.isDropout = True if 0 < config.dropout < 1 else False\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.classifier = nn.Linear(self.bert_config.hidden_size * 2, self.n_classes)\n\n    def forward(self, input_ids, input_masks, segment_ids):\n        sequence_output, pooler_output = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids,\n                                                                        attention_mask=input_masks)\n        seq_avg = torch.mean(sequence_output, dim=1)\n        concat_out = torch.cat((seq_avg, pooler_output), dim=1)\n\n        if self.isDropout:\n            concat_out = self.dropout(concat_out)\n        logit = 
self.classifier(concat_out)\n        return logit\n\n\n"
  },
  {
    "path": "code/nezha-base-count5/finetuning/models/gitkeep",
    "content": ""
  },
  {
    "path": "code/nezha-base-count5/finetuning/multi_gpu_QA.py",
    "content": "from tqdm import tqdm, trange\nimport numpy as np\nimport pandas as pd\nimport logging\nimport torch\nimport random\nimport os\nfrom torch import nn, optim\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nfrom transformers.optimization import get_linear_schedule_with_warmup\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.metrics import mean_absolute_error, accuracy_score, f1_score, roc_auc_score\nfrom model import *\nfrom utils import *\nimport time\n\nimport logging\nlogging.basicConfig(level=logging.DEBUG, filename=\"train.log\",filemode='a')\n\n\nfrom NEZHA.modeling_nezha import *\n\nMODEL_CLASSES = {\n    'BertForClass': BertForClass,\n    'BertLastCls': BertLastCls,\n    'BertLastTwoCls': BertLastTwoCls,\n    'BertLastTwoClsPooler': BertLastTwoClsPooler,\n    'BertLastTwoEmbeddings': BertLastTwoEmbeddings,\n    'BertLastTwoEmbeddingsPooler': BertLastTwoEmbeddingsPooler,\n    'BertLastFourCls': BertLastFourCls,\n    'BertLastFourClsPooler': BertLastFourClsPooler,\n    'BertLastFourEmbeddings': BertLastFourEmbeddings,\n    'BertLastFourEmbeddingsPooler': BertLastFourEmbeddingsPooler,\n    'BertDynCls': BertDynCls,\n    'BertDynEmbeddings': BertDynEmbeddings,\n    'BertRNN': BertRNN,\n    'BertCNN': BertCNN,\n    'BertRCNN': BertRCNN,\n    'XLNet': XLNet,\n    'Electra': Electra,\n    'NEZHA': NEZHA,\n\n}\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"NEZHA\"\n        self.Stratification = False\n        self.model_path = '../pretrain/nezha_model/'\n\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 100\n        self.epoch = 3\n        self.learn_rate = 4e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 32\n        self.k_fold = 10\n        self.seed = 42\n\n        self.device = torch.device('cuda')\n        # self.device = torch.device('cpu')\n\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n\nconfig = Config()\nos.environ['PYTHONHASHSEED']='0'#消除hash算法的随机性\nrandom.seed(config.seed)\nnp.random.seed(config.seed)\ntorch.manual_seed(config.seed)\ntorch.cuda.manual_seed_all(config.seed)\n\n\nfile_path = './log/'\n# 创建一个logger\nlogger = logging.getLogger('mylogger')\nlogger.setLevel(logging.DEBUG)\n\n\ntrain = pd.read_csv('/tcdata/gaiic_track3_round1_train_20210228.tsv',sep='\\t',header=None)\nsemi = pd.read_csv('/tcdata/gaiic_track3_round2_train_20210407.tsv',sep='\\t',header=None)\ntrain = pd.concat([train, semi], sort=False)\ntrain.columns=['q1','q2','label']\n\n\ntrain_query1 = train['q1'].values.astype(str)\ntrain_query2 = train['q2'].values.astype(str)\ntrain_label = train['label'].values.astype(int)\n\n\noof_train = np.zeros((len(train), config.num_class), dtype=np.float32)\n\n\n#kf = StratifiedKFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\nkf = KFold(n_splits=config.k_fold, shuffle=True, random_state=config.seed)\n\nfor fold, (train_index, valid_index) in enumerate(kf.split(train_query1, train_label)):\n\n    print('\\n\\n------------fold:{}------------\\n'.format(fold))\n\n    '''\n    q1 = train_query1[train_index]\n    q2 = train_query2[train_index]\n    y = train_label[train_index]\n    '''\n    q1 = train_query1\n    q2 = train_query2\n    y = train_label\n\n\n    val_q1 = train_query1[valid_index]\n    val_q2 = train_query2[valid_index]\n    
val_y = train_label[valid_index]\n\n    train_D = data_generator([q1, q2, y], config, shuffle=True)\n    val_D = data_generator([val_q1, val_q2, val_y], config)\n\n    model = MODEL_CLASSES[config.model](config).to(config.device)\n\n    if torch.cuda.device_count() > 1:\n        print(\"Let's use\", torch.cuda.device_count(), \"GPUs!\")\n        model = torch.nn.DataParallel(model)\n\n\n    if config.pgd:\n        pgd = PGD(model)\n        K = 3\n\n    elif config.fgm:\n        fgm = FGM(model)\n\n    if config.focalloss:\n        loss_fn = FocalLoss(config.num_class)\n    else:\n        loss_fn = nn.CrossEntropyLoss()  # BCEWithLogitsLoss就是把Sigmoid-BCELoss合成一步\n\n\n    num_train_steps = int(len(train) / config.batch_size * config.epoch)\n    param_optimizer = list(model.named_parameters())\n\n    no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n\n    if config.Stratification:\n        bert_params = [x for x in param_optimizer if 'bert' in x[0]]\n        normal_params = [p for n, p in param_optimizer if 'bert' not in n]\n        optimizer_parameters = [\n            {'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n            {'params': normal_params, 'lr': config.normal_lr},\n        ]\n    else:\n        optimizer_parameters = [\n            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},\n            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},\n        ]\n\n    optimizer = AdamW(optimizer_parameters, lr=config.learn_rate) # lr为全局学习率\n    scheduler = get_linear_schedule_with_warmup(\n        optimizer,\n        num_warmup_steps=int(len(train) / config.batch_size / 2),\n        num_training_steps=num_train_steps\n    )\n\n    best_auc = 0\n    PATH = './models/bert_{}.pth'.format(fold)\n    save_model_path = './models/'\n    if not os.path.exists(save_model_path):\n        os.makedirs(save_model_path)\n\n    for e in range(config.epoch):\n        print('\\n------------epoch:{}------------'.format(e))\n        model.train()\n        acc = 0\n        train_len = 0\n        loss_num = 0\n        tq = tqdm(train_D,ncols=70,disable=True)\n        last=time.time()\n        for input_ids, input_masks, segment_ids, labels in tq:\n            label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n            y_pred = model(input_ids, input_masks, segment_ids)\n\n            loss = loss_fn(y_pred, label_t)\n            loss = loss.mean()\n            loss.backward()\n\n            if config.pgd:\n                pgd.backup_grad()\n                # 对抗训练\n                for t in range(K):\n                    pgd.attack(is_first_attack=(t == 0))  # 在embedding上添加对抗扰动, first attack时备份param.data\n                    if t != K - 1:\n                        model.zero_grad()\n                    else:\n                        pgd.restore_grad()\n                    y_pred = model(input_ids, input_masks, segment_ids)\n\n                    loss_adv = loss_fn(y_pred, label_t)\n                    loss_adv = loss_adv.mean()\n                    loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                pgd.restore()  # 恢复embedding参数\n\n            elif config.fgm:\n                # 对抗训练\n                fgm.attack()  # 在embedding上添加对抗扰动\n                y_pred = model(input_ids, input_masks, 
segment_ids)\n                loss_adv = loss_fn(y_pred, label_t)\n                loss_adv = loss_adv.mean()\n                loss_adv.backward()  # 反向传播，并在正常的grad基础上，累加对抗训练的梯度\n                fgm.restore()  # 恢复embedding参数\n\n\n            # 梯度下降，更新参数\n            optimizer.step()\n            scheduler.step()  # Update learning rate schedule\n            model.zero_grad()\n\n            y_pred = np.argmax(y_pred.detach().to(\"cpu\").numpy(), axis=1)\n            acc += sum(y_pred == labels)\n            loss_num += loss.item()\n            train_len += len(labels)\n            tq.set_postfix(fold=fold, epoch=e, loss=loss_num / train_len, acc=acc / train_len)\n        print(f\"微调第{e}轮耗时：{time.time()-last}\")\n        model.eval()\n        with torch.no_grad():\n            y_p = []\n            y_l = []\n            train_logit = None\n            for input_ids, input_masks, segment_ids, labels in tqdm(val_D,disable=True):\n                label_t = torch.tensor(labels, dtype=torch.long).to(config.device)\n\n                y_pred = model(input_ids, input_masks, segment_ids)\n                y_pred = F.softmax(y_pred)\n                y_pred = y_pred.detach().to(\"cpu\").numpy()\n                if train_logit is None:\n                    train_logit = y_pred\n                else:\n                    train_logit = np.vstack((train_logit, y_pred))\n\n                y_p += list(y_pred[:,1])\n\n                y_pred = np.argmax(y_pred, axis=1)\n                y_l += list(y_pred)\n\n\n            f1 = f1_score(val_y, y_l, average=\"macro\")\n            auc_score = roc_auc_score(val_y, y_p)\n            print(\"best_auc:{}  auc_score:{}  f1:{}\\n\".format(best_auc, auc_score, f1))\n            if auc_score >= best_auc:\n                best_auc = auc_score\n                oof_train[valid_index] = np.array(train_logit)\n                #torch.save(model.module.state_dict() if hasattr(model, \"module\") else model.state_dict(), PATH)\n                torch.save(model.module if hasattr(model, \"module\") else model, PATH)\n\n    optimizer.zero_grad()\n\n    del model\n    torch.cuda.empty_cache()\n\n    break\n\n"
  },
  {
    "path": "code/nezha-base-count5/finetuning/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\nclass data_generator:\n    def __init__(self, data, config, shuffle=False):\n        self.data = data\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n        self.steps = len(self.data[0]) // self.batch_size\n        if len(self.data[0]) % self.batch_size != 0:\n            self.steps += 1\n\n    def __len__(self):\n        return self.steps\n\n    def __iter__(self):\n        q1, q2, y = self.data\n        idxs = list(range(len(self.data[0])))\n        if self.shuffle:\n            np.random.shuffle(idxs)\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        for index, i in enumerate(idxs):\n\n            text = q1[i]\n            text_pair = q2[i]\n            '''\n            # text = self.tokenizer(text, text_pair, padding='max_length', truncation=True, max_length=self.max_length)\n            text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n            input_ids.append(text['input_ids'])\n            segment_ids.append(text['token_type_ids'])\n            input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n            yield input_ids, input_masks, segment_ids, labels\n            input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n            '''\n            tkRes = self.tokenizer(text, text_pair, max_length=self.max_length, truncation='longest_first',\n                                   return_attention_mask=False)\n            input_id = tkRes['input_ids']\n            segment_id = tkRes['token_type_ids']\n            assert len(segment_id) == len(input_id)\n            input_ids.append(input_id)\n            segment_ids.append(segment_id)\n            labels.append(y[i])\n\n            if len(input_ids) == self.batch_size or i == idxs[-1]:\n                input_ids = paddingList(input_ids, 0, returnTensor=True)  # 动态padding\n                segment_ids = paddingList(segment_ids, 0, returnTensor=True)\n           
     input_masks = (input_ids != 0)\n                yield input_ids, input_masks, segment_ids, labels\n                input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=0.3, alpha=0.1, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\n\nclass FGM():\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.25, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# 支持多分类和二分类\nclass FocalLoss(nn.Module):\n    \"\"\"\n    This is a implementation of Focal Loss with smooth label cross entropy supported which is proposed in\n    'Focal Loss for Dense Object Detection. 
(https://arxiv.org/abs/1708.02002)'\n    Focal_Loss= -1*alpha*(1-pt)^gamma*log(pt)\n    :param num_class:\n    :param alpha: (tensor) 3D or 4D the scalar factor for this criterion\n    :param gamma: (float,double) gamma > 0 reduces the relative loss\n    for well-classified examples (p>0.5) putting more\n    focus on hard misclassified example\n    :param smooth: (float,double) smooth value when cross entropy\n    :param balance_index: (int) balance class index,\n    should be specific when alpha is float\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n    def __init__(self, num_class, alpha=None, gamma=2,\n                smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Not support alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        # N = input.size(0)\n        # alpha = torch.ones(N, self.num_class)\n        # alpha = alpha * (1 - self.alpha)\n        # alpha = alpha.scatter_(1, target.long(), self.alpha)\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        alpha = alpha[idx]\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true,y_pred):\n    acc = sum(y_pred & y_true) / (sum(y_pred))\n    rec = sum(y_pred & y_true) / (sum(y_true))\n\n    return 2 * acc * rec /(acc + rec)"
  },
  {
    "path": "code/nezha-base-count5/pretrain/NEZHA/configuration_nezha.py",
    "content": "\nfrom transformers import PretrainedConfig\n\nNEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}\n\nclass NeZhaConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of an :class:`~transformers.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n\n        Attributes:\n            pretrained_config_archive_map (Dict[str, str]):\n                A dictionary containing all the available pre-trained checkpoints.\n    \"\"\"\n\n    pretrained_config_archive_map = NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP\n    model_type = \"nezha\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        max_relative_position=64,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        use_relative_position=True,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.max_relative_position = max_relative_position\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.use_relative_position=use_relative_position\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/NEZHA/modeling_nezha.py",
    "content": "import math\nimport os\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.utils.checkpoint\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom transformers.activations import ACT2FN\nfrom transformers.file_utils import (\n    ModelOutput,\n    add_code_sample_docstrings,\n    add_start_docstrings,\n    add_start_docstrings_to_model_forward,\n    replace_return_docstrings,\n)\nfrom transformers.modeling_outputs import (\n    BaseModelOutputWithPastAndCrossAttentions,\n    BaseModelOutputWithPoolingAndCrossAttentions,\n    CausalLMOutputWithCrossAttentions,\n    MaskedLMOutput,\n    MultipleChoiceModelOutput,\n    NextSentencePredictorOutput,\n    QuestionAnsweringModelOutput,\n    SequenceClassifierOutput,\n    TokenClassifierOutput,\n)\nfrom transformers.modeling_utils import (\n    PreTrainedModel,\n    apply_chunking_to_forward,\n    find_pruneable_heads_and_indices,\n    prune_linear_layer,\n)\n\nfrom transformers.models.bert.configuration_bert import BertConfig\n\nimport logging\nlogger = logging.getLogger(__name__)\n\n_CHECKPOINT_FOR_DOC = \"bert-base-uncased\"\n_CONFIG_FOR_DOC = \"BertConfig\"\n_TOKENIZER_FOR_DOC = \"BertTokenizer\"\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\"Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, 
scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert (\n                pointer.shape == array.shape\n            ), f\"Pointer shape {pointer.shape} and array shape {array.shape} mismatched\"\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\ndef relative_position_encoding(depth, max_length=512, max_relative_position=64):\n    vocab_size = max_relative_position * 2 + 1\n    range_vec = torch.arange(max_length)\n    range_mat = range_vec.repeat(max_length).view(max_length, max_length)\n    distance_mat = range_mat - torch.t(range_mat)\n    distance_mat_clipped = torch.clamp(distance_mat, -max_relative_position, max_relative_position)\n    final_mat = distance_mat_clipped + max_relative_position\n\n    embeddings_table = torch.zeros(vocab_size, depth)\n    position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)\n    div_term = torch.exp(torch.arange(0, depth, 2).float() * (-math.log(10000.0) / depth))\n    embeddings_table[:, 0::2] = torch.sin(position * div_term)\n    embeddings_table[:, 1::2] = torch.cos(position * div_term)\n    embeddings_table = embeddings_table.unsqueeze(0).transpose(0, 1).squeeze(1)\n\n    flat_relative_positions_matrix = final_mat.view(-1)\n    one_hot_relative_positions_matrix = torch.nn.functional.one_hot(flat_relative_positions_matrix,\n                                                                    num_classes=vocab_size).float()\n    positions_encoding = torch.matmul(one_hot_relative_positions_matrix, embeddings_table)\n    
my_shape = list(final_mat.size())\n    my_shape.append(depth)\n    positions_encoding = positions_encoding.view(my_shape)\n    return positions_encoding\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.position_embedding_type = getattr(config, \"position_embedding_type\", \"absolute\")\n        if self.position_embedding_type == \"relative_key\" or self.position_embedding_type == \"relative_key_query\":\n            self.max_position_embeddings = config.max_position_embeddings\n            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)\n\n        self.is_decoder = config.is_decoder\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        is_cross_attention = encoder_hidden_states is not None\n\n        if is_cross_attention and past_key_value is not None:\n            # reuse k,v, cross_attentions\n            key_layer = past_key_value[0]\n            value_layer = past_key_value[1]\n            attention_mask = encoder_attention_mask\n        elif is_cross_attention:\n            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))\n            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))\n            attention_mask = encoder_attention_mask\n        elif past_key_value is not None:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)\n            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)\n        else:\n            key_layer = self.transpose_for_scores(self.key(hidden_states))\n            value_layer = self.transpose_for_scores(self.value(hidden_states))\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n\n        if self.is_decoder:\n            # if 
cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.\n            # Further calls to cross_attention layer can then reuse all cross-attention\n            # key/value_states (first \"if\" case)\n            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of\n            # all previous decoder key/value_states. Further calls to uni-directional self-attention\n            # can concat previous decoder key/value_states to current projected key/value_states (third \"elif\" case)\n            # if encoder bi-directional self-attention `past_key_value` is always `None`\n            past_key_value = (key_layer, value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n\n        batch_size, num_attention_heads, from_seq_length, to_seq_length = attention_scores.size()\n\n\n        query_layer_t = query_layer.permute(2, 0, 1, 3)\n\n        query_layer_r = query_layer_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                        self.attention_head_size)\n        key_position_scores = torch.matmul(query_layer_r, relations_kv.permute(0, 2, 1))\n        key_position_scores_r = key_position_scores.view(from_seq_length, batch_size,\n                                                         num_attention_heads, from_seq_length)\n        key_position_scores_r_t = key_position_scores_r.permute(1, 2, 0, 3)\n        attention_scores = attention_scores + key_position_scores_r_t\n\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in NeZhaModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n\n        attention_probs_t = attention_probs.permute(2, 0, 1, 3)\n        attentions_probs_r = attention_probs_t.contiguous().view(from_seq_length, batch_size * num_attention_heads,\n                                                                 to_seq_length)\n        value_position_scores = torch.matmul(attentions_probs_r, relations_kv)\n        value_position_scores_r = value_position_scores.view(from_seq_length, batch_size,\n                                                             num_attention_heads, self.attention_head_size)\n        value_position_scores_r_t = value_position_scores_r.permute(1, 2, 0, 3)\n        context_layer = context_layer + value_position_scores_r_t\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)\n\n        if 
self.is_decoder:\n            outputs = outputs + (past_key_value,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        heads, index = find_pruneable_heads_and_indices(\n            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads\n        )\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        self_outputs = self.self(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            encoder_hidden_states,\n            encoder_attention_mask,\n            past_key_value,\n            output_attentions,\n            relations_kv=relations_kv\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = 
self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        self.add_cross_attention = config.add_cross_attention\n        if self.add_cross_attention:\n            assert self.is_decoder, f\"{self} should be used as a decoder model if cross attention is added\"\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_value=None,\n        output_attentions=False,\n        relations_kv=None\n    ):\n        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2\n        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None\n        self_attention_outputs = self.attention(\n            hidden_states,\n            attention_mask,\n            head_mask,\n            output_attentions=output_attentions,\n            past_key_value=self_attn_past_key_value,\n            relations_kv=relations_kv\n        )\n        attention_output = self_attention_outputs[0]\n\n        # if decoder, the last output is tuple of self-attn cache\n        if self.is_decoder:\n            outputs = self_attention_outputs[1:-1]\n            present_key_value = self_attention_outputs[-1]\n        else:\n            outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        cross_attn_present_key_value = None\n        if self.is_decoder and encoder_hidden_states is not None:\n            assert hasattr(\n                self, \"crossattention\"\n            ), f\"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`\"\n\n            # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple\n            cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None\n            cross_attention_outputs = self.crossattention(\n                attention_output,\n                attention_mask,\n                head_mask,\n                encoder_hidden_states,\n                encoder_attention_mask,\n                cross_attn_past_key_value,\n                output_attentions,\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights\n\n            # add cross-attn cache to positions 3,4 of present_key_value tuple\n            cross_attn_present_key_value = cross_attention_outputs[-1]\n            present_key_value = present_key_value + cross_attn_present_key_value\n\n        layer_output = apply_chunking_to_forward(\n            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output\n        )\n        outputs = (layer_output,) + outputs\n\n        # if decoder, return the attn key/values as the last output\n        if self.is_decoder:\n            outputs = 
outputs + (present_key_value,)\n\n        return outputs\n\n    def feed_forward_chunk(self, attention_output):\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        return layer_output\n\n\nclass NeZhaEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n        self.relative_positions_encoding = relative_position_encoding(max_length=config.max_position_embeddings,\n                                                                     depth=int(config.hidden_size / config.num_attention_heads),\n                                                                     max_relative_position=config.max_relative_position).to('cuda')\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=False,\n        output_hidden_states=False,\n        return_dict=False,\n    ):\n        to_seq_length=hidden_states.shape[1]\n        relations_kv = self.relative_positions_encoding[:to_seq_length, :to_seq_length, :]\n        all_hidden_states = () if output_hidden_states else None\n        all_self_attentions = () if output_attentions else None\n        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None\n\n        next_decoder_cache = () if use_cache else None\n        for i, layer_module in enumerate(self.layer):\n            if output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_head_mask = head_mask[i] if head_mask is not None else None\n            past_key_value = past_key_values[i] if past_key_values is not None else None\n\n            if getattr(self.config, \"gradient_checkpointing\", False) and self.training:\n\n                if use_cache:\n                    logger.warn(\n                        \"`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. 
Setting \"\n                        \"`use_cache=False`...\"\n                    )\n                    use_cache = False\n\n                def create_custom_forward(module):\n                    def custom_forward(*inputs):\n                        return module(*inputs, past_key_value, output_attentions)\n\n                    return custom_forward\n\n                layer_outputs = torch.utils.checkpoint.checkpoint(\n                    create_custom_forward(layer_module),\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                )\n            else:\n                layer_outputs = layer_module(\n                    hidden_states,\n                    attention_mask,\n                    layer_head_mask,\n                    encoder_hidden_states,\n                    encoder_attention_mask,\n                    past_key_value,\n                    output_attentions,relations_kv=relations_kv\n                )\n\n            hidden_states = layer_outputs[0]\n            if use_cache:\n                next_decoder_cache += (layer_outputs[-1],)\n            if output_attentions:\n                all_self_attentions = all_self_attentions + (layer_outputs[1],)\n                if self.config.add_cross_attention:\n                    all_cross_attentions = all_cross_attentions + (layer_outputs[2],)\n\n        if output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        if not return_dict:\n            return tuple(\n                v\n                for v in [\n                    hidden_states,\n                    next_decoder_cache,\n                    all_hidden_states,\n                    all_self_attentions,\n                    all_cross_attentions,\n                ]\n                if v is not None\n            )\n        return BaseModelOutputWithPastAndCrossAttentions(\n            last_hidden_state=hidden_states,\n            past_key_values=next_decoder_cache,\n            hidden_states=all_hidden_states,\n            attentions=all_self_attentions,\n            cross_attentions=all_cross_attentions,\n        )\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n  
  def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\n@dataclass\nclass BertForPreTrainingOutput(ModelOutput):\n    \"\"\"\n    Output type of :class:`~transformers.BertForPreTraining`.\n\n    Args:\n        loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction\n            (classification) loss.\n        prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language 
modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation\n            before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,\n            sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: Optional[torch.FloatTensor] = None\n    prediction_logits: torch.FloatTensor = None\n    seq_relationship_logits: torch.FloatTensor = None\n    hidden_states: Optional[Tuple[torch.FloatTensor]] = None\n    attentions: Optional[Tuple[torch.FloatTensor]] = None\n\n\nBERT_START_DOCSTRING = r\"\"\"\n\n    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic\n    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,\n    pruning heads etc.)\n\n    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__\n    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to\n    general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the\n            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model\n            weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`~transformers.BertTokenizer`. See\n            :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for\n            details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):\n            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Segment token indices to indicate first and second portions of the inputs. 
Indices are selected in ``[0,\n            1]``:\n\n            - 0 corresponds to a `sentence A` token,\n            - 1 corresponds to a `sentence B` token.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):\n            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,\n            config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):\n            Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:\n\n            - 1 indicates the head is **not masked**,\n            - 0 indicates the head is **masked**.\n\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated\n            vectors than the model's internal embedding lookup matrix.\n        output_attentions (:obj:`bool`, `optional`):\n            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned\n            tensors for more detail.\n        output_hidden_states (:obj:`bool`, `optional`):\n            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for\n            more detail.\n        return_dict (:obj:`bool`, `optional`):\n            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass NeZhaModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of\n    cross-attention is added between the self-attention layers, following the architecture described in `Attention is\n    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,\n    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration\n    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`\n    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an\n    input to the forward pass.\n    \"\"\"\n\n    def __init__(self, config, add_pooling_layer=True):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = NeZhaEncoder(config)\n\n        self.pooler = BertPooler(config) if add_pooling_layer else None\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=BaseModelOutputWithPoolingAndCrossAttentions,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n        \"\"\"\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            batch_size, seq_length = input_shape\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size, seq_length = input_shape\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x 
seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n\n            token_type_ids=token_type_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next\n    sentence prediction (classification)` head.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        next_sentence_label=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):\n            Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair\n            (see :obj:`input_ids` docstring) Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):\n            Used to hide legacy arguments that have been deprecated.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForPreTraining\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.prediction_logits\n            >>> seq_relationship_logits = outputs.seq_relationship_logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        total_loss = None\n        if labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n\n        if not return_dict:\n            output = (prediction_scores, seq_relationship_score) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return BertForPreTrainingOutput(\n            loss=total_loss,\n            prediction_logits=prediction_scores,\n            seq_relationship_logits=seq_relationship_score,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `language modeling` head on top for CLM fine-tuning. 
\"\"\", BERT_START_DOCSTRING\n)\nclass BertLMHeadModel(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [ r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if not config.is_decoder:\n            logger.warning(\"If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`\")\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        past_key_values=None,\n        use_cache=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in\n            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in\n            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are\n            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``\n        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up decoding.\n\n            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`):\n            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up\n            decoding (see :obj:`past_key_values`).\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')\n            >>> config = BertConfig.from_pretrained(\"bert-base-cased\")\n            >>> config.is_decoder = True\n            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)\n\n            >>> inputs = tokenizer(\"Hello, my dog is cute\", return_tensors=\"pt\")\n            >>> outputs = model(**inputs)\n\n            >>> prediction_logits = outputs.logits\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        if labels is not None:\n            use_cache = False\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        lm_loss = None\n        if labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            labels = labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((lm_loss,) + output) if lm_loss is not None else output\n\n        return CausalLMOutputWithCrossAttentions(\n            loss=lm_loss,\n            logits=prediction_scores,\n            past_key_values=outputs.past_key_values,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            cross_attentions=outputs.cross_attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # cut decoder_input_ids if past is used\n        if past is not None:\n            input_ids = input_ids[:, -1:]\n\n        return {\"input_ids\": input_ids, \"attention_mask\": 
attention_mask, \"past_key_values\": past}\n\n    def _reorder_cache(self, past, beam_idx):\n        reordered_past = ()\n        for layer_past in past:\n            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)\n        return reordered_past\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass NeZhaForMaskedLM(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n    _keys_to_ignore_on_load_missing = [r\"predictions.decoder.bias\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `NeZhaForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.cls.predictions.decoder = new_embeddings\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MaskedLMOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ...,\n            config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored\n            (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``\n        \"\"\"\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        masked_lm_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))\n\n        if not return_dict:\n            output = (prediction_scores,) + outputs[2:]\n            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output\n\n        return MaskedLMOutput(\n            loss=masked_lm_loss,\n            logits=prediction_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        #  add a dummy token\n        assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n        attention_mask = torch.cat([attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1)\n        dummy_token = torch.full(\n            (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n        )\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n        **kwargs\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair\n            (see ``input_ids`` docstring). 
Indices should be in ``[0, 1]``:\n\n            - 0 indicates sequence B is a continuation of sequence A,\n            - 1 indicates sequence B is a random sequence.\n\n        Returns:\n\n        Example::\n\n            >>> from transformers import BertTokenizer, BertForNextSentencePrediction\n            >>> import torch\n\n            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n            >>> prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n            >>> next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n            >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')\n\n            >>> outputs = model(**encoding, labels=torch.LongTensor([1]))\n            >>> logits = outputs.logits\n            >>> assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        if \"next_sentence_label\" in kwargs:\n            warnings.warn(\n                \"The `next_sentence_label` argument is deprecated and will be removed in a future version, use `labels` instead.\",\n                FutureWarning,\n            )\n            labels = kwargs.pop(\"next_sentence_label\")\n\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_scores = self.cls(pooled_output)\n\n        next_sentence_loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_scores.view(-1, 2), labels.view(-1))\n\n        if not return_dict:\n            output = (seq_relationship_scores,) + outputs[2:]\n            return ((next_sentence_loss,) + output) if next_sentence_loss is not None else output\n\n        return NextSentencePredictorOutput(\n            loss=next_sentence_loss,\n            logits=seq_relationship_scores,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled\n    output) e.g. 
for GLUE tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=SequenceClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,\n            config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        loss = None\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequenceClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a\n    softmax) e.g. 
for RocStories/SWAG tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = NeZhaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, num_choices, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=MultipleChoiceModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for computing the multiple choice classification loss. Indices should be in ``[0, ...,\n            num_choices-1]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See\n            :obj:`input_ids` above)\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        inputs_embeds = (\n            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))\n            if inputs_embeds is not None\n            else None\n        )\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n\n        if not return_dict:\n            output = (reshaped_logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MultipleChoiceModelOutput(\n            loss=loss,\n            logits=reshaped_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for\n    Named-Entity-Recognition (NER) tasks.\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=TokenClassifierOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):\n            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -\n            1]``.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        loss = None\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenClassifierOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n\n\n@add_start_docstrings(\n    \"\"\"\n    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).\n    \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n\n    _keys_to_ignore_on_load_unexpected = [r\"pooler\"]\n\n    def __init__(self, config):\n        super().__init__(config)\n       
 self.num_labels = config.num_labels\n\n        self.bert = NeZhaModel(config, add_pooling_layer=False)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format(\"batch_size, sequence_length\"))\n    @add_code_sample_docstrings(\n        tokenizer_class=_TOKENIZER_FOR_DOC,\n        checkpoint=_CHECKPOINT_FOR_DOC,\n        output_type=QuestionAnsweringModelOutput,\n        config_class=_CONFIG_FOR_DOC,\n    )\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        output_attentions=None,\n        output_hidden_states=None,\n        return_dict=False,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the\n            sequence are not taken into account for computing the loss.\n        \"\"\"\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            \n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        total_loss = None\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n        if not return_dict:\n            output = (start_logits, end_logits) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return QuestionAnsweringModelOutput(\n  
          loss=total_loss,\n            start_logits=start_logits,\n            end_logits=end_logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/NLP_Utils.py",
    "content": "import random\nimport json\nimport transformers as _\nfrom transformers1 import BertTokenizer\nimport torch\nfrom torch.utils.data import Dataset,DataLoader\nimport numpy as np\nfrom itertools import chain\n\ndef writeToJsonFile(path: str, obj):\n    with open(path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(obj, ensure_ascii=False,indent=0))\ndef readFromJsonFile(path: str):\n    with open(path, \"r\", encoding=\"utf-8\") as f:\n        return json.loads(f.read())\n\ndef loadData(path):\n    allData=[]\n    with open(path,\"r\") as f:\n        for i in f:\n            i=i.strip().split('\\t')\n            if len(i)==0:#防止空行\n                break\n            if len(i)==3:#训练集\n                a,b,label=i\n                a=a.split(' ')\n                b=b.split(' ')\n            else:#测试集，直接转为id形式\n                a,b,label=i[0],i[1],-1\n                a=a.split(' ')\n                b=b.split(' ')\n            allData.append([a,b,label])\n    return allData\n\ndef calNegPos(ls):#计算正负比例\n    posNum,negNum=0,0\n    for i in ls:\n        if i[2]==0:\n            negNum+=1\n        elif i[2]==1:\n            posNum+=1\n    posNum=1 if posNum==0 else posNum\n    return negNum,posNum,round(negNum/posNum,4)\n\nallData=loadData('/tcdata/gaiic_track3_round1_train_20210228.tsv')+loadData('/tcdata/gaiic_track3_round2_train_20210407.tsv')\ntestA_data = loadData('/tcdata/gaiic_track3_round1_testA_20210228.tsv')\ntestB_data = loadData('/tcdata/gaiic_track3_round1_testB_20210317.tsv')\nrandom.shuffle(allData)\n\ntrain_data=allData+testA_data+testB_data#全量\nvalid_data=allData[-20000:]\nprint(\"训练集样本数量：\", len(train_data))\n\ndef paddingList(ls:list,val,returnTensor=False):\n    ls=ls[:]#不要改变了原list尺寸\n    maxLen=max([len(i) for i in ls])\n    for i in range(len(ls)):\n        ls[i]=ls[i]+[val]*(maxLen-len(ls[i]))\n    return torch.tensor(ls,device='cuda') if returnTensor else ls\n\ndef truncate(a:list,b:list,maxLen):\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    return a,b\n\nclass MLM_Data(Dataset):\n    #传入句子对列表\n    def __init__(self,textLs:list,maxLen:int,tk:BertTokenizer):\n        super().__init__()\n        self.data=textLs\n        self.maxLen=maxLen\n        self.tk=tk\n        self.spNum=len(tk.all_special_tokens)\n        self.tkNum=tk.vocab_size\n\n    def __len__(self):\n        return len(self.data)\n\n    def random_mask(self,text_ids):\n        input_ids, output_ids = [], []\n        rands = np.random.random(len(text_ids))\n        idx=0\n        while idx<len(rands):\n            if rands[idx]<0.15:#需要mask\n                ngram=np.random.choice([1,2,3], p=[0.7,0.2,0.1])#若要mask，进行x_gram mask的概率\n                if ngram==3 and len(rands)<7:#太大的gram不要应用于过短文本\n                    ngram=2\n                if ngram==2 and len(rands)<4:\n                    ngram=1\n                L=idx+1\n                R=idx+ngram#最终需要mask的右边界（开）\n                while L<R and L<len(rands):\n                    rands[L]=np.random.random()*0.15#强制mask\n                    L+=1\n                idx=R\n                if idx<len(rands):\n                    rands[idx]=1#禁止mask片段的下一个token被mask，防止一大片连续mask\n 
           idx+=1\n\n        for r, i in zip(rands, text_ids):\n            if r < 0.15 * 0.8:\n                input_ids.append(self.tk.mask_token_id)\n                output_ids.append(i)#mask预测自己\n            elif r < 0.15 * 0.9:\n                input_ids.append(i)\n                output_ids.append(i)#自己预测自己\n            elif r < 0.15:\n                input_ids.append(np.random.randint(self.spNum,self.tkNum))\n                output_ids.append(i)#随机的一个词预测自己，随机词不会从特殊符号中选取，有小概率抽到自己\n            else:\n                input_ids.append(i)\n                output_ids.append(-100)#保持原样不预测\n\n        return input_ids, output_ids\n\n    #耗时操作在此进行，可用上多进程\n    def __getitem__(self, item):\n        text1,text2,_=self.data[item]#预处理，mask等操作\n        if random.random()>0.5:\n            text1,text2=text2,text1#交换位置\n        text1,text2=truncate(text1,text2,self.maxLen)\n        text1_ids,text2_ids = self.tk.convert_tokens_to_ids(text1),self.tk.convert_tokens_to_ids(text2)\n        text1_ids, out1_ids = self.random_mask(text1_ids)#添加mask预测\n        text2_ids, out2_ids = self.random_mask(text2_ids)\n        input_ids = [self.tk.cls_token_id] + text1_ids + [self.tk.sep_token_id] + text2_ids + [self.tk.sep_token_id]#拼接\n        token_type_ids=[0]*(len(text1_ids)+2)+[1]*(len(text2_ids)+1)\n        labels = [-100] + out1_ids + [-100] + out2_ids + [-100]\n        assert len(input_ids)==len(token_type_ids)==len(labels)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids,'labels':labels}\n\n    @classmethod\n    def collate(cls,batch):\n        input_ids=[i['input_ids'] for i in batch]\n        token_type_ids=[i['token_type_ids'] for i in batch]\n        labels=[i['labels'] for i in batch]\n        input_ids=paddingList(input_ids,0,returnTensor=True)\n        token_type_ids=paddingList(token_type_ids,0,returnTensor=True)\n        labels=paddingList(labels,-100,returnTensor=True)\n        attention_mask=(input_ids!=0)\n        return {'input_ids':input_ids,'token_type_ids':token_type_ids\n                ,'attention_mask':attention_mask,'labels':labels}\n\n\n\n\nunionList=lambda ls:list(chain(*ls))#按元素拼接\nsplitList=lambda x,bs:[x[i:i+bs] for i in range(0,len(x),bs)]#按bs切分\n\n\n#sortBsNum：原序列按多少个bs块为单位排序，可用来增强随机性\n#比如如果每次打乱后都全体一起排序，那每次都是一样的\ndef blockShuffle(data:list,bs:int,sortBsNum,key):\n    random.shuffle(data)#先打乱\n    tail=len(data)%bs#计算碎片长度\n    tail=[] if tail==0 else data[-tail:]\n    data=data[:len(data)-len(tail)]\n    assert len(data)%bs==0#剩下的一定能被bs整除\n    sortBsNum=len(data)//bs if sortBsNum is None else sortBsNum#为None就是整体排序\n    data=splitList(data,sortBsNum*bs)\n    data=[sorted(i,key=key,reverse=True) for i in data]#每个大块进行降排序\n    data=unionList(data)\n    data=splitList(data,bs)#最后，按bs分块\n    random.shuffle(data)#块间打乱\n    data=unionList(data)+tail\n    return data\nfrom torch.utils.data.dataloader import _SingleProcessDataLoaderIter,_MultiProcessingDataLoaderIter\n#每轮迭代重新分块shuffle数据的DataLoader\nclass blockShuffleDataLoader(DataLoader):\n    def __init__(self, dataset: Dataset,sortBsNum,key,**kwargs):\n        assert isinstance(dataset.data,list)#需要有list类型的data属性\n        super().__init__(dataset,**kwargs)#父类的参数传过去\n        self.sortBsNum=sortBsNum\n        self.key=key\n\n    def __iter__(self):\n        #分块shuffle\n        self.dataset.data=blockShuffle(self.dataset.data,self.batch_size,self.sortBsNum,self.key)\n        if self.num_workers == 0:\n            return _SingleProcessDataLoaderIter(self)\n        else:\n            return 
_MultiProcessingDataLoaderIter(self)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/__init__.py",
    "content": ""
  },
  {
    "path": "code/nezha-base-count5/pretrain/nezha_model/gitkeep",
    "content": ""
  },
  {
    "path": "code/nezha-base-count5/pretrain/train_nezha.py",
    "content": "# coding:utf-8\nimport numpy as np\nimport random\nimport os\nrandom.seed(0)\nnp.random.seed(0)#seed应该在main里尽早设置，以防万一\nos.environ['PYTHONHASHSEED'] =str(0)#消除hash算法的随机性\nimport transformers as _\nfrom transformers1 import Trainer, TrainingArguments,BertTokenizer\nfrom NLP_Utils import MLM_Data,train_data,blockShuffleDataLoader\n\nfrom NEZHA.configuration_nezha import NeZhaConfig\nfrom NEZHA.modeling_nezha import NeZhaForMaskedLM\n\nmaxlen=100\nbatch_size=128\nvocab_file_dir = './nezha_model/vocab.txt'\ntokenizer = BertTokenizer.from_pretrained(vocab_file_dir)\n\nconfig = NeZhaConfig(\n    vocab_size=len(tokenizer),\n    hidden_size=768,\n    num_hidden_layers=12,\n    num_attention_heads=12,\n    max_position_embeddings=512,\n)\n\n\n\nmodel = NeZhaForMaskedLM.from_pretrained(\"../../nezha-cn-base/\")\n\nmodel.resize_token_embeddings(len(tokenizer))\nprint(model)\ntrain_MLM_data=MLM_Data(train_data,maxlen,tokenizer)\n#自己定义dataloader，不要用huggingface的\ndl=blockShuffleDataLoader(train_MLM_data,None,key=lambda x:len(x[0])+len(x[1]),shuffle=False\n                          ,batch_size=batch_size,collate_fn=train_MLM_data.collate)\n\ntraining_args = TrainingArguments(\n    output_dir='./nezha_output',\n    overwrite_output_dir=True,\n    num_train_epochs=400,\n    per_device_train_batch_size=batch_size,\n    save_steps=len(dl)*10000,#每10个epoch save一次\n    save_total_limit=3,\n    logging_steps=len(dl),#每个epoch log一次\n    seed=2021,\n    learning_rate=5e-5,\n    weight_decay=0.01,\n    warmup_steps=int(450000*150/batch_size*0.03)\n)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataLoader=dl,\n    prediction_loss_only=True,\n)\n\nif __name__ == '__main__':\n    trainer.train()\n    trainer.save_model('./nezha_model')\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\n__version__ = \"2.11.0\"\n\n# Work around to update TensorFlow's absl.logging threshold which alters the\n# default Python logging output behavior when present.\n# see: https://github.com/abseil/abseil-py/issues/99\n# and: https://github.com/tensorflow/tensorflow/issues/26691#issuecomment-500369493\ntry:\n    import absl.logging\nexcept ImportError:\n    pass\nelse:\n    absl.logging.set_verbosity(\"info\")\n    absl.logging.set_stderrthreshold(\"info\")\n    absl.logging._warn_preinit_stderr = False\n\nimport logging\n\n# Configurations\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig\nfrom .configuration_bart import BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_mmbt import MMBTConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\nfrom .data import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    is_sklearn_available,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n# Files and general utilities\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    PYTORCH_PRETRAINED_BERT_CACHE,\n    PYTORCH_TRANSFORMERS_CACHE,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    TRANSFORMERS_CACHE,\n    WEIGHTS_NAME,\n    add_end_docstrings,\n    add_start_docstrings,\n    cached_path,\n    is_tf_available,\n    is_torch_available,\n)\nfrom .hf_argparser import HfArgumentParser\n\n# Model Cards\nfrom .modelcard import ModelCard\n\n# TF 2.0 <=> PyTorch 
conversion utilities\nfrom .modeling_tf_pytorch_utils import (\n    convert_tf_weight_name_to_pt_weight_name,\n    load_pytorch_checkpoint_in_tf2_model,\n    load_pytorch_model_in_tf2_model,\n    load_pytorch_weights_in_tf2_model,\n    load_tf2_checkpoint_in_pytorch_model,\n    load_tf2_model_in_pytorch_model,\n    load_tf2_weights_in_pytorch_model,\n)\n\n# Pipelines\nfrom .pipelines import (\n    CsvPipelineDataFormat,\n    FeatureExtractionPipeline,\n    FillMaskPipeline,\n    JsonPipelineDataFormat,\n    NerPipeline,\n    PipedPipelineDataFormat,\n    Pipeline,\n    PipelineDataFormat,\n    QuestionAnsweringPipeline,\n    SummarizationPipeline,\n    TextClassificationPipeline,\n    TextGenerationPipeline,\n    TokenClassificationPipeline,\n    TranslationPipeline,\n    pipeline,\n)\n\n# Tokenizers\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer\nfrom .tokenization_bart import BartTokenizer, MBartTokenizer\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer\nfrom .trainer_utils import EvalPrediction\nfrom .training_args import TrainingArguments\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\nif is_sklearn_available():\n    from .data import glue_compute_metrics, xnli_compute_metrics\n\n\n# Modeling\nif is_torch_available():\n    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering, apply_chunking_to_forward\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForPreTraining,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelWithLMHead,\n        AutoModelForTokenClassification,\n        AutoModelForMultipleChoice,\n        MODEL_MAPPING,\n        MODEL_FOR_PRETRAINING_MAPPING,\n        MODEL_WITH_LM_HEAD_MAPPING,\n        MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n        MODEL_FOR_MULTIPLE_CHOICE_MAPPING,\n    )\n\n    from .modeling_bert import (\n        BertPreTrainedModel,\n        BertModel,\n        BertForPreTraining,\n        BertForMaskedLM,\n        BertForNextSentencePrediction,\n        BertForSequenceClassification,\n        BertForMultipleChoice,\n        
BertForTokenClassification,\n        BertForQuestionAnswering,\n        load_tf_weights_in_bert,\n        BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n        BertLayer,\n    )\n    from .modeling_openai import (\n        OpenAIGPTPreTrainedModel,\n        OpenAIGPTModel,\n        OpenAIGPTLMHeadModel,\n        OpenAIGPTDoubleHeadsModel,\n        load_tf_weights_in_openai_gpt,\n        OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_transfo_xl import (\n        TransfoXLPreTrainedModel,\n        TransfoXLModel,\n        TransfoXLLMHeadModel,\n        AdaptiveEmbedding,\n        load_tf_weights_in_transfo_xl,\n        TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_gpt2 import (\n        GPT2PreTrainedModel,\n        GPT2Model,\n        GPT2LMHeadModel,\n        GPT2DoubleHeadsModel,\n        load_tf_weights_in_gpt2,\n        GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_ctrl import CTRLPreTrainedModel, CTRLModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_LIST\n    from .modeling_xlnet import (\n        XLNetPreTrainedModel,\n        XLNetModel,\n        XLNetLMHeadModel,\n        XLNetForSequenceClassification,\n        XLNetForTokenClassification,\n        XLNetForMultipleChoice,\n        XLNetForQuestionAnsweringSimple,\n        XLNetForQuestionAnswering,\n        load_tf_weights_in_xlnet,\n        XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm import (\n        XLMPreTrainedModel,\n        XLMModel,\n        XLMWithLMHeadModel,\n        XLMForSequenceClassification,\n        XLMForTokenClassification,\n        XLMForQuestionAnswering,\n        XLMForQuestionAnsweringSimple,\n        XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_bart import (\n        BartForSequenceClassification,\n        BartModel,\n        BartForConditionalGeneration,\n        BART_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_marian import MarianMTModel\n    from .tokenization_marian import MarianTokenizer\n    from .modeling_roberta import (\n        RobertaForMaskedLM,\n        RobertaModel,\n        RobertaForSequenceClassification,\n        RobertaForMultipleChoice,\n        RobertaForTokenClassification,\n        RobertaForQuestionAnswering,\n        ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_distilbert import (\n        DistilBertPreTrainedModel,\n        DistilBertForMaskedLM,\n        DistilBertModel,\n        DistilBertForSequenceClassification,\n        DistilBertForQuestionAnswering,\n        DistilBertForTokenClassification,\n        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_camembert import (\n        CamembertForMaskedLM,\n        CamembertModel,\n        CamembertForSequenceClassification,\n        CamembertForMultipleChoice,\n        CamembertForTokenClassification,\n        CamembertForQuestionAnswering,\n        CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_encoder_decoder import EncoderDecoderModel\n    from .modeling_t5 import (\n        T5PreTrainedModel,\n        T5Model,\n        T5ForConditionalGeneration,\n        load_tf_weights_in_t5,\n        T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_albert import (\n        AlbertPreTrainedModel,\n        AlbertModel,\n        AlbertForPreTraining,\n        AlbertForMaskedLM,\n        AlbertForSequenceClassification,\n        AlbertForQuestionAnswering,\n        AlbertForTokenClassification,\n        load_tf_weights_in_albert,\n        
ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_xlm_roberta import (\n        XLMRobertaForMaskedLM,\n        XLMRobertaModel,\n        XLMRobertaForMultipleChoice,\n        XLMRobertaForSequenceClassification,\n        XLMRobertaForTokenClassification,\n        XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n    from .modeling_mmbt import ModalEmbeddings, MMBTModel, MMBTForClassification\n\n    from .modeling_flaubert import (\n        FlaubertModel,\n        FlaubertWithLMHeadModel,\n        FlaubertForSequenceClassification,\n        FlaubertForQuestionAnswering,\n        FlaubertForQuestionAnsweringSimple,\n        FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_electra import (\n        ElectraForPreTraining,\n        ElectraForMaskedLM,\n        ElectraForTokenClassification,\n        ElectraPreTrainedModel,\n        ElectraForSequenceClassification,\n        ElectraModel,\n        load_tf_weights_in_electra,\n        ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_reformer import (\n        ReformerAttention,\n        ReformerLayer,\n        ReformerModel,\n        ReformerModelWithLMHead,\n        REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_longformer import (\n        LongformerModel,\n        LongformerForMaskedLM,\n        LongformerForSequenceClassification,\n        LongformerForMultipleChoice,\n        LongformerForTokenClassification,\n        LongformerForQuestionAnswering,\n        LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization import (\n        AdamW,\n        get_constant_schedule,\n        get_constant_schedule_with_warmup,\n        get_cosine_schedule_with_warmup,\n        get_cosine_with_hard_restarts_schedule_with_warmup,\n        get_linear_schedule_with_warmup,\n    )\n\n    # Trainer\n    from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction\n    from .data.data_collator import DefaultDataCollator, DataCollator, DataCollatorForLanguageModeling\n    from .data.datasets import GlueDataset, TextDataset, LineByLineTextDataset, GlueDataTrainingArguments\n\n    # Benchmarks\n    from .benchmark import PyTorchBenchmark, PyTorchBenchmarkArguments\n\n# TensorFlow\nif is_tf_available():\n    from .modeling_tf_utils import (\n        TFPreTrainedModel,\n        TFSharedEmbeddings,\n        TFSequenceSummary,\n        shape_list,\n        tf_top_k_top_p_filtering,\n    )\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForPreTraining,\n        TFAutoModelForMultipleChoice,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelWithLMHead,\n        TFAutoModelForTokenClassification,\n        TF_MODEL_MAPPING,\n        TF_MODEL_FOR_PRETRAINING_MAPPING,\n        TF_MODEL_WITH_LM_HEAD_MAPPING,\n        TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,\n        TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING,\n        TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,\n    )\n\n    from .modeling_tf_bert import (\n        TFBertPreTrainedModel,\n        TFBertMainLayer,\n        TFBertEmbeddings,\n        TFBertModel,\n        TFBertForPreTraining,\n        TFBertForMaskedLM,\n        TFBertForNextSentencePrediction,\n        TFBertForSequenceClassification,\n        TFBertForMultipleChoice,\n        TFBertForTokenClassification,\n        TFBertForQuestionAnswering,\n        TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_gpt2 import 
(\n        TFGPT2PreTrainedModel,\n        TFGPT2MainLayer,\n        TFGPT2Model,\n        TFGPT2LMHeadModel,\n        TFGPT2DoubleHeadsModel,\n        TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_openai import (\n        TFOpenAIGPTPreTrainedModel,\n        TFOpenAIGPTMainLayer,\n        TFOpenAIGPTModel,\n        TFOpenAIGPTLMHeadModel,\n        TFOpenAIGPTDoubleHeadsModel,\n        TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_transfo_xl import (\n        TFTransfoXLPreTrainedModel,\n        TFTransfoXLMainLayer,\n        TFTransfoXLModel,\n        TFTransfoXLLMHeadModel,\n        TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,\n        TFAdaptiveEmbedding,\n    )\n\n    from .modeling_tf_xlnet import (\n        TFXLNetPreTrainedModel,\n        TFXLNetMainLayer,\n        TFXLNetModel,\n        TFXLNetLMHeadModel,\n        TFXLNetForSequenceClassification,\n        TFXLNetForTokenClassification,\n        TFXLNetForQuestionAnsweringSimple,\n        TF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm import (\n        TFXLMPreTrainedModel,\n        TFXLMMainLayer,\n        TFXLMModel,\n        TFXLMWithLMHeadModel,\n        TFXLMForSequenceClassification,\n        TFXLMForQuestionAnsweringSimple,\n        TF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_xlm_roberta import (\n        TFXLMRobertaForMaskedLM,\n        TFXLMRobertaModel,\n        TFXLMRobertaForSequenceClassification,\n        TFXLMRobertaForTokenClassification,\n        TF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_roberta import (\n        TFRobertaPreTrainedModel,\n        TFRobertaMainLayer,\n        TFRobertaModel,\n        TFRobertaForMaskedLM,\n        TFRobertaForSequenceClassification,\n        TFRobertaForTokenClassification,\n        TFRobertaForQuestionAnswering,\n        TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_camembert import (\n        TFCamembertModel,\n        TFCamembertForMaskedLM,\n        TFCamembertForSequenceClassification,\n        TFCamembertForTokenClassification,\n        TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_flaubert import (\n        TFFlaubertModel,\n        TFFlaubertWithLMHeadModel,\n        TFFlaubertForSequenceClassification,\n        TF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_distilbert import (\n        TFDistilBertPreTrainedModel,\n        TFDistilBertMainLayer,\n        TFDistilBertModel,\n        TFDistilBertForMaskedLM,\n        TFDistilBertForSequenceClassification,\n        TFDistilBertForTokenClassification,\n        TFDistilBertForQuestionAnswering,\n        TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_ctrl import (\n        TFCTRLPreTrainedModel,\n        TFCTRLModel,\n        TFCTRLLMHeadModel,\n        TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_albert import (\n        TFAlbertPreTrainedModel,\n        TFAlbertMainLayer,\n        TFAlbertModel,\n        TFAlbertForPreTraining,\n        TFAlbertForMaskedLM,\n        TFAlbertForMultipleChoice,\n        TFAlbertForSequenceClassification,\n        TFAlbertForQuestionAnswering,\n        TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_t5 import (\n        TFT5PreTrainedModel,\n        TFT5Model,\n        TFT5ForConditionalGeneration,\n        TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    from .modeling_tf_electra 
import (\n        TFElectraPreTrainedModel,\n        TFElectraModel,\n        TFElectraForPreTraining,\n        TFElectraForMaskedLM,\n        TFElectraForTokenClassification,\n        TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST,\n    )\n\n    # Optimization\n    from .optimization_tf import WarmUp, create_optimizer, AdamWeightDecay, GradientAccumulator\n\n    # Trainer\n    from .trainer_tf import TFTrainer\n\n\nif not is_tf_available() and not is_torch_available():\n    logger.warning(\n        \"Neither PyTorch nor TensorFlow >= 2.0 have been found.\"\n        \"Models won't be available and only tokenizers, configuration\"\n        \"and file/data utilities can be used.\"\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/__main__.py",
    "content": "# coding: utf8\ndef main():\n    import sys\n    if (len(sys.argv) < 4 or len(sys.argv) > 6) or sys.argv[1] not in [\"bert\", \"gpt\", \"transfo_xl\", \"gpt2\", \"xlnet\", \"xlm\"]:\n        print(\n        \"This command line utility let you convert original (author released) model checkpoint to pytorch.\\n\"\n        \"It should be used as one of: \\n\"\n        \">> transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \\n\"\n        \">> transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG], \\n\"\n        \">> transformers1 transfo_xl TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG] or \\n\"\n        \">> transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [GPT2_CONFIG] or \\n\"\n        \">> transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME] or \\n\"\n        \">> transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT\")\n    else:\n        if sys.argv[1] == \"bert\":\n            try:\n                from .convert_bert_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) != 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`\")\n            else:\n                PYTORCH_DUMP_OUTPUT = sys.argv.pop()\n                TF_CONFIG = sys.argv.pop()\n                TF_CHECKPOINT = sys.argv.pop()\n                convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"gpt\":\n            from .convert_openai_original_tf_checkpoint_to_pytorch import convert_openai_checkpoint_to_pytorch\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG]`\")\n            else:\n                OPENAI_GPT_CHECKPOINT_FOLDER_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    OPENAI_GPT_CONFIG = sys.argv[4]\n                else:\n                    OPENAI_GPT_CONFIG = \"\"\n                convert_openai_checkpoint_to_pytorch(OPENAI_GPT_CHECKPOINT_FOLDER_PATH,\n                                                    OPENAI_GPT_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"transfo_xl\":\n            try:\n                from .convert_transfo_xl_original_tf_checkpoint_to_pytorch import convert_transfo_xl_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 transfo_xl TF_CHECKPOINT/TF_DATASET_FILE PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                if 'ckpt' in sys.argv[2].lower():\n                    TF_CHECKPOINT = sys.argv[2]\n                    TF_DATASET_FILE = \"\"\n                else:\n                    TF_DATASET_FILE = sys.argv[2]\n                    TF_CHECKPOINT = \"\"\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_transfo_xl_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT, TF_DATASET_FILE)\n        elif sys.argv[1] == \"gpt2\":\n            try:\n                from .convert_gpt2_original_tf_checkpoint_to_pytorch import convert_gpt2_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 4 or len(sys.argv) > 5:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [TF_CONFIG]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n                if len(sys.argv) == 5:\n                    TF_CONFIG = sys.argv[4]\n                else:\n                    TF_CONFIG = \"\"\n                convert_gpt2_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)\n        elif sys.argv[1] == \"xlnet\":\n            try:\n                from .convert_xlnet_original_tf_checkpoint_to_pytorch import convert_xlnet_checkpoint_to_pytorch\n            except ImportError:\n                print(\"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\")\n                raise\n\n            if len(sys.argv) < 5 or len(sys.argv) > 6:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlnet TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT [FINETUNING_TASK_NAME]`\")\n            else:\n                TF_CHECKPOINT = sys.argv[2]\n                TF_CONFIG = sys.argv[3]\n                PYTORCH_DUMP_OUTPUT = sys.argv[4]\n                if len(sys.argv) == 6:\n                    FINETUNING_TASK = sys.argv[5]\n                else:\n                    FINETUNING_TASK = None\n\n                convert_xlnet_checkpoint_to_pytorch(TF_CHECKPOINT,\n                                                    TF_CONFIG,\n                                                    PYTORCH_DUMP_OUTPUT,\n                                                    FINETUNING_TASK)\n        elif sys.argv[1] == \"xlm\":\n            from .convert_xlm_original_pytorch_checkpoint_to_pytorch import convert_xlm_checkpoint_to_pytorch\n\n            if len(sys.argv) != 4:\n                # pylint: disable=line-too-long\n                print(\"Should be used as `transformers1 xlm XLM_CHECKPOINT_PATH PYTORCH_DUMP_OUTPUT`\")\n            else:\n                XLM_CHECKPOINT_PATH = sys.argv[2]\n                PYTORCH_DUMP_OUTPUT = sys.argv[3]\n\n                convert_xlm_checkpoint_to_pytorch(XLM_CHECKPOINT_PATH, PYTORCH_DUMP_OUTPUT)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/activations.py",
    "content": "import logging\nimport math\n\nimport torch\nimport torch.nn.functional as F\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef swish(x):\n    return x * torch.sigmoid(x)\n\n\ndef _gelu_python(x):\n    \"\"\" Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        This is now written in C in torch.nn.functional\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))\n\n\ndef gelu_new(x):\n    \"\"\" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))\n\n\nif torch.__version__ < \"1.4.0\":\n    gelu = _gelu_python\nelse:\n    gelu = F.gelu\n\n\ndef gelu_fast(x):\n    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))\n\n\nACT2FN = {\n    \"relu\": F.relu,\n    \"swish\": swish,\n    \"gelu\": gelu,\n    \"tanh\": torch.tanh,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n}\n\n\ndef get_activation(activation_string):\n    if activation_string in ACT2FN:\n        return ACT2FN[activation_string]\n    else:\n        raise KeyError(\"function {} not found in ACT2FN mapping {}\".format(activation_string, list(ACT2FN.keys())))\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/another_try.py",
    "content": "from transformers import TFBertModel, BertTokenizer, BertConfig\nimport tensorflow as tf\n\nconfig = BertConfig.from_pretrained(\"bert-base-cased\", output_hidden_states=True)\nmodel = TFBertModel.from_pretrained(\"bert-base-cased\", config=config)\n\ntok = BertTokenizer.from_pretrained(\"bert-base-cased\")\ntext = tok.encode(\"Ain't this [MASK] best thing you've ever seen?\")\n\ninputs = tf.constant(text)\noutputs = model.predict(inputs)\n\nprint(outputs)"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/benchmark/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom ..file_utils import is_torch_available\n\n\nif is_torch_available():\n    from .benchmark_args import PyTorchBenchmarkArguments\n    from .benchmark import PyTorchBenchmark\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/benchmark/benchmark.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"\n    Benchmarking the library on inference and training in PyTorch.\n\"\"\"\n\n\nimport inspect\nimport logging\nimport timeit\n\nfrom transformers import MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, PretrainedConfig, is_torch_available\n\nfrom .benchmark_utils import Benchmark, Memory, start_memory_tracing, stop_memory_tracing\n\n\nif is_torch_available():\n    import torch\n    from .benchmark_args import PyTorchBenchmarkArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PyTorchBenchmark(Benchmark):\n\n    args: PyTorchBenchmarkArguments\n    configs: PretrainedConfig\n    framework: str = \"PyTorch\"\n\n    @property\n    def framework_version(self):\n        return torch.__version__\n\n    def train(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_WITH_LM_HEAD_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.train()\n\n            input_ids = torch.randint(\n                model.config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n\n            def compute_loss_and_backprob():\n                # TODO: Not all models call labels argument labels => this hack using the function signature should be corrected once all models have a common name for labels\n                function_argument_names = inspect.getfullargspec(model.forward).args\n                if \"labels\" in function_argument_names:\n                    loss = model(input_ids, labels=input_ids)[0]\n                elif \"lm_labels\" in function_argument_names:\n                    loss = model(input_ids, lm_labels=input_ids)[0]\n                elif \"masked_lm_labels\" in function_argument_names:\n                    loss = model(input_ids, masked_lm_labels=input_ids)[0]\n                else:\n                    NotImplementedError(f\"{model_name} does not seem to allow training with labels\")\n\n                loss.backward()\n                model.zero_grad()\n\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    torch.cuda.reset_peak_memory_stats()\n\n                # calculate loss and do backpropagation\n                compute_loss_and_backprob()\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    memory = Memory(torch.cuda.max_memory_reserved())\n\n                
return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: compute_loss_and_backprob(), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n\n    def inference(self, model_name, batch_size, sequence_length, trace_memory=False):\n        try:\n            config = self.config_dict[model_name]\n            model = MODEL_MAPPING[config.__class__](config)\n            model.to(self.args.device)\n            model.eval()\n\n            input_ids = torch.randint(\n                config.vocab_size, (batch_size, sequence_length), dtype=torch.long, device=self.args.device\n            )\n            if trace_memory is True:\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    trace = start_memory_tracing(\"transformers1\")\n                else:\n                    # clear cuda cache\n                    torch.cuda.empty_cache()\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        torch.cuda.reset_peak_memory_stats()\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        torch.cuda.reset_max_memory_cached()\n\n                model(input_ids)\n\n                if self.args.trace_memory_line_by_line or self.args.n_gpu == 0:\n                    summary = stop_memory_tracing(trace)\n                    memory = summary.total\n                else:\n                    if hasattr(torch.cuda, \"max_memory_reserved\"):\n                        memory = Memory(torch.cuda.max_memory_reserved())\n                    else:\n                        logger.info(\n                            \"Please consider updating PyTorch to version 1.4 to get more accuracy on GPU memory usage\"\n                        )\n                        memory = Memory(torch.cuda.max_memory_cached())\n\n                return memory\n            else:\n                # as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average\n                runtimes = timeit.repeat(lambda: model(input_ids), repeat=self.args.repeat, number=10,)\n                return min(runtimes) / 10.0\n\n        except RuntimeError as e:\n            self.print_fn(\"Doesn't fit on GPU. {}\".format(e))\n            return \"N/A\"\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/benchmark/benchmark_args.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom ..file_utils import cached_property, is_torch_available, torch_required\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    import torch\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass PyTorchBenchmarkArguments(BenchmarkArguments):\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Whether to run on available cuda devices\"})\n    torchscript: bool = field(default=False, metadata={\"help\": \"Trace the models using torchscript\"})\n    fp16: bool = field(default=False, metadata={\"help\": \"Use FP16 to accelerate inference.\"})\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        else:\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device_idx(self) -> int:\n        return torch.cuda.current_device()\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/benchmark/benchmark_args_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport dataclasses\nimport json\nfrom dataclasses import dataclass, field\nfrom time import time\nfrom typing import List\n\n\ndef list_field(default=None, metadata=None):\n    return field(default_factory=lambda: default, metadata=metadata)\n\n\n@dataclass\nclass BenchmarkArguments:\n    \"\"\"\n    BenchMarkArguments are arguments we use in our benchmark scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    models: List[str] = list_field(\n        default=[],\n        metadata={\n            \"help\": \"Model checkpoints to be provided to the AutoModel classes. Leave blank to benchmark the base version of all available models\"\n        },\n    )\n\n    batch_sizes: List[int] = list_field(\n        default=[8], metadata={\"help\": \"List of batch sizes for which memory and time performance will be evaluated\"}\n    )\n\n    sequence_lengths: List[int] = list_field(\n        default=[8, 32, 128, 512],\n        metadata={\"help\": \"List of sequence lengths for which memory and time performance will be evaluated\"},\n    )\n\n    no_inference: bool = field(default=False, metadata={\"help\": \"Don't benchmark inference of model\"})\n    training: bool = field(default=False, metadata={\"help\": \"Benchmark training of model\"})\n    verbose: bool = field(default=False, metadata={\"help\": \"Verbose memory tracing\"})\n    no_speed: bool = field(default=False, metadata={\"help\": \"Don't perform speed measurments\"})\n    no_memory: bool = field(default=False, metadata={\"help\": \"Don't perform memory measurments\"})\n    trace_memory_line_by_line: bool = field(default=False, metadata={\"help\": \"Trace memory line by line\"})\n    save_to_csv: bool = field(default=False, metadata={\"help\": \"Save result to a CSV file\"})\n    log_print: bool = field(default=False, metadata={\"help\": \"Save all print statements in a log file\"})\n    no_env_print: bool = field(default=False, metadata={\"help\": \"Don't print environment information\"})\n    inference_time_csv_file: str = field(\n        default=f\"inference_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv.\"},\n    )\n    inference_memory_csv_file: str = field(\n        default=f\"inference_memory_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving memory results to csv.\"},\n    )\n    train_time_csv_file: str = field(\n        default=f\"train_time_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving time results to csv for training.\"},\n    )\n    train_memory_csv_file: str = field(\n        default=f\"train_memory_{round(time())}.csv\",\n        metadata={\"help\": 
\"CSV filename used if saving memory results to csv for training.\"},\n    )\n    env_info_csv_file: str = field(\n        default=f\"env_info_{round(time())}.csv\",\n        metadata={\"help\": \"CSV filename used if saving environment information.\"},\n    )\n    log_filename: str = field(\n        default=f\"log_{round(time())}.csv\",\n        metadata={\"help\": \"Log filename used if print statements are saved in log.\"},\n    )\n    repeat: int = field(default=3, metadata={\"help\": \"Times an experiment will be run.\"})\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    @property\n    def model_names(self):\n        return self.models\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/benchmark/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport copy\nimport csv\nimport linecache\nimport logging\nimport os\nimport platform\nimport sys\nfrom abc import ABC, abstractmethod\nfrom collections import defaultdict, namedtuple\nfrom datetime import datetime\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom transformers import AutoConfig, PretrainedConfig\nfrom transformers import __version__ as version\n\nfrom ..file_utils import is_tf_available, is_torch_available\nfrom .benchmark_args_utils import BenchmarkArguments\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\n\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\nBenchmarkOutput = namedtuple(\n    \"BenchmarkOutput\", [\"time_inference_result\", \"memory_inference_result\", \"time_train_result\", \"memory_train_result\"]\n)\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable str of the number of mega bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return str(bytes_to_mega_bytes(self.bytes))\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` 
namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    current: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. 
Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. \"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. 
\" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n\n            for i in devices:\n                handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - 
`memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        memory_curr_trace = []\n\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n\n        for ((frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem),) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n\n            memory_curr_trace.append(\n                MemoryState(\n                    frame=frame,\n                    cpu=Memory(next_cpu_mem),\n                    gpu=Memory(next_gpu_mem),\n                    cpu_gpu=Memory(next_gpu_mem + next_cpu_mem),\n                )\n            )\n\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n   
         cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        memory_curr_trace = sorted(memory_curr_trace, key=lambda x: x.cpu_gpu.bytes, reverse=True)\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n\n        total_memory = Memory(total_memory)\n\n        return MemorySummary(\n            sequential=memory_diff_trace, cumulative=cumulative_memory, current=memory_curr_trace, total=total_memory,\n        )\n\n    return None\n\n\ndef bytes_to_mega_bytes(memory_amount: int) -> int:\n    \"\"\" Utility to convert a number of bytes (int) into a number of mega bytes (int)\n    \"\"\"\n    return memory_amount >> 20\n\n\nclass Benchmark(ABC):\n    \"\"\"\n    Benchmarks is a simple but feature-complete benchmarking script\n    to compare memory and time performance of models in Transformers.\n    \"\"\"\n\n    args: BenchmarkArguments\n    configs: PretrainedConfig\n    framework: str\n\n    def __init__(self, args: BenchmarkArguments = None, configs: PretrainedConfig = None):\n        self.args = args\n\n        if configs is None:\n            self.config_dict = {\n                model_name: AutoConfig.from_pretrained(model_name) for model_name in self.args.model_names\n            }\n        else:\n            self.config_dict = {model_name: config for model_name, config in zip(self.args.model_names, configs)}\n\n        self._print_fn = None\n        self._framework_version = None\n        self._environment_info = None\n\n    @property\n    def print_fn(self):\n        if self._print_fn is None:\n            if self.args.log_print:\n                logging.basicConfig(\n                    level=logging.DEBUG,\n                    filename=self.args.log_filename,\n                    filemode=\"a+\",\n                    format=\"%(asctime)-15s %(levelname)-8s %(message)s\",\n                )\n\n                def print_and_log(*args):\n                    logging.info(*args)\n                    print(*args)\n\n                self._print_fn = print_and_log\n            else:\n                self._print_fn = print\n        return self._print_fn\n\n    @property\n    def is_gpu(self):\n        return self.args.n_gpu > 0\n\n    @property\n    @abstractmethod\n    def framework_version(self):\n        pass\n\n    @abstractmethod\n    def train(self, model_name, batch_size, sequence_length):\n        pass\n\n    @abstractmethod\n    def inference(self, model_name, batch_size, sequence_length):\n        pass\n\n    def run(self):\n        result_dict = {model_name: {} for model_name in self.args.model_names}\n        inference_result_time = copy.deepcopy(result_dict)\n        inference_result_memory = copy.deepcopy(result_dict)\n        train_result_time = copy.deepcopy(result_dict)\n        train_result_memory = copy.deepcopy(result_dict)\n\n        for c, 
model_name in enumerate(self.args.model_names):\n            self.print_fn(f\"{c + 1} / {len(self.args.model_names)}\")\n\n            model_dict = {\n                \"bs\": self.args.batch_sizes,\n                \"ss\": self.args.sequence_lengths,\n                \"result\": {i: {} for i in self.args.batch_sizes},\n            }\n            inference_result_time[model_name] = copy.deepcopy(model_dict)\n            inference_result_memory[model_name] = copy.deepcopy(model_dict)\n            train_result_time[model_name] = copy.deepcopy(model_dict)\n            train_result_memory[model_name] = copy.deepcopy(model_dict)\n\n            for batch_size in self.args.batch_sizes:\n                for sequence_length in self.args.sequence_lengths:\n                    if not self.args.no_inference:\n                        if not self.args.no_memory:\n                            memory = self.inference(model_name, batch_size, sequence_length, trace_memory=True)\n                            inference_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            inference_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n                    if self.args.training:\n                        if not self.args.no_memory:\n                            memory = self.train(model_name, batch_size, sequence_length, trace_memory=True)\n                            train_result_memory[model_name][\"result\"][batch_size][sequence_length] = memory\n                        if not self.args.no_speed:\n                            time = self.inference(model_name, batch_size, sequence_length, trace_memory=False)\n                            train_result_time[model_name][\"result\"][batch_size][sequence_length] = time\n\n        if not self.args.no_inference:\n            if not self.args.no_speed:\n                self.print_fn(\"======= INFERENCE - SPEED - RESULT =======\")\n                self.print_results(inference_result_time)\n                self.save_to_csv(inference_result_time, self.args.inference_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= INFERENCE - MEMORY - RESULT =======\")\n                self.print_results(inference_result_memory)\n                self.save_to_csv(inference_result_memory, self.args.inference_memory_csv_file)\n\n        if self.args.training:\n            if not self.args.no_speed:\n                self.print_fn(\"======= TRAIN - SPEED - RESULT =======\")\n                self.print_results(train_result_time)\n                self.save_to_csv(train_result_time, self.args.train_time_csv_file)\n\n            if not self.args.no_memory:\n                self.print_fn(\"======= TRAIN - MEMORY - RESULT =======\")\n                self.print_results(train_result_memory)\n                self.save_to_csv(train_result_memory, self.args.train_memory_csv_file)\n\n        if not self.args.no_env_print:\n            self.print_fn(\"\\n======== ENVIRONMENT - INFORMATION ========\")\n            self.print_fn(\n                \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in self.environment_info.items()]) + \"\\n\"\n            )\n\n        if self.args.save_to_csv:\n            with open(self.args.env_info_csv_file, mode=\"w\", newline=\"\") as csv_file:\n                writer = 
csv.writer(csv_file)\n                for key, value in self.environment_info.items():\n                    writer.writerow([key, value])\n\n        return BenchmarkOutput(inference_result_time, inference_result_memory, train_result_time, train_result_memory)\n\n    @property\n    def environment_info(self):\n        if self._environment_info is None:\n            info = {}\n            info[\"transformers_version\"] = version\n            info[\"framework\"] = self.framework\n            info[\"framework_version\"] = self.framework_version\n            info[\"python_version\"] = platform.python_version()\n            info[\"system\"] = platform.system()\n            info[\"cpu\"] = platform.processor()\n            info[\"architecture\"] = platform.architecture()[0]\n            info[\"date\"] = datetime.date(datetime.now())\n            info[\"time\"] = datetime.time(datetime.now())\n\n            try:\n                import psutil\n            except (ImportError):\n                logger.warning(\n                    \"Psutil not installed, we won't log available CPU memory.\"\n                    \"Install psutil (pip install psutil) to log available CPU memory.\"\n                )\n                info[\"cpu_ram_mb\"] = \"N/A\"\n            else:\n                info[\"cpu_ram_mb\"] = bytes_to_mega_bytes(psutil.virtual_memory().total)\n\n            info[\"use_gpu\"] = self.is_gpu\n            if self.is_gpu:\n                info[\"num_gpus\"] = self.args.n_gpu\n                try:\n                    from py3nvml import py3nvml\n\n                    py3nvml.nvmlInit()\n                    handle = py3nvml.nvmlDeviceGetHandleByIndex(self.args.device_idx)\n                except ImportError:\n                    logger.warning(\n                        \"py3nvml not installed, we won't log GPU memory usage. \"\n                        \"Install py3nvml (pip install py3nvml) to log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                except (OSError, py3nvml.NVMLError):\n                    logger.warning(\n                        \"Error while initializing comunication with GPU. 
\" \"We won't log information about GPU.\"\n                    )\n                    info[\"gpu\"] = \"N/A\"\n                    info[\"gpu_ram_mb\"] = \"N/A\"\n                    info[\"gpu_power_watts\"] = \"N/A\"\n                    info[\"gpu_performance_state\"] = \"N/A\"\n                    py3nvml.nvmlShutdown()\n                else:\n                    info[\"gpu\"] = py3nvml.nvmlDeviceGetName(handle)\n                    info[\"gpu_ram_mb\"] = bytes_to_mega_bytes(py3nvml.nvmlDeviceGetMemoryInfo(handle).total)\n                    info[\"gpu_power_watts\"] = py3nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000\n                    info[\"gpu_performance_state\"] = py3nvml.nvmlDeviceGetPerformanceState(handle)\n                    py3nvml.nvmlShutdown()\n\n            self._environment_info = info\n        return self._environment_info\n\n    def print_results(self, result_dict):\n        for model_name in self.args.model_names:\n            self.print_fn(\"\\t\" + f\"======= MODEL CHECKPOINT: {model_name} =======\")\n            for batch_size in result_dict[model_name][\"bs\"]:\n                for sequence_length in result_dict[model_name][\"ss\"]:\n                    result = result_dict[model_name][\"result\"][batch_size][sequence_length]\n                    if isinstance(result, float):\n                        self.print_fn(\n                            f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{(round(1000 * result) / 1000)}s\"\n                        )\n                    else:\n                        self.print_fn(f\"\\t\\t{model_name}/{batch_size}/{sequence_length}: \" f\"{result} MB\")\n\n    def print_memory_trace_statistics(self, summary: MemorySummary):\n        self.print_fn(\n            \"\\nLine by line memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.sequential\n            )\n        )\n        self.print_fn(\n            \"\\nLines with top memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[:6]\n            )\n        )\n        self.print_fn(\n            \"\\nLines with lowest memory consumption:\\n\"\n            + \"\\n\".join(\n                f\"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}\"\n                for state in summary.cumulative[-6:]\n            )\n        )\n        self.print_fn(f\"\\nTotal memory increase: {summary.total}\")\n\n    def save_to_csv(self, result_dict, filename):\n        if not self.args.save_to_csv:\n            return\n        self.print_fn(\"Saving results to csv.\")\n        with open(filename, mode=\"w\") as csv_file:\n\n            assert len(self.args.model_names) > 0, \"At least 1 model should be defined, but got {}\".format(\n                self.model_names\n            )\n\n            fieldnames = [\"model\", \"batch_size\", \"sequence_length\"]\n            writer = csv.DictWriter(csv_file, fieldnames=fieldnames + [\"result\"])\n            writer.writeheader()\n\n            for model_name in self.args.model_names:\n                result_dict_model = result_dict[model_name][\"result\"]\n                for bs in result_dict_model:\n                    for ss in result_dict_model[bs]:\n               
         result_model = result_dict_model[bs][ss]\n                        writer.writerow(\n                            {\n                                \"model\": model_name,\n                                \"batch_size\": bs,\n                                \"sequence_length\": ss,\n                                \"result\": (\"{}\" if not isinstance(result_model, float) else \"{:.4f}\").format(\n                                    result_model\n                                ),\n                            }\n                        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/benchmark_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport linecache\nimport logging\nimport os\nimport sys\nfrom collections import defaultdict\nfrom typing import Iterable, List, NamedTuple, Optional, Union\n\nfrom .file_utils import is_tf_available, is_torch_available\n\n\nif is_torch_available():\n    from torch.cuda import empty_cache as torch_empty_cache\nif is_tf_available():\n    from tensorflow.python.eager import context as tf_context\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\n_is_memory_tracing_enabled = False\n\n\ndef is_memory_tracing_enabled():\n    global _is_memory_tracing_enabled\n    return _is_memory_tracing_enabled\n\n\nclass Frame(NamedTuple):\n    \"\"\" `Frame` is a NamedTuple used to gather the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n    \"\"\"\n\n    filename: str\n    module: str\n    line_number: int\n    event: str\n    line_text: str\n\n\nclass UsedMemoryState(NamedTuple):\n    \"\"\" `UsedMemoryState` are named tuples with the following fields:\n        - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n        - 'cpu_memory': CPU RSS memory state *before* executing the line\n        - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n    \"\"\"\n\n    frame: Frame\n    cpu_memory: int\n    gpu_memory: int\n\n\nclass Memory(NamedTuple):\n    \"\"\" `Memory` NamedTuple have a single field `bytes` and\n        you can get a human readable string of the number of bytes by calling `__repr__`\n            - `byte` (integer): number of bytes,\n    \"\"\"\n\n    bytes: int\n\n    def __repr__(self) -> str:\n        return bytes_to_human_readable(self.bytes)\n\n\nclass MemoryState(NamedTuple):\n    \"\"\" `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n        - `frame` (`Frame`): the current frame (see above)\n        - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n        - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n        - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n\n    frame: Frame\n    cpu: Memory\n    gpu: Memory\n    cpu_gpu: Memory\n\n\nclass MemorySummary(NamedTuple):\n    \"\"\" `MemorySummary` namedtuple otherwise with the fields:\n        - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n            by substracting the memory after executing each line from the memory before executing said line.\n        - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n            obtained by summing repeted memory increase for a line if it's executed several times.\n            The list is sorted from the frame with the largest 
memory consumption to the frame with the smallest (can be negative if memory is released)\n        - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n            Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n    \"\"\"\n\n    sequential: List[MemoryState]\n    cumulative: List[MemoryState]\n    total: Memory\n\n\nMemoryTrace = List[UsedMemoryState]\n\n\ndef start_memory_tracing(\n    modules_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    modules_not_to_trace: Optional[Union[str, Iterable[str]]] = None,\n    events_to_trace: str = \"line\",\n    gpus_to_trace: Optional[List[int]] = None,\n) -> MemoryTrace:\n    \"\"\" Setup line-by-line tracing to record rss mem (RAM) at each line of a module or sub-module.\n        See `../../examples/benchmarks.py for a usage example.\n        Current memory consumption is returned using psutil and in particular is the RSS memory\n            \"Resident Set Size” (the non-swapped physical memory the process is using).\n            See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info\n\n        Args:\n            - `modules_to_trace`: (None, string, list/tuple of string)\n                if None, all events are recorded\n                if string or list of strings: only events from the listed module/sub-module will be recorded (e.g. 'fairseq' or 'transformers1.modeling_gpt2')\n            - `modules_not_to_trace`: (None, string, list/tuple of string)\n                if None, no module is avoided\n                if string or list of strings: events from the listed module/sub-module will not be recorded (e.g. 'torch')\n            - `events_to_trace`: string or list of string of events to be recorded (see official python doc for `sys.settrace` for the list of events)\n                default to line\n            - `gpus_to_trace`: (optional list, default None) list of GPUs to trace. Default to tracing all GPUs\n\n        Return:\n            - `memory_trace` is a list of `UsedMemoryState` for each event (default each line of the traced script).\n                - `UsedMemoryState` are named tuples with the following fields:\n                    - 'frame': a `Frame` namedtuple (see below) storing information on the current tracing frame (current file, location in current file)\n                    - 'cpu_memory': CPU RSS memory state *before* executing the line\n                    - 'gpu_memory': GPU used memory *before* executing the line (sum for all GPUs or for only `gpus_to_trace` if provided)\n\n        `Frame` is a namedtuple used by `UsedMemoryState` to list the current frame state.\n            `Frame` has the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n    \"\"\"\n    try:\n        import psutil\n    except (ImportError):\n        logger.warning(\n            \"Psutil not installed, we won't log CPU memory usage. 
\"\n            \"Install psutil (pip install psutil) to use CPU memory tracing.\"\n        )\n        process = None\n    else:\n        process = psutil.Process(os.getpid())\n\n    try:\n        from py3nvml import py3nvml\n\n        py3nvml.nvmlInit()\n        devices = list(range(py3nvml.nvmlDeviceGetCount())) if gpus_to_trace is None else gpus_to_trace\n        py3nvml.nvmlShutdown()\n    except ImportError:\n        logger.warning(\n            \"py3nvml not installed, we won't log GPU memory usage. \"\n            \"Install py3nvml (pip install py3nvml) to use GPU memory tracing.\"\n        )\n        log_gpu = False\n    except (OSError, py3nvml.NVMLError):\n        logger.warning(\"Error while initializing comunication with GPU. \" \"We won't perform GPU memory tracing.\")\n        log_gpu = False\n    else:\n        log_gpu = is_torch_available() or is_tf_available()\n\n    memory_trace = []\n\n    def traceit(frame, event, args):\n        \"\"\" Tracing method executed before running each line in a module or sub-module\n            Record memory allocated in a list with debugging information\n        \"\"\"\n        global _is_memory_tracing_enabled\n\n        if not _is_memory_tracing_enabled:\n            return traceit\n\n        # Filter events\n        if events_to_trace is not None:\n            if isinstance(events_to_trace, str) and event != events_to_trace:\n                return traceit\n            elif isinstance(events_to_trace, (list, tuple)) and event not in events_to_trace:\n                return traceit\n\n        # Filter modules\n        name = frame.f_globals[\"__name__\"]\n        if not isinstance(name, str):\n            return traceit\n        else:\n            # Filter whitelist of modules to trace\n            if modules_to_trace is not None:\n                if isinstance(modules_to_trace, str) and modules_to_trace not in name:\n                    return traceit\n                elif isinstance(modules_to_trace, (list, tuple)) and all(m not in name for m in modules_to_trace):\n                    return traceit\n\n            # Filter blacklist of modules not to trace\n            if modules_not_to_trace is not None:\n                if isinstance(modules_not_to_trace, str) and modules_not_to_trace in name:\n                    return traceit\n                elif isinstance(modules_not_to_trace, (list, tuple)) and any(m in name for m in modules_not_to_trace):\n                    return traceit\n\n        # Record current tracing state (file, location in file...)\n        lineno = frame.f_lineno\n        filename = frame.f_globals[\"__file__\"]\n        if filename.endswith(\".pyc\") or filename.endswith(\".pyo\"):\n            filename = filename[:-1]\n        line = linecache.getline(filename, lineno).rstrip()\n        traced_state = Frame(filename, name, lineno, event, line)\n\n        # Record current memory state (rss memory) and compute difference with previous memory state\n        cpu_mem = 0\n        if process is not None:\n            mem = process.memory_info()\n            cpu_mem = mem.rss\n\n        gpu_mem = 0\n        if log_gpu:\n            # Clear GPU caches\n            if is_torch_available():\n                torch_empty_cache()\n            if is_tf_available():\n                tf_context.context()._clear_caches()  # See https://github.com/tensorflow/tensorflow/issues/20218#issuecomment-416771802\n\n            # Sum used memory for all GPUs\n            py3nvml.nvmlInit()\n            for i in devices:\n                
handle = py3nvml.nvmlDeviceGetHandleByIndex(i)\n                meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)\n                gpu_mem += meminfo.used\n            py3nvml.nvmlShutdown()\n\n        mem_state = UsedMemoryState(traced_state, cpu_mem, gpu_mem)\n        memory_trace.append(mem_state)\n\n        return traceit\n\n    sys.settrace(traceit)\n\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = True\n\n    return memory_trace\n\n\ndef stop_memory_tracing(\n    memory_trace: Optional[MemoryTrace] = None, ignore_released_memory: bool = True\n) -> Optional[MemorySummary]:\n    \"\"\" Stop memory tracing cleanly and return a summary of the memory trace if a trace is given.\n\n        Args:\n            - `memory_trace` (optional output of start_memory_tracing, default: None): memory trace to convert in summary\n            - `ignore_released_memory` (boolean, default: None): if True we only sum memory increase to compute total memory\n\n        Return:\n            - None if `memory_trace` is None\n            - `MemorySummary` namedtuple otherwise with the fields:\n                - `sequential`: a list of `MemoryState` namedtuple (see below) computed from the provided `memory_trace`\n                    by substracting the memory after executing each line from the memory before executing said line.\n                - `cumulative`: a list of `MemoryState` namedtuple (see below) with cumulative increase in memory for each line\n                    obtained by summing repeted memory increase for a line if it's executed several times.\n                    The list is sorted from the frame with the largest memory consumption to the frame with the smallest (can be negative if memory is released)\n                - `total`: total memory increase during the full tracing as a `Memory` named tuple (see below).\n                    Line with memory release (negative consumption) are ignored if `ignore_released_memory` is `True` (default).\n\n        `Memory` named tuple have fields\n            - `byte` (integer): number of bytes,\n            - `string` (string): same as human readable string (ex: \"3.5MB\")\n\n        `Frame` are namedtuple used to list the current frame state and have the following fields:\n            - 'filename' (string): Name of the file currently executed\n            - 'module' (string): Name of the module currently executed\n            - 'line_number' (int): Number of the line currently executed\n            - 'event' (string): Event that triggered the tracing (default will be \"line\")\n            - 'line_text' (string): Text of the line in the python script\n\n        `MemoryState` are namedtuples listing frame + CPU/GPU memory with the following fields:\n            - `frame` (`Frame`): the current frame (see above)\n            - `cpu`: CPU memory consumed at during the current frame as a `Memory` named tuple\n            - `gpu`: GPU memory consumed at during the current frame as a `Memory` named tuple\n            - `cpu_gpu`: CPU + GPU memory consumed at during the current frame as a `Memory` named tuple\n    \"\"\"\n    global _is_memory_tracing_enabled\n    _is_memory_tracing_enabled = False\n\n    if memory_trace is not None and len(memory_trace) > 1:\n        memory_diff_trace = []\n        cumulative_memory_dict = defaultdict(lambda: [0, 0, 0])\n        for (frame, cpu_mem, gpu_mem), (next_frame, next_cpu_mem, next_gpu_mem) in zip(\n            memory_trace[:-1], memory_trace[1:]\n        ):\n            cpu_mem_inc = 
next_cpu_mem - cpu_mem\n            gpu_mem_inc = next_gpu_mem - gpu_mem\n            cpu_gpu_mem_inc = cpu_mem_inc + gpu_mem_inc\n            memory_diff_trace.append(\n                MemoryState(\n                    frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n                )\n            )\n            cumulative_memory_dict[frame][0] += cpu_mem_inc\n            cumulative_memory_dict[frame][1] += gpu_mem_inc\n            cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc\n\n        cumulative_memory = sorted(\n            list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True\n        )  # order by the total CPU + GPU memory increase\n        cumulative_memory = list(\n            MemoryState(\n                frame=frame, cpu=Memory(cpu_mem_inc), gpu=Memory(gpu_mem_inc), cpu_gpu=Memory(cpu_gpu_mem_inc),\n            )\n            for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory\n        )\n\n        if ignore_released_memory:\n            total_memory = sum(max(0, step_trace.cpu_gpu.bytes) for step_trace in memory_diff_trace)\n        else:\n            total_memory = sum(step_trace.cpu_gpu.bytes for step_trace in memory_diff_trace)\n        total_memory = Memory(total_memory)\n        return MemorySummary(sequential=memory_diff_trace, cumulative=cumulative_memory, total=total_memory)\n\n    return None\n\n\ndef bytes_to_human_readable(memory_amount: int) -> str:\n    \"\"\" Utility to convert a number of bytes (int) in a human readable string (with units)\n    \"\"\"\n    for unit in [\"B\", \"KB\", \"MB\", \"GB\"]:\n        if memory_amount > -1024.0 and memory_amount < 1024.0:\n            return \"{:.3f}{}\".format(memory_amount, unit)\n        memory_amount /= 1024.0\n    return \"{:.3f}TB\".format(memory_amount)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/__init__.py",
    "content": "from abc import ABC, abstractmethod\nfrom argparse import ArgumentParser\n\n\nclass BaseTransformersCLICommand(ABC):\n    @staticmethod\n    @abstractmethod\n    def register_subcommand(parser: ArgumentParser):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def run(self):\n        raise NotImplementedError()\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/convert.py",
    "content": "from argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef convert_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to convert a model TF 1.0 checkpoint in a PyTorch checkpoint.\n    :return: ServeCommand\n    \"\"\"\n    return ConvertCommand(\n        args.model_type, args.tf_checkpoint, args.pytorch_dump_output, args.config, args.finetuning_task_name\n    )\n\n\nclass ConvertCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\n            \"convert\",\n            help=\"CLI tool to run convert model from original \"\n            \"author checkpoints to Transformers PyTorch checkpoints.\",\n        )\n        train_parser.add_argument(\"--model_type\", type=str, required=True, help=\"Model's type.\")\n        train_parser.add_argument(\n            \"--tf_checkpoint\", type=str, required=True, help=\"TensorFlow checkpoint path or folder.\"\n        )\n        train_parser.add_argument(\n            \"--pytorch_dump_output\", type=str, required=True, help=\"Path to the PyTorch savd model output.\"\n        )\n        train_parser.add_argument(\"--config\", type=str, default=\"\", help=\"Configuration file path or folder.\")\n        train_parser.add_argument(\n            \"--finetuning_task_name\",\n            type=str,\n            default=None,\n            help=\"Optional fine-tuning task name if the TF model was a finetuned model.\",\n        )\n        train_parser.set_defaults(func=convert_command_factory)\n\n    def __init__(\n        self,\n        model_type: str,\n        tf_checkpoint: str,\n        pytorch_dump_output: str,\n        config: str,\n        finetuning_task_name: str,\n        *args\n    ):\n        self._logger = getLogger(\"transformers1-cli/converting\")\n\n        self._logger.info(\"Loading model {}\".format(model_type))\n        self._model_type = model_type\n        self._tf_checkpoint = tf_checkpoint\n        self._pytorch_dump_output = pytorch_dump_output\n        self._config = config\n        self._finetuning_task_name = finetuning_task_name\n\n    def run(self):\n        if self._model_type == \"albert\":\n            try:\n                from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"bert\":\n            try:\n                from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (\n                    convert_tf_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"gpt\":\n            from transformers.convert_openai_original_tf_checkpoint_to_pytorch import (\n                convert_openai_checkpoint_to_pytorch,\n            )\n\n            convert_openai_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"transfo_xl\":\n            try:\n                from transformers.convert_transfo_xl_original_tf_checkpoint_to_pytorch import (\n                    convert_transfo_xl_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            if \"ckpt\" in self._tf_checkpoint.lower():\n                TF_CHECKPOINT = self._tf_checkpoint\n                TF_DATASET_FILE = \"\"\n            else:\n                TF_DATASET_FILE = self._tf_checkpoint\n                TF_CHECKPOINT = \"\"\n            convert_transfo_xl_checkpoint_to_pytorch(\n                TF_CHECKPOINT, self._config, self._pytorch_dump_output, TF_DATASET_FILE\n            )\n        elif self._model_type == \"gpt2\":\n            try:\n                from transformers.convert_gpt2_original_tf_checkpoint_to_pytorch import (\n                    convert_gpt2_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. 
Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_gpt2_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)\n        elif self._model_type == \"xlnet\":\n            try:\n                from transformers.convert_xlnet_original_tf_checkpoint_to_pytorch import (\n                    convert_xlnet_checkpoint_to_pytorch,\n                )\n            except ImportError:\n                msg = (\n                    \"transformers1 can only be used from the commandline to convert TensorFlow models in PyTorch, \"\n                    \"In that case, it requires TensorFlow to be installed. Please see \"\n                    \"https://www.tensorflow.org/install/ for installation instructions.\"\n                )\n                raise ImportError(msg)\n\n            convert_xlnet_checkpoint_to_pytorch(\n                self._tf_checkpoint, self._config, self._pytorch_dump_output, self._finetuning_task_name\n            )\n        elif self._model_type == \"xlm\":\n            from transformers.convert_xlm_original_pytorch_checkpoint_to_pytorch import (\n                convert_xlm_checkpoint_to_pytorch,\n            )\n\n            convert_xlm_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)\n        else:\n            raise ValueError(\"--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm]\")\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/download.py",
    "content": "from argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef download_command_factory(args):\n    return DownloadCommand(args.model, args.cache_dir, args.force)\n\n\nclass DownloadCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"download\")\n        download_parser.add_argument(\n            \"--cache-dir\", type=str, default=None, help=\"Path to location to store the models\"\n        )\n        download_parser.add_argument(\n            \"--force\", action=\"store_true\", help=\"Force the model to be download even if already in cache-dir\"\n        )\n        download_parser.add_argument(\"model\", type=str, help=\"Name of the model to download\")\n        download_parser.set_defaults(func=download_command_factory)\n\n    def __init__(self, model: str, cache: str, force: bool):\n        self._model = model\n        self._cache = cache\n        self._force = force\n\n    def run(self):\n        from transformers import AutoModel, AutoTokenizer\n\n        AutoModel.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n        AutoTokenizer.from_pretrained(self._model, cache_dir=self._cache, force_download=self._force)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/env.py",
    "content": "import platform\nfrom argparse import ArgumentParser\n\nfrom transformers import __version__ as version\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\ndef info_command_factory(_):\n    return EnvironmentCommand()\n\n\nclass EnvironmentCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        download_parser = parser.add_parser(\"env\")\n        download_parser.set_defaults(func=info_command_factory)\n\n    def run(self):\n        pt_version = \"not installed\"\n        pt_cuda_available = \"NA\"\n        if is_torch_available():\n            import torch\n\n            pt_version = torch.__version__\n            pt_cuda_available = torch.cuda.is_available()\n\n        tf_version = \"not installed\"\n        tf_cuda_available = \"NA\"\n        if is_tf_available():\n            import tensorflow as tf\n\n            tf_version = tf.__version__\n            try:\n                # deprecated in v2.1\n                tf_cuda_available = tf.test.is_gpu_available()\n            except AttributeError:\n                # returns list of devices, convert to bool\n                tf_cuda_available = bool(tf.config.list_physical_devices(\"GPU\"))\n\n        info = {\n            \"`transformers1` version\": version,\n            \"Platform\": platform.platform(),\n            \"Python version\": platform.python_version(),\n            \"PyTorch version (GPU?)\": \"{} ({})\".format(pt_version, pt_cuda_available),\n            \"Tensorflow version (GPU?)\": \"{} ({})\".format(tf_version, tf_cuda_available),\n            \"Using GPU in script?\": \"<fill in>\",\n            \"Using distributed or parallel set-up in script?\": \"<fill in>\",\n        }\n\n        print(\"\\nCopy-and-paste the text below in your GitHub issue and FILL OUT the two last points.\\n\")\n        print(self.format_dict(info))\n\n        return info\n\n    @staticmethod\n    def format_dict(d):\n        return \"\\n\".join([\"- {}: {}\".format(prop, val) for prop, val in d.items()]) + \"\\n\"\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/run.py",
    "content": "import logging\nfrom argparse import ArgumentParser\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, Pipeline, PipelineDataFormat, pipeline\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\n\ndef try_infer_format_from_ext(path: str):\n    if not path:\n        return \"pipe\"\n\n    for ext in PipelineDataFormat.SUPPORTED_FORMATS:\n        if path.endswith(ext):\n            return ext\n\n    raise Exception(\n        \"Unable to determine file format from file extension {}. \"\n        \"Please provide the format through --format {}\".format(path, PipelineDataFormat.SUPPORTED_FORMATS)\n    )\n\n\ndef run_command_factory(args):\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    format = try_infer_format_from_ext(args.input) if args.format == \"infer\" else args.format\n    reader = PipelineDataFormat.from_str(\n        format=format,\n        output_path=args.output,\n        input_path=args.input,\n        column=args.column if args.column else nlp.default_input_names,\n        overwrite=args.overwrite,\n    )\n    return RunCommand(nlp, reader)\n\n\nclass RunCommand(BaseTransformersCLICommand):\n    def __init__(self, nlp: Pipeline, reader: PipelineDataFormat):\n        self._nlp = nlp\n        self._reader = reader\n\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        run_parser = parser.add_parser(\"run\", help=\"Run a pipeline through the CLI\")\n        run_parser.add_argument(\"--task\", choices=SUPPORTED_TASKS.keys(), help=\"Task to run\")\n        run_parser.add_argument(\"--input\", type=str, help=\"Path to the file to use for inference\")\n        run_parser.add_argument(\"--output\", type=str, help=\"Path to the file that will be used post to write results.\")\n        run_parser.add_argument(\"--model\", type=str, help=\"Name or path to the model to instantiate.\")\n        run_parser.add_argument(\"--config\", type=str, help=\"Name or path to the model's config to instantiate.\")\n        run_parser.add_argument(\n            \"--tokenizer\", type=str, help=\"Name of the tokenizer to use. (default: same as the model name)\"\n        )\n        run_parser.add_argument(\n            \"--column\",\n            type=str,\n            help=\"Name of the column to use as input. 
(For multi columns input as QA use column1,columns2)\",\n        )\n        run_parser.add_argument(\n            \"--format\",\n            type=str,\n            default=\"infer\",\n            choices=PipelineDataFormat.SUPPORTED_FORMATS,\n            help=\"Input format to read from\",\n        )\n        run_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        run_parser.add_argument(\"--overwrite\", action=\"store_true\", help=\"Allow overwriting the output file.\")\n        run_parser.set_defaults(func=run_command_factory)\n\n    def run(self):\n        nlp, outputs = self._nlp, []\n\n        for entry in self._reader:\n            output = nlp(**entry) if self._reader.is_multi_columns else nlp(entry)\n            if isinstance(output, dict):\n                outputs.append(output)\n            else:\n                outputs += output\n\n        # Saving data\n        if self._nlp.binary_output:\n            binary_path = self._reader.save_binary(outputs)\n            logger.warning(\"Current pipeline requires output to be in binary format, saving at {}\".format(binary_path))\n        else:\n            self._reader.save(outputs)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/serving.py",
    "content": "import logging\nfrom argparse import ArgumentParser, Namespace\nfrom typing import Any, List, Optional\n\nfrom transformers import Pipeline\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.pipelines import SUPPORTED_TASKS, pipeline\n\n\ntry:\n    from uvicorn import run\n    from fastapi import FastAPI, HTTPException, Body\n    from fastapi.routing import APIRoute\n    from pydantic import BaseModel\n    from starlette.responses import JSONResponse\n\n    _serve_dependencies_installed = True\nexcept (ImportError, AttributeError):\n    BaseModel = object\n\n    def Body(*x, **y):\n        pass\n\n    _serve_dependencies_installed = False\n\n\nlogger = logging.getLogger(\"transformers1-cli/serving\")\n\n\ndef serve_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    nlp = pipeline(\n        task=args.task,\n        model=args.model if args.model else None,\n        config=args.config,\n        tokenizer=args.tokenizer,\n        device=args.device,\n    )\n    return ServeCommand(nlp, args.host, args.port, args.workers)\n\n\nclass ServeModelInfoResult(BaseModel):\n    \"\"\"\n    Expose model information\n    \"\"\"\n\n    infos: dict\n\n\nclass ServeTokenizeResult(BaseModel):\n    \"\"\"\n    Tokenize result model\n    \"\"\"\n\n    tokens: List[str]\n    tokens_ids: Optional[List[int]]\n\n\nclass ServeDeTokenizeResult(BaseModel):\n    \"\"\"\n    DeTokenize result model\n    \"\"\"\n\n    text: str\n\n\nclass ServeForwardResult(BaseModel):\n    \"\"\"\n    Forward result model\n    \"\"\"\n\n    output: Any\n\n\nclass ServeCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        serve_parser = parser.add_parser(\n            \"serve\", help=\"CLI tool to run inference requests through REST and GraphQL endpoints.\"\n        )\n        serve_parser.add_argument(\n            \"--task\", type=str, choices=SUPPORTED_TASKS.keys(), help=\"The task to run the pipeline on\"\n        )\n        serve_parser.add_argument(\"--host\", type=str, default=\"localhost\", help=\"Interface the server will listen on.\")\n        serve_parser.add_argument(\"--port\", type=int, default=8888, help=\"Port the serving will listen to.\")\n        serve_parser.add_argument(\"--workers\", type=int, default=1, help=\"Number of http workers\")\n        serve_parser.add_argument(\"--model\", type=str, help=\"Model's name or path to stored model.\")\n        serve_parser.add_argument(\"--config\", type=str, help=\"Model's config name or path to stored model.\")\n        serve_parser.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer name to use.\")\n        serve_parser.add_argument(\n            \"--device\",\n            type=int,\n            default=-1,\n            help=\"Indicate the device to run onto, -1 indicates CPU, >= 0 indicates GPU (default: -1)\",\n        )\n        serve_parser.set_defaults(func=serve_command_factory)\n\n    def __init__(self, pipeline: Pipeline, host: str, port: int, workers: int):\n\n        self._pipeline = pipeline\n\n        self.host = host\n        self.port = port\n        self.workers = workers\n\n        if not 
_serve_dependencies_installed:\n            raise RuntimeError(\n                \"Using serve command requires FastAPI and unicorn. \"\n                'Please install transformers1 with [serving]: pip install \"transformers1[serving]\".'\n                \"Or install FastAPI and unicorn separately.\"\n            )\n        else:\n            logger.info(\"Serving model over {}:{}\".format(host, port))\n            self._app = FastAPI(\n                routes=[\n                    APIRoute(\n                        \"/\",\n                        self.model_info,\n                        response_model=ServeModelInfoResult,\n                        response_class=JSONResponse,\n                        methods=[\"GET\"],\n                    ),\n                    APIRoute(\n                        \"/tokenize\",\n                        self.tokenize,\n                        response_model=ServeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/detokenize\",\n                        self.detokenize,\n                        response_model=ServeDeTokenizeResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                    APIRoute(\n                        \"/forward\",\n                        self.forward,\n                        response_model=ServeForwardResult,\n                        response_class=JSONResponse,\n                        methods=[\"POST\"],\n                    ),\n                ],\n                timeout=600,\n            )\n\n    def run(self):\n        run(self._app, host=self.host, port=self.port, workers=self.workers)\n\n    def model_info(self):\n        return ServeModelInfoResult(infos=vars(self._pipeline.model.config))\n\n    def tokenize(self, text_input: str = Body(None, embed=True), return_ids: bool = Body(False, embed=True)):\n        \"\"\"\n        Tokenize the provided input and eventually returns corresponding tokens id:\n        - **text_input**: String to tokenize\n        - **return_ids**: Boolean flags indicating if the tokens have to be converted to their integer mapping.\n        \"\"\"\n        try:\n            tokens_txt = self._pipeline.tokenizer.tokenize(text_input)\n\n            if return_ids:\n                tokens_ids = self._pipeline.tokenizer.convert_tokens_to_ids(tokens_txt)\n                return ServeTokenizeResult(tokens=tokens_txt, tokens_ids=tokens_ids)\n            else:\n                return ServeTokenizeResult(tokens=tokens_txt)\n\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    def detokenize(\n        self,\n        tokens_ids: List[int] = Body(None, embed=True),\n        skip_special_tokens: bool = Body(False, embed=True),\n        cleanup_tokenization_spaces: bool = Body(True, embed=True),\n    ):\n        \"\"\"\n        Detokenize the provided tokens ids to readable text:\n        - **tokens_ids**: List of tokens ids\n        - **skip_special_tokens**: Flag indicating to not try to decode special tokens\n        - **cleanup_tokenization_spaces**: Flag indicating to remove all leading/trailing spaces and intermediate ones.\n        \"\"\"\n        try:\n            decoded_str = self._pipeline.tokenizer.decode(tokens_ids, skip_special_tokens, cleanup_tokenization_spaces)\n            return 
ServeDeTokenizeResult(text=decoded_str)\n        except Exception as e:\n            raise HTTPException(status_code=500, detail={\"model\": \"\", \"error\": str(e)})\n\n    async def forward(self, inputs=Body(None, embed=True)):\n        \"\"\"\n        **inputs**:\n        **attention_mask**:\n        **tokens_type_ids**:\n        \"\"\"\n\n        # Check we don't have empty string\n        if len(inputs) == 0:\n            return ServeForwardResult(output=[])\n\n        try:\n            # Forward through the model\n            output = self._pipeline(inputs)\n            return ServeForwardResult(output=output)\n        except Exception as e:\n            raise HTTPException(500, {\"error\": str(e)})\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/train.py",
    "content": "import os\nfrom argparse import ArgumentParser, Namespace\nfrom logging import getLogger\n\nfrom transformers import SingleSentenceClassificationProcessor as Processor\nfrom transformers import TextClassificationPipeline, is_tf_available, is_torch_available\nfrom transformers.commands import BaseTransformersCLICommand\n\n\nif not is_tf_available() and not is_torch_available():\n    raise RuntimeError(\"At least one of PyTorch or TensorFlow 2.0+ should be installed to use CLI training\")\n\n# TF training parameters\nUSE_XLA = False\nUSE_AMP = False\n\n\ndef train_command_factory(args: Namespace):\n    \"\"\"\n    Factory function used to instantiate serving server from provided command line arguments.\n    :return: ServeCommand\n    \"\"\"\n    return TrainCommand(args)\n\n\nclass TrainCommand(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        \"\"\"\n        Register this command to argparse so it's available for the transformer-cli\n        :param parser: Root parser to register command-specific arguments\n        :return:\n        \"\"\"\n        train_parser = parser.add_parser(\"train\", help=\"CLI tool to train a model on a task.\")\n\n        train_parser.add_argument(\n            \"--train_data\",\n            type=str,\n            required=True,\n            help=\"path to train (and optionally evaluation) dataset as a csv with \"\n            \"tab separated labels and sentences.\",\n        )\n        train_parser.add_argument(\n            \"--column_label\", type=int, default=0, help=\"Column of the dataset csv file with example labels.\"\n        )\n        train_parser.add_argument(\n            \"--column_text\", type=int, default=1, help=\"Column of the dataset csv file with example texts.\"\n        )\n        train_parser.add_argument(\n            \"--column_id\", type=int, default=2, help=\"Column of the dataset csv file with example ids.\"\n        )\n        train_parser.add_argument(\n            \"--skip_first_row\", action=\"store_true\", help=\"Skip the first row of the csv file (headers).\"\n        )\n\n        train_parser.add_argument(\"--validation_data\", type=str, default=\"\", help=\"path to validation dataset.\")\n        train_parser.add_argument(\n            \"--validation_split\",\n            type=float,\n            default=0.1,\n            help=\"if validation dataset is not provided, fraction of train dataset \" \"to use as validation dataset.\",\n        )\n\n        train_parser.add_argument(\"--output\", type=str, default=\"./\", help=\"path to saved the trained model.\")\n\n        train_parser.add_argument(\n            \"--task\", type=str, default=\"text_classification\", help=\"Task to train the model on.\"\n        )\n        train_parser.add_argument(\n            \"--model\", type=str, default=\"bert-base-uncased\", help=\"Model's name or path to stored model.\"\n        )\n        train_parser.add_argument(\"--train_batch_size\", type=int, default=32, help=\"Batch size for training.\")\n        train_parser.add_argument(\"--valid_batch_size\", type=int, default=64, help=\"Batch size for validation.\")\n        train_parser.add_argument(\"--learning_rate\", type=float, default=3e-5, help=\"Learning rate.\")\n        train_parser.add_argument(\"--adam_epsilon\", type=float, default=1e-08, help=\"Epsilon for Adam optimizer.\")\n        train_parser.set_defaults(func=train_command_factory)\n\n    def __init__(self, args: Namespace):\n        self.logger = 
getLogger(\"transformers1-cli/training\")\n\n        self.framework = \"tf\" if is_tf_available() else \"torch\"\n\n        os.makedirs(args.output, exist_ok=True)\n        assert os.path.isdir(args.output)\n        self.output = args.output\n\n        self.column_label = args.column_label\n        self.column_text = args.column_text\n        self.column_id = args.column_id\n\n        self.logger.info(\"Loading {} pipeline for {}\".format(args.task, args.model))\n        if args.task == \"text_classification\":\n            self.pipeline = TextClassificationPipeline.from_pretrained(args.model)\n        elif args.task == \"token_classification\":\n            raise NotImplementedError\n        elif args.task == \"question_answering\":\n            raise NotImplementedError\n\n        self.logger.info(\"Loading dataset from {}\".format(args.train_data))\n        self.train_dataset = Processor.create_from_csv(\n            args.train_data,\n            column_label=args.column_label,\n            column_text=args.column_text,\n            column_id=args.column_id,\n            skip_first_row=args.skip_first_row,\n        )\n        self.valid_dataset = None\n        if args.validation_data:\n            self.logger.info(\"Loading validation dataset from {}\".format(args.validation_data))\n            self.valid_dataset = Processor.create_from_csv(\n                args.validation_data,\n                column_label=args.column_label,\n                column_text=args.column_text,\n                column_id=args.column_id,\n                skip_first_row=args.skip_first_row,\n            )\n\n        self.validation_split = args.validation_split\n        self.train_batch_size = args.train_batch_size\n        self.valid_batch_size = args.valid_batch_size\n        self.learning_rate = args.learning_rate\n        self.adam_epsilon = args.adam_epsilon\n\n    def run(self):\n        if self.framework == \"tf\":\n            return self.run_tf()\n        return self.run_torch()\n\n    def run_torch(self):\n        raise NotImplementedError\n\n    def run_tf(self):\n        self.pipeline.fit(\n            self.train_dataset,\n            validation_data=self.valid_dataset,\n            validation_split=self.validation_split,\n            learning_rate=self.learning_rate,\n            adam_epsilon=self.adam_epsilon,\n            train_batch_size=self.train_batch_size,\n            valid_batch_size=self.valid_batch_size,\n        )\n\n        # Save trained pipeline\n        self.pipeline.save_pretrained(self.output)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/transformers_cli.py",
    "content": "#!/usr/bin/env python\nfrom argparse import ArgumentParser\n\nfrom transformers.commands.convert import ConvertCommand\nfrom transformers.commands.download import DownloadCommand\nfrom transformers.commands.env import EnvironmentCommand\nfrom transformers.commands.run import RunCommand\nfrom transformers.commands.serving import ServeCommand\nfrom transformers.commands.user import UserCommands\n\n\ndef main():\n    parser = ArgumentParser(\"Transformers CLI tool\", usage=\"transformers1-cli <command> [<args>]\")\n    commands_parser = parser.add_subparsers(help=\"transformers1-cli command helpers\")\n\n    # Register commands\n    ConvertCommand.register_subcommand(commands_parser)\n    DownloadCommand.register_subcommand(commands_parser)\n    EnvironmentCommand.register_subcommand(commands_parser)\n    RunCommand.register_subcommand(commands_parser)\n    ServeCommand.register_subcommand(commands_parser)\n    UserCommands.register_subcommand(commands_parser)\n\n    # Let's go\n    args = parser.parse_args()\n\n    if not hasattr(args, \"func\"):\n        parser.print_help()\n        exit(1)\n\n    # Run\n    service = args.func(args)\n    service.run()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/commands/user.py",
    "content": "import os\nimport sys\nfrom argparse import ArgumentParser\nfrom getpass import getpass\nfrom typing import List, Union\n\nfrom requests.exceptions import HTTPError\n\nfrom transformers.commands import BaseTransformersCLICommand\nfrom transformers.hf_api import HfApi, HfFolder\n\n\nUPLOAD_MAX_FILES = 15\n\n\nclass UserCommands(BaseTransformersCLICommand):\n    @staticmethod\n    def register_subcommand(parser: ArgumentParser):\n        login_parser = parser.add_parser(\"login\", help=\"Log in using the same credentials as on huggingface.co\")\n        login_parser.set_defaults(func=lambda args: LoginCommand(args))\n        whoami_parser = parser.add_parser(\"whoami\", help=\"Find out which huggingface.co account you are logged in as.\")\n        whoami_parser.set_defaults(func=lambda args: WhoamiCommand(args))\n        logout_parser = parser.add_parser(\"logout\", help=\"Log out\")\n        logout_parser.set_defaults(func=lambda args: LogoutCommand(args))\n        # s3\n        s3_parser = parser.add_parser(\"s3\", help=\"{ls, rm} Commands to interact with the files you upload on S3.\")\n        s3_subparsers = s3_parser.add_subparsers(help=\"s3 related commands\")\n        ls_parser = s3_subparsers.add_parser(\"ls\")\n        ls_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        ls_parser.set_defaults(func=lambda args: ListObjsCommand(args))\n        rm_parser = s3_subparsers.add_parser(\"rm\")\n        rm_parser.add_argument(\"filename\", type=str, help=\"individual object filename to delete from S3.\")\n        rm_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        rm_parser.set_defaults(func=lambda args: DeleteObjCommand(args))\n        # upload\n        upload_parser = parser.add_parser(\"upload\", help=\"Upload a model to S3.\")\n        upload_parser.add_argument(\n            \"path\", type=str, help=\"Local path of the model folder or individual file to upload.\"\n        )\n        upload_parser.add_argument(\"--organization\", type=str, help=\"Optional: organization namespace.\")\n        upload_parser.add_argument(\n            \"--filename\", type=str, default=None, help=\"Optional: override individual object filename on S3.\"\n        )\n        upload_parser.set_defaults(func=lambda args: UploadCommand(args))\n\n\nclass ANSI:\n    \"\"\"\n    Helper for en.wikipedia.org/wiki/ANSI_escape_code\n    \"\"\"\n\n    _bold = \"\\u001b[1m\"\n    _red = \"\\u001b[31m\"\n    _reset = \"\\u001b[0m\"\n\n    @classmethod\n    def bold(cls, s):\n        return \"{}{}{}\".format(cls._bold, s, cls._reset)\n\n    @classmethod\n    def red(cls, s):\n        return \"{}{}{}\".format(cls._bold + cls._red, s, cls._reset)\n\n\nclass BaseUserCommand:\n    def __init__(self, args):\n        self.args = args\n        self._api = HfApi()\n\n\nclass LoginCommand(BaseUserCommand):\n    def run(self):\n        print(\n            \"\"\"\n        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|\n        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|\n        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|\n        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|\n        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|  
      _|    _|    _|_|_|  _|_|_|_|\n\n        \"\"\"\n        )\n        username = input(\"Username: \")\n        password = getpass()\n        try:\n            token = self._api.login(username, password)\n        except HTTPError as e:\n            # probably invalid credentials, display error message.\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        HfFolder.save_token(token)\n        print(\"Login successful\")\n        print(\"Your token:\", token, \"\\n\")\n        print(\"Your token has been saved to\", HfFolder.path_token)\n\n\nclass WhoamiCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        try:\n            user, orgs = self._api.whoami(token)\n            print(user)\n            if orgs:\n                print(ANSI.bold(\"orgs: \"), \",\".join(orgs))\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n\n\nclass LogoutCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit()\n        HfFolder.delete_token()\n        self._api.logout(token)\n        print(\"Successfully logged out.\")\n\n\nclass ListObjsCommand(BaseUserCommand):\n    def tabulate(self, rows: List[List[Union[str, int]]], headers: List[str]) -> str:\n        \"\"\"\n        Inspired by:\n        stackoverflow.com/a/8356620/593036\n        stackoverflow.com/questions/9535954/printing-lists-as-tabular-data\n        \"\"\"\n        col_widths = [max(len(str(x)) for x in col) for col in zip(*rows, headers)]\n        row_format = (\"{{:{}}} \" * len(headers)).format(*col_widths)\n        lines = []\n        lines.append(row_format.format(*headers))\n        lines.append(row_format.format(*[\"-\" * w for w in col_widths]))\n        for row in rows:\n            lines.append(row_format.format(*row))\n        return \"\\n\".join(lines)\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            objs = self._api.list_objs(token, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        if len(objs) == 0:\n            print(\"No shared file yet\")\n            exit()\n        rows = [[obj.filename, obj.LastModified, obj.ETag, obj.Size] for obj in objs]\n        print(self.tabulate(rows, headers=[\"Filename\", \"LastModified\", \"ETag\", \"Size\"]))\n\n\nclass DeleteObjCommand(BaseUserCommand):\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        try:\n            self._api.delete_obj(token, filename=self.args.filename, organization=self.args.organization)\n        except HTTPError as e:\n            print(e)\n            print(ANSI.red(e.response.text))\n            exit(1)\n        print(\"Done\")\n\n\nclass UploadCommand(BaseUserCommand):\n    def walk_dir(self, rel_path):\n        \"\"\"\n        Recursively list all files in a folder.\n        \"\"\"\n        entries: List[os.DirEntry] = list(os.scandir(rel_path))\n        files = [(os.path.join(os.getcwd(), f.path), f.path) for f in entries if f.is_file()]  # (filepath, filename)\n        for f in 
entries:\n            if f.is_dir():\n                files += self.walk_dir(f.path)\n        return files\n\n    def run(self):\n        token = HfFolder.get_token()\n        if token is None:\n            print(\"Not logged in\")\n            exit(1)\n        local_path = os.path.abspath(self.args.path)\n        if os.path.isdir(local_path):\n            if self.args.filename is not None:\n                raise ValueError(\"Cannot specify a filename override when uploading a folder.\")\n            rel_path = os.path.basename(local_path)\n            files = self.walk_dir(rel_path)\n        elif os.path.isfile(local_path):\n            filename = self.args.filename if self.args.filename is not None else os.path.basename(local_path)\n            files = [(local_path, filename)]\n        else:\n            raise ValueError(\"Not a valid file or directory: {}\".format(local_path))\n\n        if sys.platform == \"win32\":\n            files = [(filepath, filename.replace(os.sep, \"/\")) for filepath, filename in files]\n\n        if len(files) > UPLOAD_MAX_FILES:\n            print(\n                \"About to upload {} files to S3. This is probably wrong. Please filter files before uploading.\".format(\n                    ANSI.bold(len(files))\n                )\n            )\n            exit(1)\n\n        user, _ = self._api.whoami(token)\n        namespace = self.args.organization if self.args.organization is not None else user\n\n        for filepath, filename in files:\n            print(\n                \"About to upload file {} to S3 under filename {} and namespace {}\".format(\n                    ANSI.bold(filepath), ANSI.bold(filename), ANSI.bold(namespace)\n                )\n            )\n\n        choice = input(\"Proceed? [Y/n] \").lower()\n        if not (choice == \"\" or choice == \"y\" or choice == \"yes\"):\n            print(\"Abort\")\n            exit()\n        print(ANSI.bold(\"Uploading... This might take a while if files are large\"))\n        for filepath, filename in files:\n            try:\n                access_url = self._api.presign_and_upload(\n                    token=token, filename=filename, filepath=filepath, organization=self.args.organization\n                )\n            except HTTPError as e:\n                print(e)\n                print(ANSI.red(e.response.text))\n                exit(1)\n            print(\"Your file now lives at:\")\n            print(access_url)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ALBERT model configuration \"\"\"\n\nfrom .configuration_utils import PretrainedConfig\n\n\nALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-config.json\",\n    \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-config.json\",\n    \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-config.json\",\n    \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-config.json\",\n    \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-config.json\",\n    \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-config.json\",\n    \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-config.json\",\n    \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-config.json\",\n}\n\n\nclass AlbertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.AlbertModel`.\n        It is used to instantiate an ALBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30000):\n                Vocabulary size of the ALBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.AlbertModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of vocabulary embeddings.\n            hidden_size (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_hidden_groups (:obj:`int`, optional, defaults to 1):\n                Number of groups for the hidden layers, parameters in the same group are shared.\n            num_attention_heads (:obj:`int`, optional, defaults to 64):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 16384):\n                The dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            inner_group_num (:obj:`int`, optional, defaults to 1):\n                The number of inner repetition of attention and ffn.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu_new\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with. 
Typically set this to something\n                large (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.AlbertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for attached classifiers.\n\n        Example::\n\n            from transformers1 import AlbertConfig, AlbertModel\n            # Initializing an ALBERT-xxlarge style configuration\n            albert_xxlarge_configuration = AlbertConfig()\n\n            # Initializing an ALBERT-base style configuration\n            albert_base_configuration = AlbertConfig(\n                hidden_size=768,\n                num_attention_heads=12,\n                intermediate_size=3072,\n            )\n\n            # Initializing a model from the ALBERT-base style configuration\n            model = AlbertModel(albert_xxlarge_configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"albert\"\n\n    def __init__(\n        self,\n        vocab_size=30000,\n        embedding_size=128,\n        hidden_size=4096,\n        num_hidden_layers=12,\n        num_hidden_groups=1,\n        num_attention_heads=64,\n        intermediate_size=16384,\n        inner_group_num=1,\n        hidden_act=\"gelu_new\",\n        hidden_dropout_prob=0,\n        attention_probs_dropout_prob=0,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        classifier_dropout_prob=0.1,\n        pad_token_id=0,\n        bos_token_id=2,\n        eos_token_id=3,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_hidden_groups = num_hidden_groups\n        self.num_attention_heads = num_attention_heads\n        self.inner_group_num = inner_group_num\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.classifier_dropout_prob = classifier_dropout_prob\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Config class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig\nfrom .configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig\nfrom .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig\nfrom .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig\nfrom .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig\nfrom .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig\nfrom .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig\nfrom .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config\nfrom .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig\nfrom .configuration_marian import MarianConfig\nfrom .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig\nfrom .configuration_reformer import ReformerConfig\nfrom .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig\nfrom .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config\nfrom .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig\nfrom .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig\nfrom .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(\n    (key, value)\n    for pretrained_map in [\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        BART_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n        LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ]\n    for key, value, in pretrained_map.items()\n)\n\n\nCONFIG_MAPPING = OrderedDict(\n    [\n        (\"t5\", T5Config,),\n        (\"distilbert\", 
DistilBertConfig,),\n        (\"albert\", AlbertConfig,),\n        (\"camembert\", CamembertConfig,),\n        (\"xlm-roberta\", XLMRobertaConfig,),\n        (\"marian\", MarianConfig,),\n        (\"bart\", BartConfig,),\n        (\"reformer\", ReformerConfig,),\n        (\"longformer\", LongformerConfig,),\n        (\"roberta\", RobertaConfig,),\n        (\"flaubert\", FlaubertConfig,),\n        (\"bert\", BertConfig,),\n        (\"openai-gpt\", OpenAIGPTConfig,),\n        (\"gpt2\", GPT2Config,),\n        (\"transfo-xl\", TransfoXLConfig,),\n        (\"xlnet\", XLNetConfig,),\n        (\"xlm\", XLMConfig,),\n        (\"ctrl\", CTRLConfig,),\n        (\"electra\", ElectraConfig,),\n        (\"encoder-decoder\", EncoderDecoderConfig,),\n    ]\n)\n\n\nclass AutoConfig:\n    r\"\"\"\n        :class:`~transformers1.AutoConfig` is a generic configuration class\n        that will be instantiated as one of the configuration classes of the library\n        when created with the :func:`~transformers1.AutoConfig.from_pretrained` class method.\n\n        The :func:`~transformers1.AutoConfig.from_pretrained` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string.\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoConfig is designed to be instantiated \"\n            \"using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def for_model(cls, model_type: str, *args, **kwargs):\n        if model_type in CONFIG_MAPPING:\n            config_class = CONFIG_MAPPING[model_type]\n            return config_class(*args, **kwargs)\n        raise ValueError(\n            \"Unrecognized model identifier: {}. 
Should contain one of {}\".format(\n                model_type, \", \".join(CONFIG_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiates one of the configuration classes of the library\n        from a pre-trained model configuration.\n\n        The configuration class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Config` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertConfig` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertConfig` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertConfig` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaConfig` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerConfig` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaConfig` (RoBERTa model)\n            - `reformer`: :class:`~transformers1.ReformerConfig` (Reformer model)\n            - `bert`: :class:`~transformers1.BertConfig` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTConfig` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Config` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLConfig` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetConfig` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMConfig` (XLM model)\n            - `ctrl` : :class:`~transformers1.CTRLConfig` (CTRL model)\n            - `flaubert` : :class:`~transformers1.FlaubertConfig` (Flaubert model)\n            - `electra` : :class:`~transformers1.ElectraConfig` (ELECTRA model)\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                Is either: \\\n                    - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.\n                    - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                    - a path to a `directory` containing a configuration file saved using the :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                    - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.\n\n            cache_dir (:obj:`string`, optional, defaults to `None`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download (:obj:`boolean`, optional, defaults to `False`):\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download (:obj:`boolean`, optional, defaults to `False`):\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n\n            proxies (:obj:`Dict[str, str]`, optional, defaults to `None`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`.\n                The proxies are used on each request. See `the requests documentation <https://requests.readthedocs.io/en/master/user/advanced/#proxies>`__ for usage.\n\n            return_unused_kwargs (:obj:`boolean`, optional, defaults to `False`):\n                - If False, then this function returns just the final configuration object.\n                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.\n\n            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): key/value pairs with which to update the configuration object after loading.\n                - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n\n        Examples::\n\n            config = AutoConfig.from_pretrained('bert-base-uncased')  # Download configuration from S3 and cache.\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')\n            config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)\n\n        if \"model_type\" in config_dict:\n            config_class = CONFIG_MAPPING[config_dict[\"model_type\"]]\n            return config_class.from_dict(config_dict, **kwargs)\n        else:\n            # Fallback: use pattern matching on the string.\n            for pattern, config_class in CONFIG_MAPPING.items():\n                if pattern in pretrained_model_name_or_path:\n                    return config_class.from_dict(config_dict, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized model in {}. \"\n            \"Should have a `model_type` key in its config.json, or contain one of the following strings \"\n            \"in its name: {}\".format(pretrained_model_name_or_path, \", \".join(CONFIG_MAPPING.keys()))\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BART configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"facebook/bart-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json\",\n    \"facebook/bart-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json\",\n    \"facebook/bart-large-cnn\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json\",\n    \"facebook/bart-large-xsum\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/config.json\",\n    \"facebook/mbart-large-en-ro\": \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/config.json\",\n}\n\n\nclass BartConfig(PretrainedConfig):\n    r\"\"\"\n        Configuration class for Bart. Parameters are renamed from the fairseq implementation\n    \"\"\"\n    model_type = \"bart\"\n\n    def __init__(\n        self,\n        activation_dropout=0.0,\n        activation_function=\"gelu\",\n        vocab_size=50265,\n        d_model=1024,\n        encoder_ffn_dim=4096,\n        encoder_layers=12,\n        encoder_attention_heads=16,\n        decoder_ffn_dim=4096,\n        decoder_layers=12,\n        decoder_attention_heads=16,\n        encoder_layerdrop=0.0,\n        decoder_layerdrop=0.0,\n        attention_dropout=0.0,\n        dropout=0.1,\n        max_position_embeddings=1024,\n        init_std=0.02,\n        classifier_dropout=0.0,\n        num_labels=3,\n        is_encoder_decoder=True,\n        pad_token_id=1,\n        bos_token_id=0,\n        eos_token_id=2,\n        normalize_before=False,\n        add_final_layer_norm=False,\n        scale_embedding=False,\n        normalize_embedding=True,\n        static_position_embeddings=False,\n        add_bias_logits=False,\n        **common_kwargs\n    ):\n        r\"\"\"\n            :class:`~transformers1.BartConfig` is the configuration class for `BartModel`.\n            Examples:\n                config = BartConfig.from_pretrained('bart-large')\n                model = BartModel(config)\n        \"\"\"\n        if \"hidden_size\" in common_kwargs:\n            raise ValueError(\"hidden size is called d_model\")\n        super().__init__(\n            num_labels=num_labels,\n            pad_token_id=pad_token_id,\n            bos_token_id=bos_token_id,\n            eos_token_id=eos_token_id,\n            is_encoder_decoder=is_encoder_decoder,\n            **common_kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim\n        self.encoder_ffn_dim = encoder_ffn_dim\n        self.encoder_layers = self.num_hidden_layers = encoder_layers\n        self.encoder_attention_heads = 
encoder_attention_heads\n        self.encoder_layerdrop = encoder_layerdrop\n        self.decoder_layerdrop = decoder_layerdrop\n        self.decoder_ffn_dim = decoder_ffn_dim\n        self.decoder_layers = decoder_layers\n        self.decoder_attention_heads = decoder_attention_heads\n        self.max_position_embeddings = max_position_embeddings\n        self.init_std = init_std  # Normal(0, this parameter)\n        self.activation_function = activation_function\n\n        # Params introduced for Mbart\n        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True\n        self.normalize_embedding = normalize_embedding  # True for mbart, False otherwise\n        self.normalize_before = normalize_before  # combo of fairseq's encoder_ and decoder_normalize_before\n        self.add_final_layer_norm = add_final_layer_norm\n\n        # Params introduced for Marian\n        self.add_bias_logits = add_bias_logits\n        self.static_position_embeddings = static_position_embeddings\n\n        # 3 Types of Dropout\n        self.attention_dropout = attention_dropout\n        self.activation_dropout = activation_dropout\n        self.dropout = dropout\n\n        # Classifier stuff\n        self.classif_dropout = classifier_dropout\n\n    @property\n    def num_attention_heads(self) -> int:\n        return self.encoder_attention_heads\n\n    @property\n    def hidden_size(self) -> int:\n        return self.d_model\n\n    def is_valid_mbart(self) -> bool:\n        \"\"\"Is the configuration aligned with the MBART paper.\"\"\"\n        if self.normalize_before and self.add_final_layer_norm and self.scale_embedding:\n            return True\n        if self.normalize_before or self.add_final_layer_norm or self.scale_embedding:\n            logger.info(\"This configuration is a mixture of MBART and BART settings\")\n        return False\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" BERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json\",\n    \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json\",\n    \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json\",\n    \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json\",\n    \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json\",\n    \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json\",\n    \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json\",\n    \"bert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json\",\n    \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json\",\n    \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json\",\n    \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json\",\n    \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json\",\n    \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json\",\n    \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/config.json\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json\",\n    \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/config.json\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 
\"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/config.json\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json\",\n    \"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n}\n\n\nclass BertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.BertModel`.\n        It is used to instantiate an BERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the BERT model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            hidden_size (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 3072):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.BertModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n       
     layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import BertModel, BertConfig\n\n            # Initializing a BERT bert-base-uncased style configuration\n            configuration = BertConfig()\n\n            # Initializing a model from the bert-base-uncased style configuration\n            model = BertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"bert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        hidden_size=768,\n        num_hidden_layers=12,\n        num_attention_heads=12,\n        intermediate_size=3072,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" CamemBERT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-config.json\",\n    \"umberto-commoncrawl-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-commoncrawl-cased-v1/config.json\",\n    \"umberto-wikipedia-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-wikipedia-uncased-v1/config.json\",\n}\n\n\nclass CamembertConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"camembert\"\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Salesforce CTRL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\"ctrl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-config.json\"}\n\n\nclass CTRLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.CTRLModel`.\n        It is used to instantiate an CTRL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 246534):\n                Vocabulary size of the CTRL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 256):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 1280):\n                Dimensionality of the embeddings and hidden states.\n            dff (:obj:`int`, optional, defaults to 8192):\n                Dimensionality of the inner dimension of the FFN.\n            n_layer (:obj:`int`, optional, defaults to 48):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n\n        Example::\n\n            from transformers1 import CTRLModel, CTRLConfig\n\n            # Initializing a CTRL configuration\n            configuration = CTRLConfig()\n\n            # Initializing a model from the configuration\n            model = CTRLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"ctrl\"\n\n    def __init__(\n        self,\n        vocab_size=246534,\n        n_positions=256,\n        n_ctx=256,\n        n_embd=1280,\n        dff=8192,\n        n_layer=48,\n        n_head=16,\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.dff = dff\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = 
summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" DistilBERT model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nDISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json\",\n    \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json\",\n    \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json\",\n    \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-config.json\",\n    \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json\",\n    \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-config.json\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-finetuned-sst-2-english-config.json\",\n}\n\n\nclass DistilBertConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.DistilBertModel`.\n        It is used to instantiate a DistilBERT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the DistilBERT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.BertModel`.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings.\n            n_layers (:obj:`int`, optional, defaults to 6):\n                Number of hidden layers in the Transformer encoder.\n            n_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dim (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_dim (:obj:`int`, optional, defaults to 3072):\n                The size of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            activation (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            qa_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilities used in the question answering model\n                :class:`~transformers1.DistilBertForQuestionAnswering`.\n            seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):\n                The dropout probabilities used in the sequence classification model\n                :class:`~transformers1.DistilBertForSequenceClassification`.\n\n        Example::\n\n            from transformers1 import DistilBertModel, DistilBertConfig\n\n            # Initializing a DistilBERT configuration\n            configuration = DistilBertConfig()\n\n            # Initializing a model from the configuration\n            model = DistilBertModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"distilbert\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        max_position_embeddings=512,\n        sinusoidal_pos_embds=False,\n        n_layers=6,\n        n_heads=12,\n        dim=768,\n        hidden_dim=4 * 768,\n        dropout=0.1,\n        attention_dropout=0.1,\n        activation=\"gelu\",\n        initializer_range=0.02,\n        qa_dropout=0.1,\n        seq_classif_dropout=0.2,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(**kwargs, pad_token_id=pad_token_id)\n        self.vocab_size = vocab_size\n        self.max_position_embeddings = max_position_embeddings\n        self.sinusoidal_pos_embds = sinusoidal_pos_embds\n        
self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dim = dim\n        self.hidden_dim = hidden_dim\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.activation = activation\n        self.initializer_range = initializer_range\n        self.qa_dropout = qa_dropout\n        self.seq_classif_dropout = seq_classif_dropout\n\n    @property\n    def hidden_size(self):\n        return self.dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_electra.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" ELECTRA model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/config.json\",\n    \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/config.json\",\n    \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/config.json\",\n    \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/config.json\",\n    \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/config.json\",\n    \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/config.json\",\n}\n\n\nclass ElectraConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ElectraModel`.\n        It is used to instantiate an ELECTRA model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30522):\n                Vocabulary size of the ELECTRA model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ElectraModel`.\n            embedding_size (:obj:`int`, optional, defaults to 128):\n                Dimensionality of the encoder layers and the pooler layer.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the encoder layers and the pooler layer.\n            num_hidden_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            num_attention_heads (:obj:`int`, optional, defaults to 4):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            intermediate_size (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            type_vocab_size (:obj:`int`, optional, defaults to 2):\n                The vocabulary size of the `token_type_ids` passed into :class:`~transformers1.ElectraModel`.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n\n        Example::\n\n            from transformers1 import ElectraModel, ElectraConfig\n\n            # Initializing a ELECTRA electra-base-uncased style configuration\n            configuration = ElectraConfig()\n\n            # Initializing a model from the electra-base-uncased style configuration\n            model = ElectraModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"electra\"\n\n    def __init__(\n        self,\n        vocab_size=30522,\n        embedding_size=128,\n        hidden_size=256,\n        num_hidden_layers=12,\n        num_attention_heads=4,\n        intermediate_size=1024,\n        hidden_act=\"gelu\",\n        hidden_dropout_prob=0.1,\n        attention_probs_dropout_prob=0.1,\n        max_position_embeddings=512,\n        type_vocab_size=2,\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        pad_token_id=0,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.embedding_size = embedding_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = 
num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.attention_probs_dropout_prob = attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.type_vocab_size = type_vocab_size\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport copy\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderConfig(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.\n\n        It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig`\n        and can be used to control the model outputs.\n        See the documentation for :class:`~transformers1.PretrainedConfig` for more information.\n\n        Args:\n            kwargs (`optional`):\n                Remaining dictionary of keyword arguments. Notably:\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the encoder config.\n                    encoder (:class:`PretrainedConfig`, optional, defaults to `None`):\n                        An instance of a configuration object that defines the decoder config.\n\n        Example::\n\n            from transformers1 import BertConfig, EncoderDecoderConfig, EncoderDecoderModel\n\n            # Initializing a BERT bert-base-uncased style configuration\n            config_encoder = BertConfig()\n            config_decoder = BertConfig()\n\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)\n\n            # Initializing a Bert2Bert model from the bert-base-uncased style configurations\n            model = EncoderDecoderModel(config=config)\n\n            # Accessing the model configuration\n            config_encoder = model.config.encoder\n            config_decoder  = model.config.decoder\n    \"\"\"\n    model_type = \"encoder_decoder\"\n\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        assert (\n            \"encoder\" in kwargs and \"decoder\" in kwargs\n        ), \"Config has to be initialized with encoder and decoder config\"\n        encoder_config = kwargs.pop(\"encoder\")\n        encoder_model_type = encoder_config.pop(\"model_type\")\n        decoder_config = kwargs.pop(\"decoder\")\n        decoder_model_type = decoder_config.pop(\"model_type\")\n\n        from transformers import AutoConfig\n\n        self.encoder = AutoConfig.for_model(encoder_model_type, **encoder_config)\n        self.decoder = AutoConfig.for_model(decoder_model_type, **decoder_config)\n        self.is_encoder_decoder = True\n\n    @classmethod\n    def from_encoder_decoder_configs(\n        cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig\n    ) -> PretrainedConfig:\n        r\"\"\"\n        
Instantiate a :class:`~transformers1.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.\n\n        Returns:\n            :class:`EncoderDecoderConfig`: An instance of a configuration object\n        \"\"\"\n        return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict())\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary. Override the default `to_dict()` from `PretrainedConfig`.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        output[\"encoder\"] = self.encoder.to_dict()\n        output[\"decoder\"] = self.decoder.to_dict()\n        output[\"model_type\"] = self.__class__.model_type\n        return output\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Flaubert configuration, based on XLM. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm import XLMConfig\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/config.json\",\n    \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/config.json\",\n    \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/config.json\",\n    \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/config.json\",\n}\n\n\nclass FlaubertConfig(XLMConfig):\n    \"\"\"\n        Configuration class to store the configuration of a `FlaubertModel`.\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Whether to apply the layer normalization before or after the feed forward layer following the\n                attention in each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)\n            layerdrop (:obj:`float`, `optional`, defaults to 0.0):\n                Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand\n                with Structured Dropout. ICLR 2020)\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the Flaubert model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.FlaubertModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                if a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`int`, optional, defaults to 50257):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n    \"\"\"\n\n    model_type = \"flaubert\"\n\n    def __init__(self, layerdrop=0.0, pre_norm=False, pad_token_id=2, bos_token_id=0, **kwargs):\n        \"\"\"Constructs FlaubertConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.layerdrop = layerdrop\n        self.pre_norm = pre_norm\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT-2 configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json\",\n    \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json\",\n    \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json\",\n    \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-config.json\",\n    \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json\",\n}\n\n\nclass GPT2Config(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.GPT2Model`.\n        It is used to instantiate an GPT-2 model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 50257):\n                Vocabulary size of the GPT-2 model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.GPT2Model`.\n            n_positions (:obj:`int`, optional, defaults to 1024):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            activation_function (:obj:`str`, optional, defaults to 'gelu'):\n                Activation function selected in the list [\"relu\", \"swish\", \"gelu\", \"tanh\", \"gelu_new\"].\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 16):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.GPT2DoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import GPT2Model, GPT2Config\n\n            # Initializing a GPT2 configuration\n            configuration = GPT2Config()\n\n            # Initializing a model from the configuration\n            model = GPT2Model(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"gpt2\"\n\n    def __init__(\n        self,\n        vocab_size=50257,\n        n_positions=1024,\n        n_ctx=1024,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        activation_function=\"gelu_new\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        bos_token_id=50256,\n        eos_token_id=50256,\n        **kwargs\n    ):\n        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.activation_function = activation_function\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n        self.bos_token_id = bos_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Longformer configuration \"\"\"\n\nimport logging\nfrom typing import List, Union\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"allenai/longformer-base-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/config.json\",\n    \"allenai/longformer-large-4096\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/config.json\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-finetuned-triviaqa/config.json\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096-extra.pos.embd.only/config.json\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": \"https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096-extra.pos.embd.only/config.json\",\n}\n\n\nclass LongformerConfig(RobertaConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.LongformerModel`.\n        It is used to instantiate an Longformer model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length 4,096.\n\n        The :class:`~transformers1.LongformerConfig` class directly inherits :class:`~transformers1.RobertaConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Args:\n            attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):\n                Size of an attention window around each token. If :obj:`int`, use the same size for all layers.\n                To specify a different window size for each layer, use a :obj:`List[int]` where\n                ``len(attention_window) == num_hidden_layers``.\n\n        Example::\n\n            from transformers1 import LongformerConfig, LongformerModel\n\n            # Initializing a Longformer configuration\n            configuration = LongformerConfig()\n\n            # Initializing a model from the configuration\n            model = LongformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"longformer\"\n\n    def __init__(self, attention_window: Union[List[int], int] = 512, sep_token_id: int = 2, **kwargs):\n        super().__init__(**kwargs)\n        self.attention_window = attention_window\n        self.sep_token_id = sep_token_id\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 The OPUS-NMT Team, Marian team, and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Marian model configuration \"\"\"\n\nfrom .configuration_bart import BartConfig\n\n\nPRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"Helsinki-NLP/opus-mt-en-de\": \"https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/config.json\",\n}\n\n\nclass MarianConfig(BartConfig):\n    model_type = \"marian\"\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" MMBT configuration \"\"\"\n\n\nimport logging\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass MMBTConfig(object):\n    \"\"\"Configuration class to store the configuration of a `MMBT Model`.\n\n    Args:\n        config (:obj:`~transformers1.PreTrainedConfig`):\n            Config of the underlying Transformer models. Its values are\n            copied over to use a single config.\n        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):\n            Size of final Linear layer for classification.\n        modal_hidden_size (:obj:`int`, optional, defautls to 2048):\n            Embedding dimension of the non-text modality encoder.\n    \"\"\"\n\n    def __init__(self, config, num_labels=None, modal_hidden_size=2048):\n        self.__dict__ = config.__dict__\n        self.modal_hidden_size = modal_hidden_size\n        if num_labels:\n            self.num_labels = num_labels\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" OpenAI GPT configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json\"\n}\n\n\nclass OpenAIGPTConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.OpenAIGPTModel`.\n        It is used to instantiate an GPT model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 40478):\n                Vocabulary size of the GPT model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.CTRLModel`.\n            n_positions (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            n_ctx (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the causal mask (usually same as n_positions).\n            n_embd (:obj:`int`, optional, defaults to 768):\n                Dimensionality of the embeddings and hidden states.\n            n_layer (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            afn (:obj:`str` or :obj:`function`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the encoder and pooler.\n                If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            resid_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            embd_pdrop (:obj:`int`, optional, defaults to 0.1):\n                The dropout ratio for the embeddings.\n            attn_pdrop (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention.\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether special tokens should be predicted when the model is has a language modeling head.\n            summary_type (:obj:`string`, optional, defaults to \"cls_index\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.OpenAIGPTDoubleHeadsModel`.\n                Add a dropout before the projection and activation\n\n        Example::\n\n            from transformers1 import OpenAIGPTConfig, OpenAIGPTModel\n\n            # Initializing a GPT configuration\n            configuration = OpenAIGPTConfig()\n\n            # Initializing a model from the configuration\n            model = OpenAIGPTModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"openai-gpt\"\n\n    def __init__(\n        self,\n        vocab_size=40478,\n        n_positions=512,\n        n_ctx=512,\n        n_embd=768,\n        n_layer=12,\n        n_head=12,\n        afn=\"gelu\",\n        resid_pdrop=0.1,\n        embd_pdrop=0.1,\n        attn_pdrop=0.1,\n        layer_norm_epsilon=1e-5,\n        initializer_range=0.02,\n        predict_special_tokens=True,\n        summary_type=\"cls_index\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.n_ctx = n_ctx\n        self.n_positions = n_positions\n        self.n_embd = n_embd\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.afn = afn\n        self.resid_pdrop = resid_pdrop\n        self.embd_pdrop = embd_pdrop\n        self.attn_pdrop = attn_pdrop\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_range = initializer_range\n        self.predict_special_tokens = predict_special_tokens\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_first_dropout = summary_first_dropout\n        self.summary_proj_to_labels = summary_proj_to_labels\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.n_embd\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Reformer model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json\",\n    \"google/reformer-enwik8\": \"https://cdn.huggingface.co/google/reformer-enwik8/config.json\",\n}\n\n\nclass ReformerConfig(PretrainedConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.ReformerModel`.\n        It is used to instantiate an Reformer model according to the specified arguments, defining the model\n        architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            attention_head_size (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the projected key, query and value vectors\n            attn_layers (:obj:`list(str)`, optional, defaults to [\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"]):\n                List of attention layer types in ascending order. It can be chosen between a\n                LSHSelfAttention layer (\"lsh\") and a LocalSelfAttention layer (\"local\").\n                For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .\n                For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .\n            axial_pos_embds (:obj:`bool`, optional, defaults to True):\n                If `True` use axial position embeddings. 
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__\n            axial_norm_std (:obj:`float`, optional, defaluts to 1.0):\n                The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.\n            axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):\n                The position dims of the axial position encodings.\n                During training the product of the position dims has to equal the sequence length.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):\n                The embedding dims of the axial position encodings.\n                The sum of the embedding dims has to equal the hidden size.\n                For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.\n            chunk_size_lm_head (:obj:`int`, optional, defaults to 0):\n                The chunk size of the final language model feed forward head layer.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            chunk_size_feed_forward (:obj:`int`, optional, defaults to 0):\n                The chunk size of all feed forward layers in the residual attention blocks.\n                A chunk size of 0 means that the feed forward layer is not chunked.\n                A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.\n                For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .\n            eos_token_id (:obj:`int`, optional, defaults to 2):\n                The token id for the <EOS> token.\n            feed_forward_size (:obj:`int`, optional, defaults to 512):\n                Dimensionality of the \"feed_forward\" (i.e., feed-forward) layer in the residual attention block.\n            hash_seed (:obj:`int`, optional, defaults to `None`):\n                Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposed. 
For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in local sensitive hashing scheme.\n            hidden_act (:obj:`str` or :obj:`function`, optional, defaults to \"relu\"):\n                The non-linear activation function (function or string) in the feed forward layer in the residual attention block.\n                If string, \"gelu\", \"relu\", \"swish\", \"gelu_new\" and \"gelu_fast\" are supported.\n            hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            hidden_size (:obj:`int`, optional, defaults to 256):\n                Dimensionality of the output hidden states of the residual attention blocks.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            is_decoder (:obj:`bool`, optional, defaults to False):\n                If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.\n                When using the Reformer for causal language modeling, `is_decoder` is set to `True`.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            local_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            local_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LocalSelfAttention layer to itself.\n            local_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.\n            local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LocalSelfAttention.\n            lsh_chunk_length (:obj:`int`, optional, defaults to 64):\n                Length of chunk which attends to itself in LSHSelfAttention. 
Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).\n            lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):\n                Number of previous neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):\n                Number of following neighbouring chunks to attend to in LSHSelfAttention layer to itself.\n            lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):\n                The dropout ratio for the attention probabilities in LSHSelfAttention.\n            max_position_embeddings (:obj:`int`, optional, defaults to 4096):\n                The maximum sequence length that this model might ever be used with.\n                Typically set this to something large just in case (e.g., 512 or 1024 or 2048).\n            num_attention_heads (:obj:`int`, optional, defaults to 12):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):\n                Number of buckets, the key query vectors can be \"hashed into\" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.\n                The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.\n                The number of buckets (or the product the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.\n            num_hashes (:obj:`int`, optional, defaults to 1):\n                Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.\n                The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.\n            pad_token_id (:obj:`int`, optional, defaults to 0):\n                The token id for the <PAD> token.\n            vocab_size (:obj:`int`, optional, defaults to 320):\n                Vocabulary size of the Reformer model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.ReformerModel`.\n\n        Example::\n\n            from transformers1 import ReformerModel, ReformerConfig\n\n            # Initializing a Reformer configuration\n            configuration = ReformerConfig()\n\n            # Initializing a Reformer model\n            model = ReformerModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"reformer\"\n\n    def __init__(\n        self,\n        attention_head_size=64,\n        attn_layers=[\"local\", \"lsh\", \"local\", \"lsh\", \"local\", \"lsh\"],\n        axial_norm_std=1.0,\n        axial_pos_embds=True,\n        axial_pos_shape=[64, 64],\n        axial_pos_embds_dim=[64, 192],\n        chunk_size_lm_head=0,\n        chunk_size_feed_forward=0,\n        eos_token_id=2,\n        feed_forward_size=512,\n        hash_seed=None,\n        hidden_act=\"relu\",\n        hidden_dropout_prob=0.05,\n        hidden_size=256,\n        initializer_range=0.02,\n        is_decoder=False,\n        layer_norm_eps=1e-12,\n        local_num_chunks_before=1,\n        local_num_chunks_after=0,\n        local_attention_probs_dropout_prob=0.05,\n        local_attn_chunk_length=64,\n        lsh_attn_chunk_length=64,\n        lsh_attention_probs_dropout_prob=0.0,\n        lsh_num_chunks_before=1,\n        lsh_num_chunks_after=0,\n        max_position_embeddings=4096,\n        num_attention_heads=2,\n        num_buckets=None,\n        num_hashes=1,\n        pad_token_id=0,\n        vocab_size=320,\n        **kwargs\n    ):\n        super().__init__(pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_decoder=is_decoder, **kwargs)\n\n        self.hash_seed = hash_seed\n        self.vocab_size = vocab_size\n        self.attention_head_size = attention_head_size\n        self.hidden_size = hidden_size\n        self.num_attention_heads = num_attention_heads\n        self.num_hashes = num_hashes\n        self.num_hidden_layers = len(attn_layers)\n        self.num_buckets = tuple(num_buckets) if isinstance(num_buckets, list) else num_buckets\n        self.lsh_attn_chunk_length = lsh_attn_chunk_length\n        self.local_attn_chunk_length = local_attn_chunk_length\n        self.lsh_num_chunks_after = lsh_num_chunks_after\n        self.lsh_num_chunks_before = lsh_num_chunks_before\n        self.local_num_chunks_after = local_num_chunks_after\n        self.local_num_chunks_before = local_num_chunks_before\n        self.hidden_act = hidden_act\n        self.feed_forward_size = feed_forward_size\n        self.hidden_dropout_prob = hidden_dropout_prob\n        self.lsh_attention_probs_dropout_prob = lsh_attention_probs_dropout_prob\n        self.local_attention_probs_dropout_prob = local_attention_probs_dropout_prob\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.axial_pos_embds = axial_pos_embds\n        self.axial_pos_shape = tuple(axial_pos_shape)\n        self.axial_pos_embds_dim = tuple(axial_pos_embds_dim)\n        self.axial_norm_std = axial_norm_std\n        self.chunk_size_lm_head = chunk_size_lm_head\n        self.chunk_size_feed_forward = chunk_size_feed_forward\n        self.attn_layers = attn_layers\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_bert import BertConfig\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json\",\n    \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json\",\n    \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json\",\n    \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json\",\n    \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json\",\n    \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json\",\n}\n\n\nclass RobertaConfig(BertConfig):\n    r\"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.RobertaModel`.\n        It is used to instantiate an RoBERTa model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        The :class:`~transformers1.RobertaConfig` class directly inherits :class:`~transformers1.BertConfig`.\n        It reuses the same defaults. Please check the parent class for more information.\n\n        Example::\n\n            from transformers1 import RobertaConfig, RobertaModel\n\n            # Initializing a RoBERTa configuration\n            configuration = RobertaConfig()\n\n            # Initializing a model from the configuration\n            model = RobertaModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n    model_type = \"roberta\"\n\n    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):\n        \"\"\"Constructs RobertaConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_t5.py",
    "content": "# coding=utf-8\n# Copyright 2010, The T5 Authors and HuggingFace Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" T5 model configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nT5_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-config.json\",\n    \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-config.json\",\n    \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-config.json\",\n    \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-config.json\",\n    \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-config.json\",\n}\n\n\nclass T5Config(PretrainedConfig):\n    r\"\"\"\n        :class:`~transformers1.T5Config` is the configuration class to store the configuration of a\n        `T5Model`.\n\n\n        Arguments:\n            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.\n            d_model: Size of the encoder layers and the pooler layer. `d_model` can also accesed via the property `hidden_size`.\n            num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.\n            num_heads: Number of attention heads for each attention layer in\n                the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.\n            intermediate_size: The size of the \"intermediate\" (i.e., feed-forward)\n                layer in the Transformer encoder.\n            hidden_act: The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\", \"swish\" and \"gelu_new\" are supported.\n            hidden_dropout_prob: The dropout probabilitiy for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_probs_dropout_prob: The dropout ratio for the attention\n                probabilities.\n            n_positions: The maximum sequence length that this model might\n                ever be used with. Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048). 
`n_positions` can also be accessed via the property `max_position_embeddings`.\n            type_vocab_size: The vocabulary size of the `token_type_ids` passed into\n                `T5Model`.\n            initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).\n            layer_norm_eps: The epsilon used by LayerNorm.\n    \"\"\"\n    model_type = \"t5\"\n\n    def __init__(\n        self,\n        vocab_size=32128,\n        n_positions=512,\n        d_model=512,\n        d_kv=64,\n        d_ff=2048,\n        num_layers=6,\n        num_heads=8,\n        relative_attention_num_buckets=32,\n        dropout_rate=0.1,\n        layer_norm_epsilon=1e-6,\n        initializer_factor=1.0,\n        is_encoder_decoder=True,\n        pad_token_id=0,\n        eos_token_id=1,\n        **kwargs\n    ):\n        super().__init__(\n            pad_token_id=pad_token_id, eos_token_id=eos_token_id, is_encoder_decoder=is_encoder_decoder, **kwargs,\n        )\n        self.vocab_size = vocab_size\n        self.n_positions = n_positions\n        self.d_model = d_model\n        self.d_kv = d_kv\n        self.d_ff = d_ff\n        self.num_layers = num_layers\n        self.num_heads = num_heads\n        self.relative_attention_num_buckets = relative_attention_num_buckets\n        self.dropout_rate = dropout_rate\n        self.layer_norm_epsilon = layer_norm_epsilon\n        self.initializer_factor = initializer_factor\n\n    @property\n    def max_position_embeddings(self):\n        return self.n_positions\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.num_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.num_layers\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Transformer XL configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json\",\n}\n\n\nclass TransfoXLConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.TransfoXLModel`.\n        It is used to instantiate a Transformer XL model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 267735):\n                Vocabulary size of the Transformer XL model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.TransfoXLModel`.\n            cutoffs (:obj:`List[int]`, optional, defaults to :obj:`[20000, 40000, 200000]`):\n                Cutoffs for the adaptive softmax\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the model's hidden states.\n            d_embed (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the embeddings\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_head (:obj:`int`, optional, defaults to 64):\n                Dimensionality of the model's heads.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Inner dimension in FF\n            div_val (:obj:`int`, optional, defaults to 4):\n                Divident value for adapative input and softmax\n            pre_lnorm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Apply LayerNorm to the input instead of the output\n            n_layer (:obj:`int`, optional, defaults to 18):\n                Number of hidden layers in the Transformer encoder.\n            tgt_len (:obj:`int`, optional, defaults to 128):\n                Number of tokens to predict\n            ext_len (:obj:`int`, optional, defaults to 0):\n                Length of the extended context\n            mem_len (:obj:`int`, optional, defaults to 1600):\n                Length of the retained previous heads\n            clamp_len (:obj:`int`, optional, defaults to 1000):\n                use the same pos embeddings after clamp_len\n            same_length (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Use the same attn length for all tokens\n            proj_share_all_but_first (:obj:`boolean`, optional, defaults to :obj:`True`):\n                True to share all but first projs, False not to share.\n            attn_type (:obj:`int`, optional, defaults to 0):\n                Attention type. 
0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.\n            sample_softmax (:obj:`int`, optional, defaults to -1):\n                number of samples in sampled softmax\n            adaptive (:obj:`boolean`, optional, defaults to :obj:`True`):\n                use adaptive softmax\n            tie_weight (:obj:`boolean`, optional, defaults to :obj:`True`):\n                tie the word embedding and softmax weights\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.\n            dropatt (:obj:`float`, optional, defaults to 0):\n                The dropout ratio for the attention probabilities.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            init (:obj:`string`, optional, defaults to `normal`):\n                Parameter initializer to use\n            init_range (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by U(-init_range, init_range).\n            proj_init_std (:obj:`float`, optional, defaults to 0.01):\n                Parameters initialized by N(0, init_std)\n            init_std (:obj:`float`, optional, defaults to 0.02):\n                Parameters initialized by N(0, init_std)\n            layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):\n                The epsilon to use in the layer normalization layers\n\n        Example::\n\n            from transformers1 import TransfoXLConfig, TransfoXLModel\n\n            # Initializing a Transformer XL configuration\n            configuration = TransfoXLConfig()\n\n            # Initializing a model from the configuration\n            model = TransfoXLModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"transfo-xl\"\n\n    def __init__(\n        self,\n        vocab_size=267735,\n        cutoffs=[20000, 40000, 200000],\n        d_model=1024,\n        d_embed=1024,\n        n_head=16,\n        d_head=64,\n        d_inner=4096,\n        div_val=4,\n        pre_lnorm=False,\n        n_layer=18,\n        tgt_len=128,\n        ext_len=0,\n        mem_len=1600,\n        clamp_len=1000,\n        same_length=True,\n        proj_share_all_but_first=True,\n        attn_type=0,\n        sample_softmax=-1,\n        adaptive=True,\n        tie_weight=True,\n        dropout=0.1,\n        dropatt=0.0,\n        untie_r=True,\n        init=\"normal\",\n        init_range=0.01,\n        proj_init_std=0.01,\n        init_std=0.02,\n        layer_norm_epsilon=1e-5,\n        eos_token_id=0,\n        **kwargs\n    ):\n        super().__init__(eos_token_id=eos_token_id, **kwargs)\n\n        self.vocab_size = vocab_size\n        self.cutoffs = []\n        self.cutoffs.extend(cutoffs)\n        self.tie_weight = tie_weight\n        if proj_share_all_but_first:\n            self.tie_projs = [False] + [True] * len(self.cutoffs)\n        else:\n            self.tie_projs = [False] + [False] * len(self.cutoffs)\n        self.d_model = d_model\n        self.d_embed = d_embed\n        self.d_head = d_head\n        self.d_inner = d_inner\n        self.div_val = div_val\n        self.pre_lnorm = pre_lnorm\n        self.n_layer = n_layer\n        self.n_head = n_head\n        self.tgt_len = tgt_len\n        self.ext_len = ext_len\n        self.mem_len = mem_len\n        self.same_length 
= same_length\n        self.attn_type = attn_type\n        self.clamp_len = clamp_len\n        self.sample_softmax = sample_softmax\n        self.adaptive = adaptive\n        self.dropout = dropout\n        self.dropatt = dropatt\n        self.untie_r = untie_r\n        self.init = init\n        self.init_range = init_range\n        self.proj_init_std = proj_init_std\n        self.init_std = init_std\n        self.layer_norm_epsilon = layer_norm_epsilon\n\n    @property\n    def max_position_embeddings(self):\n        return self.tgt_len + self.ext_len + self.mem_len\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\nfrom typing import Dict, Tuple\n\nfrom .file_utils import CONFIG_NAME, cached_path, hf_bucket_url, is_remote_url\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass PretrainedConfig(object):\n    r\"\"\" Base class for all configuration classes.\n        Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.\n\n        Note:\n            A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does **not** load the model weights.\n            It only affects the model's configuration.\n\n        Class attributes (overridden by derived classes):\n            - ``model_type``: a string that identifies the model type, that we serialize into the JSON file, and that we use to recreate the correct object in :class:`~transformers1.AutoConfig`.\n\n        Args:\n            finetuning_task (:obj:`string` or :obj:`None`, `optional`, defaults to :obj:`None`):\n                Name of the task used to fine-tune the model. 
This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.\n            num_labels (:obj:`int`, `optional`, defaults to `2`):\n                Number of classes to use when the model is a classification model (sequences/tokens)\n            output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Should the model returns attentions weights.\n            output_hidden_states (:obj:`string`, `optional`, defaults to :obj:`False`):\n                Should the model returns all hidden-states.\n            torchscript (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Is the model used with Torchscript (for PyTorch models).\n    \"\"\"\n    model_type: str = \"\"\n\n    def __init__(self, **kwargs):\n        # Attributes with defaults\n        self.output_attentions = kwargs.pop(\"output_attentions\", False)\n        self.output_hidden_states = kwargs.pop(\"output_hidden_states\", False)\n        self.use_cache = kwargs.pop(\"use_cache\", True)  # Not used by all models\n        self.torchscript = kwargs.pop(\"torchscript\", False)  # Only used by PyTorch models\n        self.use_bfloat16 = kwargs.pop(\"use_bfloat16\", False)\n        self.pruned_heads = kwargs.pop(\"pruned_heads\", {})\n\n        # Is decoder is used in encoder-decoder models to differentiate encoder from decoder\n        self.is_encoder_decoder = kwargs.pop(\"is_encoder_decoder\", False)\n        self.is_decoder = kwargs.pop(\"is_decoder\", False)\n\n        # Parameters for sequence generation\n        self.max_length = kwargs.pop(\"max_length\", 20)\n        self.min_length = kwargs.pop(\"min_length\", 0)\n        self.do_sample = kwargs.pop(\"do_sample\", False)\n        self.early_stopping = kwargs.pop(\"early_stopping\", False)\n        self.num_beams = kwargs.pop(\"num_beams\", 1)\n        self.temperature = kwargs.pop(\"temperature\", 1.0)\n        self.top_k = kwargs.pop(\"top_k\", 50)\n        self.top_p = kwargs.pop(\"top_p\", 1.0)\n        self.repetition_penalty = kwargs.pop(\"repetition_penalty\", 1.0)\n        self.length_penalty = kwargs.pop(\"length_penalty\", 1.0)\n        self.no_repeat_ngram_size = kwargs.pop(\"no_repeat_ngram_size\", 0)\n        self.bad_words_ids = kwargs.pop(\"bad_words_ids\", None)\n        self.num_return_sequences = kwargs.pop(\"num_return_sequences\", 1)\n\n        # Fine-tuning task arguments\n        self.architectures = kwargs.pop(\"architectures\", None)\n        self.finetuning_task = kwargs.pop(\"finetuning_task\", None)\n        self.id2label = kwargs.pop(\"id2label\", None)\n        self.label2id = kwargs.pop(\"label2id\", None)\n        if self.id2label is not None:\n            kwargs.pop(\"num_labels\", None)\n            self.id2label = dict((int(key), value) for key, value in self.id2label.items())\n            # Keys are always strings in JSON so convert ids to int here.\n        else:\n            self.num_labels = kwargs.pop(\"num_labels\", 2)\n\n        # Tokenizer arguments TODO: eventually tokenizer and models should share the same config\n        self.prefix = kwargs.pop(\"prefix\", None)\n        self.bos_token_id = kwargs.pop(\"bos_token_id\", None)\n        self.pad_token_id = kwargs.pop(\"pad_token_id\", None)\n        self.eos_token_id = kwargs.pop(\"eos_token_id\", None)\n        self.decoder_start_token_id = kwargs.pop(\"decoder_start_token_id\", None)\n\n        # task specific arguments\n        self.task_specific_params = kwargs.pop(\"task_specific_params\", None)\n\n        # 
TPU arguments\n        self.xla_device = kwargs.pop(\"xla_device\", None)\n\n        # Additional attributes without default values\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    @property\n    def num_labels(self):\n        return len(self.id2label)\n\n    @num_labels.setter\n    def num_labels(self, num_labels):\n        self.id2label = {i: \"LABEL_{}\".format(i) for i in range(num_labels)}\n        self.label2id = dict(zip(self.id2label.values(), self.id2label.keys()))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save a configuration object to the directory `save_directory`, so that it\n        can be re-loaded using the :func:`~transformers1.PretrainedConfig.from_pretrained` class method.\n\n        Args:\n            save_directory (:obj:`string`):\n                Directory where the configuration JSON file will be saved.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_config_file = os.path.join(save_directory, CONFIG_NAME)\n\n        self.to_json_file(output_config_file, use_diff=True)\n        logger.info(\"Configuration saved in {}\".format(output_config_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs) -> \"PretrainedConfig\":\n        r\"\"\"\n\n        Instantiate a :class:`~transformers1.PretrainedConfig` (or a derived class) from a pre-trained model configuration.\n\n        Args:\n            pretrained_model_name_or_path (:obj:`string`):\n                either:\n                  - a string with the `shortcut name` of a pre-trained model configuration to load from cache or\n                    download, e.g.: ``bert-base-uncased``.\n                  - a string with the `identifier name` of a pre-trained model configuration that was user-uploaded to\n                    our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                  - a path to a `directory` containing a configuration file saved using the\n                    :func:`~transformers1.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                  - a path or url to a saved configuration JSON `file`, e.g.:\n                    ``./my_model_directory/configuration.json``.\n            cache_dir (:obj:`string`, `optional`):\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            kwargs (:obj:`Dict[str, any]`, `optional`):\n                The values in kwargs of any keys which are configuration attributes will be used to override the loaded\n                values. 
Behavior concerning key/value pairs whose keys are *not* configuration attributes is\n                controlled by the `return_unused_kwargs` keyword parameter.\n            force_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Whether to force (re-)downloading the model weights and configuration files, overriding the cached versions if they exist.\n            resume_download (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Do not delete an incompletely received file. Attempt to resume the download if such a file exists.\n            proxies (:obj:`Dict`, `optional`):\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.:\n                :obj:`{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.`\n                The proxies are used on each request.\n            return_unused_kwargs: (`optional`) bool:\n                If False, then this function returns just the final configuration object.\n                If True, then this function returns a :obj:`Tuple(config, unused_kwargs)` where `unused_kwargs` is a\n                dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e. the part\n                of kwargs which has not been used to update `config` and is otherwise ignored.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        Examples::\n\n            # We can't directly instantiate the base class `PretrainedConfig`, so let's show the examples on a\n            # derived class: BertConfig\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. 
config (or model) was saved using `save_pretrained('./test/saved_model/')`\n            config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')\n            config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n            assert config.output_attention == True\n            config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,\n                                                               foo=False, return_unused_kwargs=True)\n            assert config.output_attention == True\n            assert unused_kwargs == {'foo': False}\n\n        \"\"\"\n        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)\n        return cls.from_dict(config_dict, **kwargs)\n\n    @classmethod\n    def get_config_dict(cls, pretrained_model_name_or_path: str, **kwargs) -> Tuple[Dict, Dict]:\n        \"\"\"\n        From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters, to be used\n        for instantiating a Config using `from_dict`.\n\n        Parameters:\n            pretrained_model_name_or_path (:obj:`string`):\n                The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.\n\n        Returns:\n            :obj:`Tuple[Dict, Dict]`: The dictionary that will be used to instantiate the configuration object.\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        if os.path.isdir(pretrained_model_name_or_path):\n            config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            config_file = pretrained_model_name_or_path\n        else:\n            config_file = hf_bucket_url(pretrained_model_name_or_path, filename=CONFIG_NAME, use_cdn=False)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_config_file = cached_path(\n                config_file,\n                cache_dir=cache_dir,\n                force_download=force_download,\n                proxies=proxies,\n                resume_download=resume_download,\n                local_files_only=local_files_only,\n            )\n            # Load config dict\n            if resolved_config_file is None:\n                raise EnvironmentError\n            config_dict = cls._dict_from_json_file(resolved_config_file)\n\n        except EnvironmentError:\n            msg = (\n                f\"Can't load config for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\\n\\n\"\n            )\n            raise EnvironmentError(msg)\n\n        except json.JSONDecodeError:\n            msg = (\n                \"Couldn't reach server at '{}' to download configuration file or \"\n                \"configuration file is not a valid JSON file. 
\"\n                \"Please check network or file content here: {}.\".format(config_file, resolved_config_file)\n            )\n            raise EnvironmentError(msg)\n\n        if resolved_config_file == config_file:\n            logger.info(\"loading configuration file {}\".format(config_file))\n        else:\n            logger.info(\"loading configuration file {} from cache at {}\".format(config_file, resolved_config_file))\n\n        return config_dict, kwargs\n\n    @classmethod\n    def from_dict(cls, config_dict: Dict, **kwargs) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from a Python dictionary of parameters.\n\n        Args:\n            config_dict (:obj:`Dict[str, any]`):\n                Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved\n                from a pre-trained checkpoint by leveraging the :func:`~transformers1.PretrainedConfig.get_config_dict`\n                method.\n            kwargs (:obj:`Dict[str, any]`):\n                Additional parameters from which to initialize the configuration object.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n        \"\"\"\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        config = cls(**config_dict)\n\n        if hasattr(config, \"pruned_heads\"):\n            config.pruned_heads = dict((int(key), value) for key, value in config.pruned_heads.items())\n\n        # Update config with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(config, key):\n                setattr(config, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model config %s\", str(config))\n        if return_unused_kwargs:\n            return config, kwargs\n        else:\n            return config\n\n    @classmethod\n    def from_json_file(cls, json_file: str) -> \"PretrainedConfig\":\n        \"\"\"\n        Constructs a `Config` from the path to a json file of parameters.\n\n        Args:\n            json_file (:obj:`string`):\n                Path to the JSON file containing the parameters.\n\n        Returns:\n            :class:`PretrainedConfig`: An instance of a configuration object\n\n        \"\"\"\n        config_dict = cls._dict_from_json_file(json_file)\n        return cls(**config_dict)\n\n    @classmethod\n    def _dict_from_json_file(cls, json_file: str):\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        return json.loads(text)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return \"{} {}\".format(self.__class__.__name__, self.to_json_string())\n\n    def to_diff_dict(self):\n        \"\"\"\n        Removes all attributes from config which correspond to the default\n        config attributes for better readability and serializes to a Python\n        dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        config_dict = self.to_dict()\n\n        # get the default config dict\n        default_config_dict = PretrainedConfig().to_dict()\n\n        serializable_config_dict = {}\n\n        # only serialize values that differ from the default config\n        for key, value in 
config_dict.items():\n            if key not in default_config_dict or value != default_config_dict[key]:\n                serializable_config_dict[key] = value\n\n        return serializable_config_dict\n\n    def to_dict(self):\n        \"\"\"\n        Serializes this instance to a Python dictionary.\n\n        Returns:\n            :obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,\n        \"\"\"\n        output = copy.deepcopy(self.__dict__)\n        if hasattr(self.__class__, \"model_type\"):\n            output[\"model_type\"] = self.__class__.model_type\n        return output\n\n    def to_json_string(self, use_diff=True):\n        \"\"\"\n        Serializes this instance to a JSON string.\n\n        Args:\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON string.\n\n        Returns:\n            :obj:`string`: String containing all the attributes that make up this configuration instance in JSON format.\n        \"\"\"\n        if use_diff is True:\n            config_dict = self.to_diff_dict()\n        else:\n            config_dict = self.to_dict()\n        return json.dumps(config_dict, indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path, use_diff=True):\n        \"\"\"\n        Save this instance to a json file.\n\n        Args:\n            json_file_path (:obj:`string`):\n                Path to the JSON file in which this configuration instance's parameters will be saved.\n            use_diff (:obj:`bool`):\n                If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON file.\n        \"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(self.to_json_string(use_diff=use_diff))\n\n    def update(self, config_dict: Dict):\n        \"\"\"\n        Updates attributes of this class\n        with attributes from `config_dict`.\n\n        Args:\n            :obj:`Dict[str, any]`: Dictionary of attributes that shall be updated for this class.\n        \"\"\"\n        for key, value in config_dict.items():\n            setattr(self, key, value)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-config.json\",\n    \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-config.json\",\n    \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-config.json\",\n    \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-config.json\",\n    \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-config.json\",\n    \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-config.json\",\n    \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-config.json\",\n    \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-config.json\",\n    \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-config.json\",\n    \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-config.json\",\n}\n\n\nclass XLMConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLMModel`.\n        It is used to instantiate an XLM model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 30145):\n                Vocabulary size of the XLM model. 
Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLMModel`.\n            emb_dim (:obj:`int`, optional, defaults to 2048):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layers (:obj:`int`, optional, defaults to 12):\n                Number of hidden layers in the Transformer encoder.\n            n_heads (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected\n                layers in the embeddings, encoder, and pooler.\n            attention_dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for the attention mechanism.\n            gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If set to `True`, \"gelu\" will be used instead of \"relu\".\n            sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.\n            causal (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Set this to `True` for the model to behave in a causal manner.\n                Causal models use a triangular attention mask in order to only attend to the left-side context instead\n                of a bidirectional context.\n            asm (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction\n                layer.\n            n_langs (:obj:`int`, optional, defaults to 1):\n                The number of languages the model handles. Set to 1 for monolingual models.\n            use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether to use language embeddings. Some models use additional language embeddings, see\n                `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__\n                for information on how to use them.\n            max_position_embeddings (:obj:`int`, optional, defaults to 512):\n                The maximum sequence length that this model might\n                ever be used with. 
Typically set this to something large just in case\n                (e.g., 512 or 1024 or 2048).\n            embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):\n                The standard deviation of the truncated_normal_initializer for\n                initializing the embedding matrices.\n            init_std (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for\n                initializing all weight matrices except the embedding matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            bos_index (:obj:`int`, optional, defaults to 0):\n                The index of the beginning of sentence token in the vocabulary.\n            eos_index (:obj:`int`, optional, defaults to 1):\n                The index of the end of sentence token in the vocabulary.\n            pad_index (:obj:`int`, optional, defaults to 2):\n                The index of the padding token in the vocabulary.\n            unk_index (:obj:`int`, optional, defaults to 3):\n                The index of the unknown token in the vocabulary.\n            mask_index (:obj:`int`, optional, defaults to 5):\n                The index of the masking token in the vocabulary.\n            is_encoder (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.\n            summary_type (:obj:`string`, optional, defaults to \"first\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Is one of the following options:\n\n                - 'last' => take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_first_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLMForSequenceClassification`.\n                Add a dropout before the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            mask_token_id (:obj:`int`, optional, defaults to 0):\n                Model agnostic parameter to identify masked tokens when generating text in an MLM context.\n            lang_id (:obj:`int`, optional, defaults to 1):\n                The ID of the language used by the model. This parameter is used when generating\n                text in a given language.\n\n        Example::\n\n            from transformers1 import XLMConfig, XLMModel\n\n            # Initializing a XLM configuration\n            configuration = XLMConfig()\n\n            # Initializing a model from the configuration\n            model = XLMModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlm\"\n\n    def __init__(\n        self,\n        vocab_size=30145,\n        emb_dim=2048,\n        n_layers=12,\n        n_heads=16,\n        dropout=0.1,\n        attention_dropout=0.1,\n        gelu_activation=True,\n        sinusoidal_embeddings=False,\n        causal=False,\n        asm=False,\n        n_langs=1,\n        use_lang_emb=True,\n        max_position_embeddings=512,\n        embed_init_std=2048 ** -0.5,\n        layer_norm_eps=1e-12,\n        init_std=0.02,\n        bos_index=0,\n        eos_index=1,\n        pad_index=2,\n        unk_index=3,\n        mask_index=5,\n        is_encoder=True,\n        summary_type=\"first\",\n        summary_use_proj=True,\n        summary_activation=None,\n        summary_proj_to_labels=True,\n        summary_first_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        mask_token_id=0,\n        lang_id=0,\n        pad_token_id=2,\n        bos_token_id=0,\n        **kwargs\n    ):\n        \"\"\"Constructs XLMConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.emb_dim = emb_dim\n        self.n_layers = n_layers\n        self.n_heads = n_heads\n        self.dropout = dropout\n        self.attention_dropout = attention_dropout\n        self.gelu_activation = gelu_activation\n        self.sinusoidal_embeddings = sinusoidal_embeddings\n        self.causal = causal\n        self.asm = asm\n        self.n_langs = n_langs\n        self.use_lang_emb = use_lang_emb\n        self.layer_norm_eps = layer_norm_eps\n        self.bos_index = bos_index\n        self.eos_index = eos_index\n        self.pad_index = pad_index\n        self.unk_index = unk_index\n        self.mask_index = mask_index\n        self.is_encoder = is_encoder\n        self.max_position_embeddings = max_position_embeddings\n        self.embed_init_std = embed_init_std\n        self.init_std = init_std\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_proj_to_labels = summary_proj_to_labels\n        self.summary_first_dropout = summary_first_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n        
self.mask_token_id = mask_token_id\n        self.lang_id = lang_id\n\n        if \"n_words\" in kwargs:\n            self.n_words = kwargs[\"n_words\"]\n\n    @property\n    def n_words(self):  # For backward compatibility\n        return self.vocab_size\n\n    @n_words.setter\n    def n_words(self, value):  # For backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.emb_dim\n\n    @property\n    def num_attention_heads(self):\n        return self.n_heads\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layers\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLM-RoBERTa configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_roberta import RobertaConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-config.json\",\n    \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-config.json\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-config.json\",\n    \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-config.json\",\n}\n\n\nclass XLMRobertaConfig(RobertaConfig):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaConfig`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    model_type = \"xlm-roberta\"\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/configuration_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XLNet configuration \"\"\"\n\n\nimport logging\n\nfrom .configuration_utils import PretrainedConfig\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {\n    \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-config.json\",\n    \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-config.json\",\n}\n\n\nclass XLNetConfig(PretrainedConfig):\n    \"\"\"\n        This is the configuration class to store the configuration of a :class:`~transformers1.XLNetModel`.\n        It is used to instantiate an XLNet model according to the specified arguments, defining the model\n        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of\n        the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.\n\n        Configuration objects inherit from  :class:`~transformers1.PretrainedConfig` and can be used\n        to control the model outputs. Read the documentation from  :class:`~transformers1.PretrainedConfig`\n        for more information.\n\n        Args:\n            vocab_size (:obj:`int`, optional, defaults to 32000):\n                Vocabulary size of the XLNet model. Defines the different tokens that\n                can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers1.XLNetModel`.\n            d_model (:obj:`int`, optional, defaults to 1024):\n                Dimensionality of the encoder layers and the pooler layer.\n            n_layer (:obj:`int`, optional, defaults to 24):\n                Number of hidden layers in the Transformer encoder.\n            n_head (:obj:`int`, optional, defaults to 16):\n                Number of attention heads for each attention layer in the Transformer encoder.\n            d_inner (:obj:`int`, optional, defaults to 4096):\n                Dimensionality of the \"intermediate\" (i.e., feed-forward) layer in the Transformer encoder.\n            ff_activation (:obj:`string`, optional, defaults to \"gelu\"):\n                The non-linear activation function (function or string) in the\n                encoder and pooler. If string, \"gelu\", \"relu\" and \"swish\" are supported.\n            untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Untie relative position biases\n            attn_type (:obj:`string`, optional, defaults to \"bi\"):\n                The attention type used by the model. 
Set 'bi' for XLNet, 'uni' for Transformer-XL.\n            initializer_range (:obj:`float`, optional, defaults to 0.02):\n                The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n            layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):\n                The epsilon used by the layer normalization layers.\n            dropout (:obj:`float`, optional, defaults to 0.1):\n                The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n            mem_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens to cache. The key/value pairs that have already been pre-computed\n                in a previous forward pass won't be re-computed. See the\n                `quickstart <https://huggingface.co/transformers/quickstart.html#using-the-past>`__\n                for more information.\n            reuse_len (:obj:`int` or :obj:`None`, optional, defaults to :obj:`None`):\n                The number of tokens in the current batch to be cached and reused in the future.\n            bi_data (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use bidirectional input pipeline. Usually set to `True` during\n                pretraining and `False` during finetuning.\n            clamp_len (:obj:`int`, optional, defaults to -1):\n                Clamp all relative distances larger than clamp_len.\n                Setting this attribute to -1 means no clamping.\n            same_length (:obj:`boolean`, optional, defaults to :obj:`False`):\n                Whether to use the same attention length for each token.\n            summary_type (:obj:`string`, optional, defaults to \"last\"):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Is one of the following options:\n                    - 'last' => take the last token hidden state (like XLNet)\n                    - 'first' => take the first token hidden state (like Bert)\n                    - 'mean' => take the mean of all tokens hidden states\n                    - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                    - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a projection after the vector extraction\n            summary_activation (:obj:`string` or :obj:`None`, optional, defaults to :obj:`None`):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                'tanh' => add a tanh activation to the output, Other => no activation.\n            summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):\n                Argument used when doing sequence summary. 
Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_last_dropout (:obj:`float`, optional, defaults to 0.1):\n                Argument used when doing sequence summary. Used in for the multiple choice head in\n                :class:`~transformers1.XLNetForSequenceClassification` and :class:`~transformers1.XLNetForMultipleChoice`.\n                Add a dropout after the projection and activation\n            start_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n            end_n_top (:obj:`int`, optional, defaults to 5):\n                Used in the SQuAD evaluation script for XLM and XLNet.\n\n        Example::\n\n            from transformers1 import XLNetConfig, XLNetModel\n\n            # Initializing a XLNet configuration\n            configuration = XLNetConfig()\n\n            # Initializing a model from the configuration\n            model = XLNetModel(configuration)\n\n            # Accessing the model configuration\n            configuration = model.config\n    \"\"\"\n\n    model_type = \"xlnet\"\n\n    def __init__(\n        self,\n        vocab_size=32000,\n        d_model=1024,\n        n_layer=24,\n        n_head=16,\n        d_inner=4096,\n        ff_activation=\"gelu\",\n        untie_r=True,\n        attn_type=\"bi\",\n        initializer_range=0.02,\n        layer_norm_eps=1e-12,\n        dropout=0.1,\n        mem_len=None,\n        reuse_len=None,\n        bi_data=False,\n        clamp_len=-1,\n        same_length=False,\n        summary_type=\"last\",\n        summary_use_proj=True,\n        summary_activation=\"tanh\",\n        summary_last_dropout=0.1,\n        start_n_top=5,\n        end_n_top=5,\n        pad_token_id=5,\n        bos_token_id=1,\n        eos_token_id=2,\n        **kwargs\n    ):\n        \"\"\"Constructs XLNetConfig.\n        \"\"\"\n        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)\n        self.vocab_size = vocab_size\n        self.d_model = d_model\n        self.n_layer = n_layer\n        self.n_head = n_head\n        assert d_model % n_head == 0\n        self.d_head = d_model // n_head\n        self.ff_activation = ff_activation\n        self.d_inner = d_inner\n        self.untie_r = untie_r\n        self.attn_type = attn_type\n\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n\n        self.dropout = dropout\n        self.mem_len = mem_len\n        self.reuse_len = reuse_len\n        self.bi_data = bi_data\n        self.clamp_len = clamp_len\n        self.same_length = same_length\n\n        self.summary_type = summary_type\n        self.summary_use_proj = summary_use_proj\n        self.summary_activation = summary_activation\n        self.summary_last_dropout = summary_last_dropout\n        self.start_n_top = start_n_top\n        self.end_n_top = end_n_top\n\n        self.bos_token_id = bos_token_id\n        self.pad_token_id = pad_token_id\n        self.eos_token_id = eos_token_id\n\n    @property\n    def max_position_embeddings(self):\n        return -1\n\n    @property\n    def n_token(self):  # Backward compatibility\n        return self.vocab_size\n\n    @n_token.setter\n    def n_token(self, value):  # 
Backward compatibility\n        self.vocab_size = value\n\n    @property\n    def hidden_size(self):\n        return self.d_model\n\n    @property\n    def num_attention_heads(self):\n        return self.n_head\n\n    @property\n    def num_hidden_layers(self):\n        return self.n_layer\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_albert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ALBERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = AlbertConfig.from_json_file(albert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = AlbertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_albert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--albert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained ALBERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.albert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_bart_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BART checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nfrom pathlib import Path\n\nimport fairseq\nimport torch\nfrom packaging import version\n\nfrom transformers import (\n    BartConfig,\n    BartForConditionalGeneration,\n    BartForSequenceClassification,\n    BartModel,\n    BartTokenizer,\n)\nfrom transformers.modeling_bart import _make_linear_from_emb\n\n\nFAIRSEQ_MODELS = [\"bart.large\", \"bart.large.mnli\", \"bart.large.cnn\", \"bart_xsum/model.pt\"]\nextra_arch = {\"bart.large\": BartModel, \"bart.large.mnli\": BartForSequenceClassification}\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \" Hello world! cécé herlolip\"\n\nmnli_rename_keys = [\n    (\"model.classification_heads.mnli.dense.weight\", \"classification_head.dense.weight\"),\n    (\"model.classification_heads.mnli.dense.bias\", \"classification_head.dense.bias\"),\n    (\"model.classification_heads.mnli.out_proj.weight\", \"classification_head.out_proj.weight\"),\n    (\"model.classification_heads.mnli.out_proj.bias\", \"classification_head.out_proj.bias\"),\n]\n\n\ndef remove_ignore_keys_(state_dict):\n    ignore_keys = [\n        \"encoder.version\",\n        \"decoder.version\",\n        \"model.encoder.version\",\n        \"model.decoder.version\",\n        \"_float_tensor\",\n    ]\n    for k in ignore_keys:\n        state_dict.pop(k, None)\n\n\ndef rename_key(dct, old, new):\n    val = dct.pop(old)\n    dct[new] = val\n\n\ndef load_xsum_checkpoint(checkpoint_path):\n    \"\"\"Checkpoint path should end in model.pt\"\"\"\n    sd = torch.load(checkpoint_path, map_location=\"cpu\")\n    hub_interface = torch.hub.load(\"pytorch/fairseq\", \"bart.large.cnn\").eval()\n    hub_interface.model.load_state_dict(sd[\"model\"])\n    return hub_interface\n\n\ndef convert_checkpoint_from_disk(checkpoint_path, **config_kwargs):\n    state_dict = torch.load(checkpoint_path, map_location=\"cpu\")[\"model\"]\n    remove_ignore_keys_(state_dict)\n    vocab_size = state_dict[\"encoder.embed_tokens.weight\"].shape[0]\n    state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n    mbart_config = BartConfig(vocab_size=vocab_size, **config_kwargs)\n    model = BartForConditionalGeneration(mbart_config)\n    model.model.load_state_dict(state_dict)\n    if hasattr(model, \"lm_head\"):\n        model.lm_head = _make_linear_from_emb(model.model.shared)\n    return model\n\n\n@torch.no_grad()\ndef convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):\n    \"\"\"\n    Copy/paste/tweak model's weights to our BERT structure.\n    \"\"\"\n    if not os.path.exists(checkpoint_path):\n        bart = torch.hub.load(\"pytorch/fairseq\", checkpoint_path).eval()\n 
   else:\n        bart = load_xsum_checkpoint(checkpoint_path)\n\n    bart.model.upgrade_state_dict(bart.model.state_dict())\n    if hf_checkpoint_name is None:\n        hf_checkpoint_name = checkpoint_path.replace(\".\", \"-\")\n    config = BartConfig.from_pretrained(hf_checkpoint_name)\n    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)\n    tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors=\"pt\").unsqueeze(0)\n    assert torch.eq(tokens, tokens2).all()\n\n    if checkpoint_path == \"bart.large.mnli\":\n        state_dict = bart.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"model.shared.weight\"] = state_dict[\"model.decoder.embed_tokens.weight\"]\n        for src, dest in mnli_rename_keys:\n            rename_key(state_dict, src, dest)\n        model = BartForSequenceClassification(config).eval()\n        model.load_state_dict(state_dict)\n        fairseq_output = bart.predict(\"mnli\", tokens, return_logits=True)\n        new_model_outputs = model(tokens)[0]  # logits\n    else:  # no classification heads to worry about\n        state_dict = bart.model.state_dict()\n        remove_ignore_keys_(state_dict)\n        state_dict[\"shared.weight\"] = state_dict[\"decoder.embed_tokens.weight\"]\n        fairseq_output = bart.extract_features(tokens)\n        if hf_checkpoint_name == \"facebook/bart-large\":\n            model = BartModel(config).eval()\n            model.load_state_dict(state_dict)\n            new_model_outputs = model(tokens).model[0]\n        else:\n            model = BartForConditionalGeneration(config).eval()  # an existing summarization ckpt\n            model.model.load_state_dict(state_dict)\n            if hasattr(model, \"lm_head\"):\n                model.lm_head = _make_linear_from_emb(model.model.shared)\n            new_model_outputs = model.model(tokens)[0]\n\n    # Check results\n    assert fairseq_output.shape == new_model_outputs.shape\n    assert (fairseq_output == new_model_outputs).all().item()\n    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"fairseq_path\", type=str, help=\"bart.large, bart.large.cnn or a path to a model.pt on local filesystem.\"\n    )\n    parser.add_argument(\"pytorch_dump_folder_path\", default=None, type=str, help=\"Path to the output PyTorch model.\")\n    parser.add_argument(\n        \"--hf_config\", default=None, type=str, help=\"Which huggingface architecture to use: bart-large-xsum\"\n    )\n    args = parser.parse_args()\n    convert_bart_checkpoint(args.fairseq_path, args.pytorch_dump_folder_path, hf_checkpoint_name=args.hf_config)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_bert_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = BertConfig.from_json_file(bert_config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = BertForPreTraining(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_bert(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--bert_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.bert_config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_bert_pytorch_checkpoint_to_original_tf.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\"\"\"Convert Huggingface Pytorch checkpoint to Tensorflow checkpoint.\"\"\"\n\nimport argparse\nimport os\n\nimport numpy as np\nimport tensorflow as tf\nimport torch\n\nfrom transformers import BertModel\n\n\ndef convert_pytorch_checkpoint_to_tf(model: BertModel, ckpt_dir: str, model_name: str):\n\n    \"\"\"\n    :param model:BertModel Pytorch model instance to be converted\n    :param ckpt_dir: Tensorflow model directory\n    :param model_name: model name\n    :return:\n\n    Currently supported HF models:\n        Y BertModel\n        N BertForMaskedLM\n        N BertForPreTraining\n        N BertForMultipleChoice\n        N BertForNextSentencePrediction\n        N BertForSequenceClassification\n        N BertForQuestionAnswering\n    \"\"\"\n\n    tensors_to_transpose = (\"dense.weight\", \"attention.self.query\", \"attention.self.key\", \"attention.self.value\")\n\n    var_map = (\n        (\"layer.\", \"layer_\"),\n        (\"word_embeddings.weight\", \"word_embeddings\"),\n        (\"position_embeddings.weight\", \"position_embeddings\"),\n        (\"token_type_embeddings.weight\", \"token_type_embeddings\"),\n        (\".\", \"/\"),\n        (\"LayerNorm/weight\", \"LayerNorm/gamma\"),\n        (\"LayerNorm/bias\", \"LayerNorm/beta\"),\n        (\"weight\", \"kernel\"),\n    )\n\n    if not os.path.isdir(ckpt_dir):\n        os.makedirs(ckpt_dir)\n\n    state_dict = model.state_dict()\n\n    def to_tf_var_name(name: str):\n        for patt, repl in iter(var_map):\n            name = name.replace(patt, repl)\n        return \"bert/{}\".format(name)\n\n    def create_tf_var(tensor: np.ndarray, name: str, session: tf.Session):\n        tf_dtype = tf.dtypes.as_dtype(tensor.dtype)\n        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())\n        session.run(tf.variables_initializer([tf_var]))\n        session.run(tf_var)\n        return tf_var\n\n    tf.reset_default_graph()\n    with tf.Session() as session:\n        for var_name in state_dict:\n            tf_name = to_tf_var_name(var_name)\n            torch_tensor = state_dict[var_name].numpy()\n            if any([x in var_name for x in tensors_to_transpose]):\n                torch_tensor = torch_tensor.T\n            tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)\n            tf.keras.backend.set_value(tf_var, torch_tensor)\n            tf_weight = session.run(tf_var)\n            print(\"Successfully created {}: {}\".format(tf_name, np.allclose(tf_weight, torch_tensor)))\n\n        saver = tf.train.Saver(tf.trainable_variables())\n        saver.save(session, os.path.join(ckpt_dir, model_name.replace(\"-\", \"_\") + \".ckpt\"))\n\n\ndef main(raw_args=None):\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--model_name\", type=str, required=True, help=\"model name e.g. 
bert-base-uncased\")\n    parser.add_argument(\n        \"--cache_dir\", type=str, default=None, required=False, help=\"Directory containing pytorch model\"\n    )\n    parser.add_argument(\"--pytorch_model_path\", type=str, required=True, help=\"/path/to/<pytorch-model-name>.bin\")\n    parser.add_argument(\"--tf_cache_dir\", type=str, required=True, help=\"Directory in which to save tensorflow model\")\n    args = parser.parse_args(raw_args)\n\n    model = BertModel.from_pretrained(\n        pretrained_model_name_or_path=args.model_name,\n        state_dict=torch.load(args.pytorch_model_path),\n        cache_dir=args.cache_dir,\n    )\n\n    convert_pytorch_checkpoint_to_tf(model=model, ckpt_dir=args.tf_cache_dir, model_name=args.model_name)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_dialogpt_original_pytorch_checkpoint_to_pytorch.py",
    "content": "import argparse\nimport os\n\nimport torch\n\nfrom transformers.file_utils import WEIGHTS_NAME\n\n\nDIALOGPT_MODELS = [\"small\", \"medium\", \"large\"]\n\nOLD_KEY = \"lm_head.decoder.weight\"\nNEW_KEY = \"lm_head.weight\"\n\n\ndef convert_dialogpt_checkpoint(checkpoint_path: str, pytorch_dump_folder_path: str):\n    d = torch.load(checkpoint_path)\n    d[NEW_KEY] = d.pop(OLD_KEY)\n    os.makedirs(pytorch_dump_folder_path, exist_ok=True)\n    torch.save(d, os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--dialogpt_path\", default=\".\", type=str)\n    args = parser.parse_args()\n    for MODEL in DIALOGPT_MODELS:\n        checkpoint_path = os.path.join(args.dialogpt_path, f\"{MODEL}_ft.pkl\")\n        pytorch_dump_folder_path = f\"./DialoGPT-{MODEL}\"\n        convert_dialogpt_checkpoint(\n            checkpoint_path, pytorch_dump_folder_path,\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_electra_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert ELECTRA checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining, load_tf_weights_in_electra\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path, discriminator_or_generator):\n    # Initialise PyTorch model\n    config = ElectraConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n\n    if discriminator_or_generator == \"discriminator\":\n        model = ElectraForPreTraining(config)\n    elif discriminator_or_generator == \"generator\":\n        model = ElectraForMaskedLM(config)\n    else:\n        raise ValueError(\"The discriminator_or_generator argument should be either 'discriminator' or 'generator'\")\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_electra(\n        model, config, tf_checkpoint_path, discriminator_or_generator=discriminator_or_generator\n    )\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--discriminator_or_generator\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Whether to export the generator or the discriminator. Should be a string, either 'discriminator' or \"\n        \"'generator'.\",\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path, args.discriminator_or_generator\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_gpt2_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, GPT2Config, GPT2Model, load_tf_weights_in_gpt2\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if gpt2_config_file == \"\":\n        config = GPT2Config()\n    else:\n        config = GPT2Config.from_json_file(gpt2_config_file)\n    model = GPT2Model(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--gpt2_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--gpt2_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_gpt2_checkpoint_to_pytorch(args.gpt2_checkpoint_path, args.gpt2_config_file, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_graph_to_onnx.py",
    "content": "from argparse import ArgumentParser\nfrom os import listdir, makedirs\nfrom os.path import abspath, dirname, exists\nfrom typing import Dict, List, Optional, Tuple\n\nfrom transformers import is_tf_available, is_torch_available\nfrom transformers.pipelines import Pipeline, pipeline\nfrom transformers.tokenization_utils import BatchEncoding\n\n\nclass OnnxConverterArgumentParser(ArgumentParser):\n    \"\"\"\n    Wraps all the script arguments supported to export transformers1 models to ONNX IR\n    \"\"\"\n\n    def __init__(self):\n        super(OnnxConverterArgumentParser, self).__init__(\"ONNX Converter\")\n\n        self.add_argument(\"--model\", type=str, required=True, help=\"Model's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--tokenizer\", type=str, help=\"Tokenizer's id or path (ex: bert-base-cased)\")\n        self.add_argument(\"--framework\", type=str, choices=[\"pt\", \"tf\"], help=\"Framework for loading the model\")\n        self.add_argument(\"--opset\", type=int, default=11, help=\"ONNX opset to use\")\n        self.add_argument(\"--check-loading\", action=\"store_true\", help=\"Check ONNX is able to load the model\")\n        self.add_argument(\"--use-external-format\", action=\"store_true\", help=\"Allow exporting model >= than 2Gb\")\n        self.add_argument(\"output\")\n\n\ndef ensure_valid_input(model, tokens, input_names):\n    \"\"\"\n    Ensure input are presented in the correct order, without any None\n    Args:\n        model: The model used to forward the input data\n        tokens: BatchEncoding holding the input data\n        input_names: The name of the inputs\n\n    Returns: Tuple\n\n    \"\"\"\n    model_args_name = model.forward.__code__.co_varnames\n\n    ordered_input_names = []\n    model_args = []\n    for arg_name in model_args_name[1:]:  # start at index 1 to skip \"self\" argument\n        if arg_name in input_names:\n            ordered_input_names.append(arg_name)\n            model_args.append(tokens[arg_name])\n        else:\n            break\n\n    return ordered_input_names, tuple(model_args)\n\n\ndef infer_shapes(nlp: Pipeline, framework: str) -> Tuple[List[str], List[str], Dict, BatchEncoding]:\n    def build_shape_dict(tensor, is_input: bool, seq_len: int):\n        if isinstance(tensor, (tuple, list)):\n            return [build_shape_dict(t, is_input, seq_len) for t in tensor]\n\n        else:\n            # Let's assume batch is the first axis with only 1 element (~~ might not be always true ...)\n            axes = {[axis for axis, numel in enumerate(tensor.shape) if numel == 1][0]: \"batch\"}\n            if is_input:\n                if len(tensor.shape) == 2:\n                    axes[1] = \"sequence\"\n                else:\n                    raise ValueError(\"Unable to infer tensor axes ({})\".format(len(tensor.shape)))\n            else:\n                seq_axes = [dim for dim, shape in enumerate(tensor.shape) if shape == seq_len]\n                axes.update({dim: \"sequence\" for dim in seq_axes})\n\n        return axes\n\n    tokens = nlp.tokenizer.encode_plus(\"This is a sample output\", return_tensors=framework)\n    seq_len = tokens.input_ids.shape[-1]\n    outputs = nlp.model(**tokens) if framework == \"pt\" else nlp.model(tokens)\n\n    if not isinstance(outputs, (list, tuple)):\n        outputs = (outputs,)\n\n    # Generate input names & axes\n    input_vars = list(tokens.keys())\n    input_dynamic_axes = {k: build_shape_dict(v, True, seq_len) for k, v in tokens.items()}\n\n    
# flatten potentially grouped outputs (past for gpt2, attentions)\n    outputs_flat = []\n    for output in outputs:\n        if isinstance(output, (tuple, list)):\n            outputs_flat.extend(output)\n        else:\n            outputs_flat.append(output)\n\n    # Generate output names & axes\n    output_names = [\"output_{}\".format(i) for i in range(len(outputs_flat))]\n    output_dynamic_axes = {k: build_shape_dict(v, False, seq_len) for k, v in zip(output_names, outputs_flat)}\n\n    # Create the aggregated axes representation\n    dynamic_axes = dict(input_dynamic_axes, **output_dynamic_axes)\n    return input_vars, output_names, dynamic_axes, tokens\n\n\ndef load_graph_from_args(framework: str, model: str, tokenizer: Optional[str] = None) -> Pipeline:\n    # If no tokenizer provided\n    if tokenizer is None:\n        tokenizer = model\n\n    print(\"Loading pipeline (model: {}, tokenizer: {})\".format(model, tokenizer))\n\n    # Allocate tokenizer and model\n    return pipeline(\"feature-extraction\", model=model, tokenizer=tokenizer, framework=framework)\n\n\ndef convert_pytorch(nlp: Pipeline, opset: int, output: str, use_external_format: bool):\n    if not is_torch_available():\n        raise Exception(\"Cannot convert because PyTorch is not installed. Please install torch first.\")\n\n    import torch\n    from torch.onnx import export\n\n    print(\"PyTorch: {}\".format(torch.__version__))\n\n    with torch.no_grad():\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"pt\")\n        ordered_input_names, model_args = ensure_valid_input(nlp.model, tokens, input_names)\n\n        export(\n            nlp.model,\n            model_args,\n            f=output,\n            input_names=ordered_input_names,\n            output_names=output_names,\n            dynamic_axes=dynamic_axes,\n            do_constant_folding=True,\n            use_external_data_format=use_external_format,\n            enable_onnx_checker=True,\n            opset_version=opset,\n        )\n\n\ndef convert_tensorflow(nlp: Pipeline, opset: int, output: str):\n    if not is_tf_available():\n        raise Exception(\n            \"Cannot convert {} because TF is not installed. Please install torch first.\".format(args.model)\n        )\n\n    print(\"/!\\\\ Please note TensorFlow doesn't support exporting model > 2Gb /!\\\\\")\n\n    try:\n        import tensorflow as tf\n        from keras2onnx import convert_keras, save_model, __version__ as k2ov\n\n        print(\"TensorFlow: {}, keras2onnx: {}\".format(tf.version.VERSION, k2ov))\n\n        # Build\n        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, \"tf\")\n\n        # Forward\n        nlp.model.predict(tokens.data)\n        onnx_model = convert_keras(nlp.model, nlp.model.name, target_opset=opset)\n        save_model(onnx_model, output)\n\n    except ImportError as e:\n        raise Exception(\n            \"Cannot import {} required to convert TF model to ONNX. 
Please install {} first.\".format(e.name, e.name)\n        )\n\n\ndef convert(\n    framework: str,\n    model: str,\n    output: str,\n    opset: int,\n    tokenizer: Optional[str] = None,\n    use_external_format: bool = False,\n):\n    print(\"ONNX opset version set to: {}\".format(opset))\n\n    # Load the pipeline\n    nlp = load_graph_from_args(framework, model, tokenizer)\n\n    parent = dirname(output)\n    if not exists(parent):\n        print(\"Creating folder {}\".format(parent))\n        makedirs(parent)\n    elif len(listdir(parent)) > 0:\n        raise Exception(\"Folder {} is not empty, aborting conversion\".format(parent))\n\n    # Export the graph\n    if framework == \"pt\":\n        convert_pytorch(nlp, opset, output, use_external_format)\n    else:\n        convert_tensorflow(nlp, opset, output)\n\n\ndef verify(path: str):\n    from onnxruntime import InferenceSession, SessionOptions\n    from onnxruntime.capi.onnxruntime_pybind11_state import RuntimeException\n\n    print(\"Checking ONNX model loading from: {}\".format(path))\n    try:\n        onnx_options = SessionOptions()\n        _ = InferenceSession(path, onnx_options, providers=[\"CPUExecutionProvider\"])\n        print(\"Model correctly loaded\")\n    except RuntimeException as re:\n        print(\"Error while loading the model: {}\".format(re))\n\n\nif __name__ == \"__main__\":\n    parser = OnnxConverterArgumentParser()\n    args = parser.parse_args()\n\n    # Make sure output is absolute path\n    args.output = abspath(args.output)\n\n    try:\n        # Convert\n        convert(args.framework, args.model, args.output, args.opset, args.tokenizer, args.use_external_format)\n\n        # And verify\n        if args.check_loading:\n            verify(args.output)\n    except Exception as e:\n        print(\"Error while converting the model: {}\".format(e))\n        exit(1)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_longformer_original_pytorch_lightning_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\n\nimport pytorch_lightning as pl\nimport torch\n\nfrom transformers.modeling_longformer import LongformerForQuestionAnswering, LongformerModel\n\n\nclass LightningModel(pl.LightningModule):\n    def __init__(self, model):\n        super().__init__()\n        self.model = model\n        self.num_labels = 2\n        self.qa_outputs = torch.nn.Linear(self.model.config.hidden_size, self.num_labels)\n\n    # implement only because lighning requires to do so\n    def forward(self):\n        pass\n\n\ndef convert_longformer_qa_checkpoint_to_pytorch(\n    longformer_model: str, longformer_question_answering_ckpt_path: str, pytorch_dump_folder_path: str\n):\n\n    # load longformer model from model identifier\n    longformer = LongformerModel.from_pretrained(longformer_model)\n    lightning_model = LightningModel(longformer)\n\n    ckpt = torch.load(longformer_question_answering_ckpt_path, map_location=torch.device(\"cpu\"))\n    lightning_model.load_state_dict(ckpt[\"state_dict\"])\n\n    # init longformer question answering model\n    longformer_for_qa = LongformerForQuestionAnswering.from_pretrained(longformer_model)\n\n    # transfer weights\n    longformer_for_qa.longformer.load_state_dict(lightning_model.model.state_dict())\n    longformer_for_qa.qa_outputs.load_state_dict(lightning_model.qa_outputs.state_dict())\n    longformer_for_qa.eval()\n\n    # save model\n    longformer_for_qa.save_pretrained(pytorch_dump_folder_path)\n\n    print(\"Conversion succesful. Model saved under {}\".format(pytorch_dump_folder_path))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--longformer_model\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"model identifier of longformer. Should be either `longformer-base-4096` or `longformer-large-4096`.\",\n    )\n    parser.add_argument(\n        \"--longformer_question_answering_ckpt_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path the official PyTorch Lighning Checkpoint.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_longformer_qa_checkpoint_to_pytorch(\n        args.longformer_model, args.longformer_question_answering_ckpt_path, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_marian_to_pytorch.py",
    "content": "import argparse\nimport json\nimport os\nimport shutil\nimport warnings\nfrom pathlib import Path\nfrom typing import Dict, List, Union\nfrom zipfile import ZipFile\n\nimport numpy as np\nimport torch\nfrom tqdm import tqdm\n\nfrom transformers import MarianConfig, MarianMTModel, MarianTokenizer\nfrom transformers.hf_api import HfApi\n\n\ndef remove_prefix(text: str, prefix: str):\n    if text.startswith(prefix):\n        return text[len(prefix) :]\n    return text  # or whatever\n\n\ndef convert_encoder_layer(opus_dict, layer_prefix: str, converter: dict):\n    sd = {}\n    for k in opus_dict:\n        if not k.startswith(layer_prefix):\n            continue\n        stripped = remove_prefix(k, layer_prefix)\n        v = opus_dict[k].T  # besides embeddings, everything must be transposed.\n        sd[converter[stripped]] = torch.tensor(v).squeeze()\n    return sd\n\n\ndef load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is_decoder=False):\n    for i, layer in enumerate(layer_lst):\n        layer_tag = f\"decoder_l{i + 1}_\" if is_decoder else f\"encoder_l{i + 1}_\"\n        sd = convert_encoder_layer(opus_state, layer_tag, converter)\n        layer.load_state_dict(sd, strict=True)\n\n\ndef find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:\n    \"\"\"Find models that can accept src_lang as input and return tgt_lang as output.\"\"\"\n    prefix = \"Helsinki-NLP/opus-mt-\"\n    api = HfApi()\n    model_list = api.model_list()\n    model_ids = [x.modelId for x in model_list if x.modelId.startswith(\"Helsinki-NLP\")]\n    src_and_targ = [\n        remove_prefix(m, prefix).lower().split(\"-\") for m in model_ids if \"+\" not in m\n    ]  # + cant be loaded.\n    matching = [f\"{prefix}{a}-{b}\" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]\n    return matching\n\n\ndef add_emb_entries(wemb, final_bias, n_special_tokens=1):\n    vsize, d_model = wemb.shape\n    embs_to_add = np.zeros((n_special_tokens, d_model))\n    new_embs = np.concatenate([wemb, embs_to_add])\n    bias_to_add = np.zeros((n_special_tokens, 1))\n    new_bias = np.concatenate((final_bias, bias_to_add), axis=1)\n    return new_embs, new_bias\n\n\ndef _cast_yaml_str(v):\n    bool_dct = {\"true\": True, \"false\": False}\n    if not isinstance(v, str):\n        return v\n    elif v in bool_dct:\n        return bool_dct[v]\n    try:\n        return int(v)\n    except (TypeError, ValueError):\n        return v\n\n\ndef cast_marian_config(raw_cfg: Dict[str, str]) -> Dict:\n    return {k: _cast_yaml_str(v) for k, v in raw_cfg.items()}\n\n\nCONFIG_KEY = \"special:model.yml\"\n\n\ndef load_config_from_state_dict(opus_dict):\n    import yaml\n\n    cfg_str = \"\".join([chr(x) for x in opus_dict[CONFIG_KEY]])\n    yaml_cfg = yaml.load(cfg_str[:-1], Loader=yaml.BaseLoader)\n    return cast_marian_config(yaml_cfg)\n\n\ndef find_model_file(dest_dir):  # this one better\n    model_files = list(Path(dest_dir).glob(\"*.npz\"))\n    assert len(model_files) == 1, model_files\n    model_file = model_files[0]\n    return model_file\n\n\n# Group Names Logic: change long opus model names to something shorter, like opus-mt-en-ROMANCE\nROM_GROUP = \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\"\nGROUPS = [\n    (\"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\", \"ZH\"),\n    
(ROM_GROUP, \"ROMANCE\"),\n    (\"de+nl+fy+af+da+fo+is+no+nb+nn+sv\", \"NORTH_EU\"),\n    (\"da+fo+is+no+nb+nn+sv\", \"SCANDINAVIA\"),\n    (\"se+sma+smj+smn+sms\", \"SAMI\"),\n    (\"nb_NO+nb+nn_NO+nn+nog+no_nb+no\", \"NORWAY\"),\n    (\"ga+cy+br+gd+kw+gv\", \"CELTIC\"),  # https://en.wikipedia.org/wiki/Insular_Celtic_languages\n]\nGROUP_TO_OPUS_NAME = {\n    \"opus-mt-ZH-de\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de\",\n    \"opus-mt-ZH-fi\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-fi\",\n    \"opus-mt-ZH-sv\": \"cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-sv\",\n    \"opus-mt-SCANDINAVIA-SCANDINAVIA\": \"da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-NORTH_EU-NORTH_EU\": \"de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv\",\n    \"opus-mt-de-ZH\": \"de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-en_el_es_fi-en_el_es_fi\": \"en+el+es+fi-en+el+es+fi\",\n    \"opus-mt-en-ROMANCE\": \"en-fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la\",\n    \"opus-mt-en-CELTIC\": \"en-ga+cy+br+gd+kw+gv\",\n    \"opus-mt-es-NORWAY\": \"es-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-fi_nb_no_nn_ru_sv_en-SAMI\": \"fi+nb+no+nn+ru+sv+en-se+sma+smj+smn+sms\",\n    \"opus-mt-fi-ZH\": \"fi-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-fi-NORWAY\": \"fi-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n    \"opus-mt-ROMANCE-en\": \"fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO\"\n    \"+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR\"\n    \"+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la-en\",\n    \"opus-mt-CELTIC-en\": \"ga+cy+br+gd+kw+gv-en\",\n    \"opus-mt-sv-ZH\": \"sv-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh\",\n    \"opus-mt-sv-NORWAY\": \"sv-nb_NO+nb+nn_NO+nn+nog+no_nb+no\",\n}\nOPUS_GITHUB_URL = \"https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/\"\nORG_NAME = \"Helsinki-NLP/\"\n\n\ndef convert_opus_name_to_hf_name(x):\n    for substr, grp_name in GROUPS:\n        x = x.replace(substr, grp_name)\n    return x.replace(\"+\", \"_\")\n\n\ndef convert_hf_name_to_opus_name(hf_model_name):\n    \"\"\"Relies on the assumption that there are no language codes like pt_br in models that are not in GROUP_TO_OPUS_NAME.\"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    if hf_model_name in GROUP_TO_OPUS_NAME:\n        opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]\n    else:\n        opus_w_prefix = hf_model_name.replace(\"_\", \"+\")\n    return remove_prefix(opus_w_prefix, \"opus-mt-\")\n\n\ndef write_model_card(\n    hf_model_name: str,\n    repo_path=\"OPUS-MT-train/models/\",\n    dry_run=False,\n    model_card_dir=Path(\"marian_converted/model_cards/Helsinki-NLP/\"),\n) -> str:\n    \"\"\"Copy the most recent model's readme section from opus, and add metadata.\n    upload command: s3cmd sync --recursive model_card_dir s3://models.huggingface.co/bert/Helsinki-NLP/\n    \"\"\"\n    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)\n    opus_name: str = convert_hf_name_to_opus_name(hf_model_name)\n    opus_src, opus_tgt = [x.split(\"+\") for x in opus_name.split(\"-\")]\n    readme_url = 
OPUS_GITHUB_URL + f\"{opus_name}/README.md\"\n    s, t = \",\".join(opus_src), \",\".join(opus_tgt)\n    extra_markdown = f\"### {hf_model_name}\\n\\n* source languages: {s}\\n* target languages: {t}\\n*  OPUS readme: [{opus_name}]({readme_url})\\n\"\n    # combine with opus markdown\n    opus_readme_path = Path(f\"{repo_path}{opus_name}/README.md\")\n    assert opus_readme_path.exists(), opus_readme_path\n    content = opus_readme_path.open().read()\n    content = content.split(\"\\n# \")[-1]  # Get the lowest level 1 header in the README -- the most recent model.\n    content = \"*\".join(content.split(\"*\")[1:])\n    content = extra_markdown + \"\\n* \" + content.replace(\"download\", \"download original weights\")\n    if dry_run:\n        return content\n    # Save string to model_cards/hf_model_name/readme.md\n    model_card_dir.mkdir(exist_ok=True)\n    sub_dir = model_card_dir / hf_model_name\n    sub_dir.mkdir(exist_ok=True)\n    dest = sub_dir / \"README.md\"\n    dest.open(\"w\").write(content)\n    return content\n\n\ndef get_clean_model_id_mapping(multiling_model_ids):\n    return {x: convert_opus_name_to_hf_name(x) for x in multiling_model_ids}\n\n\ndef make_registry(repo_path=\"Opus-MT-train/models\"):\n    if not (Path(repo_path) / \"fr-en\" / \"README.md\").exists():\n        raise ValueError(\n            f\"repo_path:{repo_path} does not exist: \"\n            \"You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling.\"\n        )\n    results = {}\n    for p in Path(repo_path).ls():\n        n_dash = p.name.count(\"-\")\n        if n_dash == 0:\n            continue\n        else:\n            lns = list(open(p / \"README.md\").readlines())\n            results[p.name] = _parse_readme(lns)\n    return [(k, v[\"pre-processing\"], v[\"download\"], v[\"download\"][:-4] + \".test.txt\") for k, v in results.items()]\n\n\ndef convert_all_sentencepiece_models(model_list=None, repo_path=None):\n    \"\"\"Requires 300GB\"\"\"\n    save_dir = Path(\"marian_ckpt\")\n    dest_dir = Path(\"marian_converted\")\n    dest_dir.mkdir(exist_ok=True)\n    if model_list is None:\n        model_list: list = make_registry(repo_path=repo_path)\n    for k, prepro, download, test_set_url in tqdm(model_list):\n        if \"SentencePiece\" not in prepro:  # dont convert BPE models.\n            continue\n        if not os.path.exists(save_dir / k / \"pytorch_model.bin\"):\n            download_and_unzip(download, save_dir / k)\n        pair_name = convert_opus_name_to_hf_name(k)\n        convert(save_dir / k, dest_dir / f\"opus-mt-{pair_name}\")\n\n\ndef lmap(f, x) -> List:\n    return list(map(f, x))\n\n\ndef fetch_test_set(test_set_url):\n    import wget\n\n    fname = wget.download(test_set_url, \"opus_test.txt\")\n    lns = Path(fname).open().readlines()\n    src = lmap(str.strip, lns[::4])\n    gold = lmap(str.strip, lns[1::4])\n    mar_model = lmap(str.strip, lns[2::4])\n    assert len(gold) == len(mar_model) == len(src)\n    os.remove(fname)\n    return src, mar_model, gold\n\n\ndef convert_whole_dir(path=Path(\"marian_ckpt/\")):\n    for subdir in tqdm(list(path.ls())):\n        dest_dir = f\"marian_converted/{subdir.name}\"\n        if (dest_dir / \"pytorch_model.bin\").exists():\n            continue\n        convert(source_dir, dest_dir)\n\n\ndef _parse_readme(lns):\n    \"\"\"Get link and metadata from opus model card equivalent.\"\"\"\n    subres = {}\n    for ln in [x.strip() for x in lns]:\n        if not ln.startswith(\"*\"):\n            continue\n    
    ln = ln[1:].strip()\n\n        for k in [\"download\", \"dataset\", \"models\", \"model\", \"pre-processing\"]:\n            if ln.startswith(k):\n                break\n        else:\n            continue\n        if k in [\"dataset\", \"model\", \"pre-processing\"]:\n            splat = ln.split(\":\")\n            _, v = splat\n            subres[k] = v\n        elif k == \"download\":\n            v = ln.split(\"(\")[-1][:-1]\n            subres[k] = v\n    return subres\n\n\ndef save_tokenizer_config(dest_dir: Path):\n    dname = dest_dir.name.split(\"-\")\n    dct = dict(target_lang=dname[-1], source_lang=\"-\".join(dname[:-1]))\n    save_json(dct, dest_dir / \"tokenizer_config.json\")\n\n\ndef add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):\n    start = max(vocab.values()) + 1\n    added = 0\n    for tok in special_tokens:\n        if tok in vocab:\n            continue\n        vocab[tok] = start + added\n        added += 1\n    return added\n\n\ndef find_vocab_file(model_dir):\n    return list(model_dir.glob(\"*vocab.yml\"))[0]\n\n\ndef add_special_tokens_to_vocab(model_dir: Path) -> None:\n    vocab = load_yaml(find_vocab_file(model_dir))\n    vocab = {k: int(v) for k, v in vocab.items()}\n    num_added = add_to_vocab_(vocab, [\"<pad>\"])\n    print(f\"added {num_added} tokens to vocab\")\n    save_json(vocab, model_dir / \"vocab.json\")\n    save_tokenizer_config(model_dir)\n\n\ndef save_tokenizer(self, save_directory):\n    dest = Path(save_directory)\n    src_path = Path(self.init_kwargs[\"source_spm\"])\n\n    for dest_name in {\"source.spm\", \"target.spm\", \"tokenizer_config.json\"}:\n        shutil.copyfile(src_path.parent / dest_name, dest / dest_name)\n    save_json(self.encoder, dest / \"vocab.json\")\n\n\ndef check_equal(marian_cfg, k1, k2):\n    v1, v2 = marian_cfg[k1], marian_cfg[k2]\n    assert v1 == v2, f\"hparams {k1},{k2} differ: {v1} != {v2}\"\n\n\ndef check_marian_cfg_assumptions(marian_cfg):\n    assumed_settings = {\n        \"tied-embeddings-all\": True,\n        \"layer-normalization\": False,\n        \"right-left\": False,\n        \"transformer-ffn-depth\": 2,\n        \"transformer-aan-depth\": 2,\n        \"transformer-no-projection\": False,\n        \"transformer-postprocess-emb\": \"d\",\n        \"transformer-postprocess\": \"dan\",  # Dropout, add, normalize\n        \"transformer-preprocess\": \"\",\n        \"type\": \"transformer\",\n        \"ulr-dim-emb\": 0,\n        \"dec-cell-base-depth\": 2,\n        \"dec-cell-high-depth\": 1,\n        \"transformer-aan-nogate\": False,\n    }\n    for k, v in assumed_settings.items():\n        actual = marian_cfg[k]\n        assert actual == v, f\"Unexpected config value for {k} expected {v} got {actual}\"\n    check_equal(marian_cfg, \"transformer-ffn-activation\", \"transformer-aan-activation\")\n    check_equal(marian_cfg, \"transformer-ffn-depth\", \"transformer-aan-depth\")\n    check_equal(marian_cfg, \"transformer-dim-ffn\", \"transformer-dim-aan\")\n\n\nBIAS_KEY = \"decoder_ff_logit_out_b\"\nBART_CONVERTER = {  # for each encoder and decoder layer\n    \"self_Wq\": \"self_attn.q_proj.weight\",\n    \"self_Wk\": \"self_attn.k_proj.weight\",\n    \"self_Wv\": \"self_attn.v_proj.weight\",\n    \"self_Wo\": \"self_attn.out_proj.weight\",\n    \"self_bq\": \"self_attn.q_proj.bias\",\n    \"self_bk\": \"self_attn.k_proj.bias\",\n    \"self_bv\": \"self_attn.v_proj.bias\",\n    \"self_bo\": \"self_attn.out_proj.bias\",\n    \"self_Wo_ln_scale\": \"self_attn_layer_norm.weight\",\n  
  \"self_Wo_ln_bias\": \"self_attn_layer_norm.bias\",\n    \"ffn_W1\": \"fc1.weight\",\n    \"ffn_b1\": \"fc1.bias\",\n    \"ffn_W2\": \"fc2.weight\",\n    \"ffn_b2\": \"fc2.bias\",\n    \"ffn_ffn_ln_scale\": \"final_layer_norm.weight\",\n    \"ffn_ffn_ln_bias\": \"final_layer_norm.bias\",\n    # Decoder Cross Attention\n    \"context_Wk\": \"encoder_attn.k_proj.weight\",\n    \"context_Wo\": \"encoder_attn.out_proj.weight\",\n    \"context_Wq\": \"encoder_attn.q_proj.weight\",\n    \"context_Wv\": \"encoder_attn.v_proj.weight\",\n    \"context_bk\": \"encoder_attn.k_proj.bias\",\n    \"context_bo\": \"encoder_attn.out_proj.bias\",\n    \"context_bq\": \"encoder_attn.q_proj.bias\",\n    \"context_bv\": \"encoder_attn.v_proj.bias\",\n    \"context_Wo_ln_scale\": \"encoder_attn_layer_norm.weight\",\n    \"context_Wo_ln_bias\": \"encoder_attn_layer_norm.bias\",\n}\n\n\nclass OpusState:\n    def __init__(self, source_dir):\n        npz_path = find_model_file(source_dir)\n        self.state_dict = np.load(npz_path)\n        cfg = load_config_from_state_dict(self.state_dict)\n        assert cfg[\"dim-vocabs\"][0] == cfg[\"dim-vocabs\"][1]\n        assert \"Wpos\" not in self.state_dict\n        self.state_dict = dict(self.state_dict)\n        self.wemb, self.final_bias = add_emb_entries(self.state_dict[\"Wemb\"], self.state_dict[BIAS_KEY], 1)\n        self.pad_token_id = self.wemb.shape[0] - 1\n        cfg[\"vocab_size\"] = self.pad_token_id + 1\n        # self.state_dict['Wemb'].sha\n        self.state_keys = list(self.state_dict.keys())\n        if \"Wtype\" in self.state_dict:\n            raise ValueError(\"found Wtype key\")\n        self._check_layer_entries()\n        self.source_dir = source_dir\n        self.cfg = cfg\n        hidden_size, intermediate_shape = self.state_dict[\"encoder_l1_ffn_W1\"].shape\n        assert hidden_size == cfg[\"dim-emb\"] == 512\n\n        # Process decoder.yml\n        decoder_yml = cast_marian_config(load_yaml(source_dir / \"decoder.yml\"))\n        check_marian_cfg_assumptions(cfg)\n        self.hf_config = MarianConfig(\n            vocab_size=cfg[\"vocab_size\"],\n            decoder_layers=cfg[\"dec-depth\"],\n            encoder_layers=cfg[\"enc-depth\"],\n            decoder_attention_heads=cfg[\"transformer-heads\"],\n            encoder_attention_heads=cfg[\"transformer-heads\"],\n            decoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            encoder_ffn_dim=cfg[\"transformer-dim-ffn\"],\n            d_model=cfg[\"dim-emb\"],\n            activation_function=cfg[\"transformer-aan-activation\"],\n            pad_token_id=self.pad_token_id,\n            eos_token_id=0,\n            bos_token_id=0,\n            max_position_embeddings=cfg[\"dim-emb\"],\n            scale_embedding=True,\n            normalize_embedding=\"n\" in cfg[\"transformer-preprocess\"],\n            static_position_embeddings=not cfg[\"transformer-train-position-embeddings\"],\n            dropout=0.1,  # see opus-mt-train repo/transformer-dropout param.\n            # default: add_final_layer_norm=False,\n            num_beams=decoder_yml[\"beam-size\"],\n            decoder_start_token_id=self.pad_token_id,\n            bad_words_ids=[[self.pad_token_id]],\n            max_length=512,\n        )\n\n    def _check_layer_entries(self):\n        self.encoder_l1 = self.sub_keys(\"encoder_l1\")\n        self.decoder_l1 = self.sub_keys(\"decoder_l1\")\n        self.decoder_l2 = self.sub_keys(\"decoder_l2\")\n        if len(self.encoder_l1) != 16:\n            
warnings.warn(f\"Expected 16 keys for each encoder layer, got {len(self.encoder_l1)}\")\n        if len(self.decoder_l1) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n        if len(self.decoder_l2) != 26:\n            warnings.warn(f\"Expected 26 keys for each decoder layer, got {len(self.decoder_l1)}\")\n\n    @property\n    def extra_keys(self):\n        extra = []\n        for k in self.state_keys:\n            if (\n                k.startswith(\"encoder_l\")\n                or k.startswith(\"decoder_l\")\n                or k in [CONFIG_KEY, \"Wemb\", \"Wpos\", \"decoder_ff_logit_out_b\"]\n            ):\n                continue\n            else:\n                extra.append(k)\n        return extra\n\n    def sub_keys(self, layer_prefix):\n        return [remove_prefix(k, layer_prefix) for k in self.state_dict if k.startswith(layer_prefix)]\n\n    def load_marian_model(self) -> MarianMTModel:\n        state_dict, cfg = self.state_dict, self.hf_config\n\n        assert cfg.static_position_embeddings\n        model = MarianMTModel(cfg)\n\n        assert \"hidden_size\" not in cfg.to_dict()\n        load_layers_(\n            model.model.encoder.layers, state_dict, BART_CONVERTER,\n        )\n        load_layers_(model.model.decoder.layers, state_dict, BART_CONVERTER, is_decoder=True)\n\n        # handle tensors not associated with layers\n        wemb_tensor = torch.nn.Parameter(torch.FloatTensor(self.wemb))\n        bias_tensor = torch.nn.Parameter(torch.FloatTensor(self.final_bias))\n        model.model.shared.weight = wemb_tensor\n        model.model.encoder.embed_tokens = model.model.decoder.embed_tokens = model.model.shared\n\n        model.final_logits_bias = bias_tensor\n\n        if \"Wpos\" in state_dict:\n            print(\"Unexpected: got Wpos\")\n            wpos_tensor = torch.tensor(state_dict[\"Wpos\"])\n            model.model.encoder.embed_positions.weight = wpos_tensor\n            model.model.decoder.embed_positions.weight = wpos_tensor\n\n        if cfg.normalize_embedding:\n            assert \"encoder_emb_ln_scale_pre\" in state_dict\n            raise NotImplementedError(\"Need to convert layernorm_embedding\")\n\n        assert not self.extra_keys, f\"Failed to convert {self.extra_keys}\"\n        assert model.model.shared.padding_idx == self.pad_token_id\n        return model\n\n\ndef download_and_unzip(url, dest_dir):\n    try:\n        import wget\n    except ImportError:\n        raise ImportError(\"you must pip install wget\")\n\n    filename = wget.download(url)\n    unzip(filename, dest_dir)\n    os.remove(filename)\n\n\ndef convert(source_dir: Path, dest_dir):\n    dest_dir = Path(dest_dir)\n    dest_dir.mkdir(exist_ok=True)\n\n    add_special_tokens_to_vocab(source_dir)\n    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))\n    save_tokenizer(tokenizer, dest_dir)\n\n    opus_state = OpusState(source_dir)\n    assert opus_state.cfg[\"vocab_size\"] == len(tokenizer.encoder)\n    # save_json(opus_state.cfg, dest_dir / \"marian_original_config.json\")\n    # ^^ Save human readable marian config for debugging\n\n    model = opus_state.load_marian_model()\n    model.save_pretrained(dest_dir)\n    model.from_pretrained(dest_dir)  # sanity check\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\"--src\", type=str, help=\"path to marian model dir\", default=\"en-de\")\n    parser.add_argument(\"--dest\", 
type=str, default=None, help=\"Path to the output PyTorch model.\")\n    args = parser.parse_args()\n\n    source_dir = Path(args.src)\n    assert source_dir.exists()\n    dest_dir = f\"converted-{source_dir.name}\" if args.dest is None else args.dest\n    convert(source_dir, dest_dir)\n\n\ndef load_yaml(path):\n    import yaml\n\n    with open(path) as f:\n        return yaml.load(f, Loader=yaml.BaseLoader)\n\n\ndef save_json(content: Union[Dict, List], path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(content, f)\n\n\ndef unzip(zip_path: str, dest_dir: str) -> None:\n    with ZipFile(zip_path, \"r\") as zipObj:\n        zipObj.extractall(dest_dir)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_openai_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME, OpenAIGPTConfig, OpenAIGPTModel, load_tf_weights_in_openai_gpt\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_config_file, pytorch_dump_folder_path):\n    # Construct model\n    if openai_config_file == \"\":\n        config = OpenAIGPTConfig()\n    else:\n        config = OpenAIGPTConfig.from_json_file(openai_config_file)\n    model = OpenAIGPTModel(config)\n\n    # Load weights from numpy\n    load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--openai_checkpoint_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the TensorFlow checkpoint path.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--openai_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained OpenAI model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    args = parser.parse_args()\n    convert_openai_checkpoint_to_pytorch(\n        args.openai_checkpoint_folder_path, args.openai_config_file, args.pytorch_dump_folder_path\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_pytorch_checkpoint_to_tf2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Convert pytorch checkpoints to TensorFlow \"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nfrom transformers import (\n    ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    T5_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    WEIGHTS_NAME,\n    XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    AlbertConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TFAlbertForPreTraining,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFCamembertForMaskedLM,\n    TFCTRLLMHeadModel,\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFElectraForPreTraining,\n    TFFlaubertWithLMHeadModel,\n    TFGPT2LMHeadModel,\n    TFOpenAIGPTLMHeadModel,\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFT5ForConditionalGeneration,\n    TFTransfoXLLMHeadModel,\n    TFXLMRobertaForMaskedLM,\n    TFXLMWithLMHeadModel,\n    TFXLNetLMHeadModel,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n    cached_path,\n    hf_bucket_url,\n    is_torch_available,\n    load_pytorch_checkpoint_in_tf2_model,\n)\n\n\nif is_torch_available():\n    import torch\n    import numpy as np\n    from transformers import (\n        BertForPreTraining,\n        BertForQuestionAnswering,\n        BertForSequenceClassification,\n        GPT2LMHeadModel,\n        XLNetLMHeadModel,\n        XLMWithLMHeadModel,\n        XLMRobertaForMaskedLM,\n        TransfoXLLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        RobertaForMaskedLM,\n        RobertaForSequenceClassification,\n        CamembertForMaskedLM,\n        FlaubertWithLMHeadModel,\n        DistilBertForMaskedLM,\n        DistilBertForQuestionAnswering,\n        CTRLLMHeadModel,\n        AlbertForPreTraining,\n        T5ForConditionalGeneration,\n        ElectraForPreTraining,\n    )\n\n\nlogging.basicConfig(level=logging.INFO)\n\nMODEL_CLASSES = {\n    \"bert\": (BertConfig, TFBertForPreTraining, BertForPreTraining, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    
\"bert-large-cased-whole-word-masking-finetuned-squad\": (\n        BertConfig,\n        TFBertForQuestionAnswering,\n        BertForQuestionAnswering,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"bert-base-cased-finetuned-mrpc\": (\n        BertConfig,\n        TFBertForSequenceClassification,\n        BertForSequenceClassification,\n        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"gpt2\": (GPT2Config, TFGPT2LMHeadModel, GPT2LMHeadModel, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlnet\": (XLNetConfig, TFXLNetLMHeadModel, XLNetLMHeadModel, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm\": (XLMConfig, TFXLMWithLMHeadModel, XLMWithLMHeadModel, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"xlm-roberta\": (\n        XLMRobertaConfig,\n        TFXLMRobertaForMaskedLM,\n        XLMRobertaForMaskedLM,\n        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"transfo-xl\": (\n        TransfoXLConfig,\n        TFTransfoXLLMHeadModel,\n        TransfoXLLMHeadModel,\n        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"openai-gpt\": (\n        OpenAIGPTConfig,\n        TFOpenAIGPTLMHeadModel,\n        OpenAIGPTLMHeadModel,\n        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"roberta\": (RobertaConfig, TFRobertaForMaskedLM, RobertaForMaskedLM, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"roberta-large-mnli\": (\n        RobertaConfig,\n        TFRobertaForSequenceClassification,\n        RobertaForSequenceClassification,\n        ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"camembert\": (\n        CamembertConfig,\n        TFCamembertForMaskedLM,\n        CamembertForMaskedLM,\n        CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"flaubert\": (\n        FlaubertConfig,\n        TFFlaubertWithLMHeadModel,\n        FlaubertWithLMHeadModel,\n        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert\": (\n        DistilBertConfig,\n        TFDistilBertForMaskedLM,\n        DistilBertForMaskedLM,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"distilbert-base-distilled-squad\": (\n        DistilBertConfig,\n        TFDistilBertForQuestionAnswering,\n        DistilBertForQuestionAnswering,\n        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,\n    ),\n    \"ctrl\": (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"albert\": (AlbertConfig, TFAlbertForPreTraining, AlbertForPreTraining, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"t5\": (T5Config, TFT5ForConditionalGeneration, T5ForConditionalGeneration, T5_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n    \"electra\": (ElectraConfig, TFElectraForPreTraining, ElectraForPreTraining, ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,),\n}\n\n\ndef convert_pt_checkpoint_to_tf(\n    model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True\n):\n    if model_type not in MODEL_CLASSES:\n        raise ValueError(\"Unrecognized model type, should be one of {}.\".format(list(MODEL_CLASSES.keys())))\n\n    config_class, model_class, pt_model_class, aws_config_map = MODEL_CLASSES[model_type]\n\n    # Initialise TF model\n    if config_file in aws_config_map:\n        config_file = cached_path(aws_config_map[config_file], force_download=not use_cached_models)\n    config = config_class.from_json_file(config_file)\n    config.output_hidden_states = True\n    config.output_attentions = True\n    print(\"Building TensorFlow model from configuration: {}\".format(str(config)))\n    
tf_model = model_class(config)\n\n    # Load weights from tf checkpoint\n    if pytorch_checkpoint_path in aws_config_map.keys():\n        pytorch_checkpoint_url = hf_bucket_url(pytorch_checkpoint_path, filename=WEIGHTS_NAME)\n        pytorch_checkpoint_path = cached_path(pytorch_checkpoint_url, force_download=not use_cached_models)\n    # Load PyTorch checkpoint in tf2 model:\n    tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)\n\n    if compare_with_pt_model:\n        tfo = tf_model(tf_model.dummy_inputs, training=False)  # build the network\n\n        state_dict = torch.load(pytorch_checkpoint_path, map_location=\"cpu\")\n        pt_model = pt_model_class.from_pretrained(\n            pretrained_model_name_or_path=None, config=config, state_dict=state_dict\n        )\n\n        with torch.no_grad():\n            pto = pt_model(**pt_model.dummy_inputs)\n\n        np_pt = pto[0].numpy()\n        np_tf = tfo[0].numpy()\n        diff = np.amax(np.abs(np_pt - np_tf))\n        print(\"Max absolute difference between models outputs {}\".format(diff))\n        assert diff <= 2e-2, \"Error, model absolute difference is >2e-2: {}\".format(diff)\n\n    # Save pytorch-model\n    print(\"Save TensorFlow model to {}\".format(tf_dump_path))\n    tf_model.save_weights(tf_dump_path, save_format=\"h5\")\n\n\ndef convert_all_pt_checkpoints_to_tf(\n    args_model_type,\n    tf_dump_path,\n    model_shortcut_names_or_path=None,\n    config_shortcut_names_or_path=None,\n    compare_with_pt_model=False,\n    use_cached_models=False,\n    remove_cached_files=False,\n    only_convert_finetuned_models=False,\n):\n    assert os.path.isdir(args.tf_dump_path), \"--tf_dump_path should be a directory\"\n\n    if args_model_type is None:\n        model_types = list(MODEL_CLASSES.keys())\n    else:\n        model_types = [args_model_type]\n\n    for j, model_type in enumerate(model_types, start=1):\n        print(\"=\" * 100)\n        print(\" Converting model type {}/{}: {}\".format(j, len(model_types), model_type))\n        print(\"=\" * 100)\n        if model_type not in MODEL_CLASSES:\n            raise ValueError(\n                \"Unrecognized model type {}, should be one of {}.\".format(model_type, list(MODEL_CLASSES.keys()))\n            )\n\n        config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]\n\n        if model_shortcut_names_or_path is None:\n            model_shortcut_names_or_path = list(aws_model_maps.keys())\n        if config_shortcut_names_or_path is None:\n            config_shortcut_names_or_path = model_shortcut_names_or_path\n\n        for i, (model_shortcut_name, config_shortcut_name) in enumerate(\n            zip(model_shortcut_names_or_path, config_shortcut_names_or_path), start=1\n        ):\n            print(\"-\" * 100)\n            if \"-squad\" in model_shortcut_name or \"-mrpc\" in model_shortcut_name or \"-mnli\" in model_shortcut_name:\n                if not only_convert_finetuned_models:\n                    print(\"    Skipping finetuned checkpoint {}\".format(model_shortcut_name))\n                    continue\n                model_type = model_shortcut_name\n            elif only_convert_finetuned_models:\n                print(\"    Skipping not finetuned checkpoint {}\".format(model_shortcut_name))\n                continue\n            print(\n                \"    Converting checkpoint {}/{}: {} - model_type {}\".format(\n                    i, len(aws_config_map), 
model_shortcut_name, model_type\n                )\n            )\n            print(\"-\" * 100)\n\n            if config_shortcut_name in aws_config_map:\n                config_file = cached_path(aws_config_map[config_shortcut_name], force_download=not use_cached_models)\n            else:\n                config_file = cached_path(config_shortcut_name, force_download=not use_cached_models)\n\n            if model_shortcut_name in aws_model_maps:\n                model_file = cached_path(aws_model_maps[model_shortcut_name], force_download=not use_cached_models)\n            else:\n                model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)\n\n            if os.path.isfile(model_shortcut_name):\n                model_shortcut_name = \"converted_model\"\n\n            convert_pt_checkpoint_to_tf(\n                model_type=model_type,\n                pytorch_checkpoint_path=model_file,\n                config_file=config_file,\n                tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + \"-tf_model.h5\"),\n                compare_with_pt_model=compare_with_pt_model,\n            )\n            if remove_cached_files:\n                os.remove(config_file)\n                os.remove(model_file)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_dump_path\", default=None, type=str, required=True, help=\"Path to the output Tensorflow dump file.\"\n    )\n    parser.add_argument(\n        \"--model_type\",\n        default=None,\n        type=str,\n        help=\"Model type selected in the list of {}. If not given, will download and convert all the models from AWS.\".format(\n            list(MODEL_CLASSES.keys())\n        ),\n    )\n    parser.add_argument(\n        \"--pytorch_checkpoint_path\",\n        default=None,\n        type=str,\n        help=\"Path to the PyTorch checkpoint path or shortcut name to download from AWS. \"\n        \"If not given, will download and convert all the checkpoints from AWS.\",\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        help=\"The config json file corresponding to the pre-trained model. \\n\"\n        \"This specifies the model architecture. 
If not given and \"\n        \"--pytorch_checkpoint_path is not given or is a shortcut name\"\n        \"use the configuration associated to the shortcut name on the AWS\",\n    )\n    parser.add_argument(\n        \"--compare_with_pt_model\", action=\"store_true\", help=\"Compare Tensorflow and PyTorch model predictions.\"\n    )\n    parser.add_argument(\n        \"--use_cached_models\",\n        action=\"store_true\",\n        help=\"Use cached models if possible instead of updating to latest checkpoint versions.\",\n    )\n    parser.add_argument(\n        \"--remove_cached_files\",\n        action=\"store_true\",\n        help=\"Remove pytorch models after conversion (save memory when converting in batches).\",\n    )\n    parser.add_argument(\"--only_convert_finetuned_models\", action=\"store_true\", help=\"Only convert finetuned models.\")\n    args = parser.parse_args()\n\n    # if args.pytorch_checkpoint_path is not None:\n    #     convert_pt_checkpoint_to_tf(args.model_type.lower(),\n    #                                 args.pytorch_checkpoint_path,\n    #                                 args.config_file if args.config_file is not None else args.pytorch_checkpoint_path,\n    #                                 args.tf_dump_path,\n    #                                 compare_with_pt_model=args.compare_with_pt_model,\n    #                                 use_cached_models=args.use_cached_models)\n    # else:\n    convert_all_pt_checkpoints_to_tf(\n        args.model_type.lower() if args.model_type is not None else None,\n        args.tf_dump_path,\n        model_shortcut_names_or_path=[args.pytorch_checkpoint_path]\n        if args.pytorch_checkpoint_path is not None\n        else None,\n        config_shortcut_names_or_path=[args.config_file] if args.config_file is not None else None,\n        compare_with_pt_model=args.compare_with_pt_model,\n        use_cached_models=args.use_cached_models,\n        remove_cached_files=args.remove_cached_files,\n        only_convert_finetuned_models=args.only_convert_finetuned_models,\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_reformer_trax_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Reformer checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pickle\n\nimport numpy as np\nimport torch\n\nfrom transformers import ReformerConfig, ReformerModelWithLMHead\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef set_param(torch_layer, weight, bias=None):\n    # set parameter of one layer\n    assert torch_layer.weight.shape == weight.shape, \"{} layer.weight does not match\".format(torch_layer)\n    torch_layer.weight = torch.nn.Parameter(weight)\n    if bias is not None:\n        assert torch_layer.bias.shape == bias.shape, \"{} layer.bias does not match\".format(torch_layer)\n        torch_layer.bias = torch.nn.Parameter(bias)\n\n\ndef set_layer_weights_in_torch_lsh(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query_key = np.asarray(weights[0])\n    np_value = np.asarray(weights[1])\n    np_dense = np.asarray(weights[2])\n\n    set_param(\n        torch_layer.self_attention.query_key,\n        torch.tensor(np_query_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_layer_weights_in_torch_local(weights, torch_layer, hidden_size):\n    # set torch weights for 1-to-1 comparison\n    np_query = np.asarray(weights[0])\n    np_key = np.asarray(weights[1])\n    np_value = np.asarray(weights[2])\n    np_dense = np.asarray(weights[3])\n\n    set_param(\n        torch_layer.self_attention.query, torch.tensor(np_query).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.key, torch.tensor(np_key).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.self_attention.value, torch.tensor(np_value).transpose(1, 2).contiguous().view(-1, hidden_size),\n    )\n    set_param(\n        torch_layer.output.dense, torch.tensor(np_dense).view(-1, hidden_size).contiguous().transpose(0, 1),\n    )\n\n\ndef set_block_weights_in_torch(weights, torch_block, hidden_size):\n    # layernorm 1\n    layer_norm_1 = weights[0][0][0]\n    layer_norm_1_weight = np.asarray(layer_norm_1[0])\n    layer_norm_1_bias = np.asarray(layer_norm_1[1])\n    set_param(\n        torch_block.attention.layer_norm, torch.tensor(layer_norm_1_weight), torch.tensor(layer_norm_1_bias),\n    )\n\n    # lsh weights + output\n    attn_weights = weights[0][1]\n    if len(attn_weights) < 4:\n        set_layer_weights_in_torch_lsh(attn_weights, torch_block.attention, hidden_size)\n    else:\n        set_layer_weights_in_torch_local(attn_weights, torch_block.attention, hidden_size)\n\n    # intermediate weighs\n    intermediate_weights = 
weights[2][0][1][2]\n\n    # Chunked Feed Forward\n    if len(intermediate_weights) == 4:\n        intermediate_weights = intermediate_weights[2]\n\n    # layernorm 2\n    layer_norm_2_weight = np.asarray(intermediate_weights[0][0])\n    layer_norm_2_bias = np.asarray(intermediate_weights[0][1])\n    set_param(\n        torch_block.feed_forward.layer_norm, torch.tensor(layer_norm_2_weight), torch.tensor(layer_norm_2_bias),\n    )\n\n    # intermediate dense\n    inter_dense_weight = np.asarray(intermediate_weights[1][0])\n    inter_dense_bias = np.asarray(intermediate_weights[1][1])\n    set_param(\n        torch_block.feed_forward.dense.dense,\n        torch.tensor(inter_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(inter_dense_bias),\n    )\n\n    # intermediate out\n    out_dense_weight = np.asarray(intermediate_weights[4][0])\n    out_dense_bias = np.asarray(intermediate_weights[4][1])\n    set_param(\n        torch_block.feed_forward.output.dense,\n        torch.tensor(out_dense_weight).transpose(0, 1).contiguous(),\n        torch.tensor(out_dense_bias),\n    )\n\n\ndef set_model_weights_in_torch(weights, torch_model, hidden_size):\n    # reformer model\n    torch_model_reformer = torch_model.reformer\n\n    # word embeds\n    word_embeddings = np.asarray(weights[1])\n    set_param(\n        torch_model_reformer.embeddings.word_embeddings, torch.tensor(word_embeddings),\n    )\n\n    if isinstance(weights[3], tuple):\n        position_embeddings = torch_model_reformer.embeddings.position_embeddings\n        for emb_idx in range(len(position_embeddings.weights)):\n            emb_weights = np.asarray(weights[3][emb_idx][0])\n            assert position_embeddings.weights[emb_idx].shape == emb_weights.shape, \"{} emb does not match\".format(\n                position_embeddings[emb_idx]\n            )\n            position_embeddings.weights[emb_idx] = torch.nn.Parameter(torch.tensor(emb_weights))\n\n    trax_layer_weights = weights[5]\n    assert len(torch_model_reformer.encoder.layers) * 4 == len(\n        trax_layer_weights\n    ), \"HF and trax model do not have the same number of layers\"\n    for layer_idx, layer in enumerate(torch_model_reformer.encoder.layers):\n        block_weights = trax_layer_weights[4 * layer_idx : 4 * (layer_idx + 1)]\n        set_block_weights_in_torch(block_weights, layer, hidden_size)\n\n    # output layer norm\n    layer_norm_out_weight = np.asarray(weights[7][0])\n    layer_norm_out_bias = np.asarray(weights[7][1])\n    set_param(\n        torch_model_reformer.encoder.layer_norm,\n        torch.tensor(layer_norm_out_weight),\n        torch.tensor(layer_norm_out_bias),\n    )\n\n    # output embeddings\n    output_embed_weights = np.asarray(weights[9][0])\n    output_embed_bias = np.asarray(weights[9][1])\n    set_param(\n        torch_model.lm_head.decoder,\n        torch.tensor(output_embed_weights).transpose(0, 1).contiguous(),\n        torch.tensor(output_embed_bias),\n    )\n\n\ndef convert_trax_checkpoint_to_pytorch(trax_model_pkl_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = ReformerConfig.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = ReformerModelWithLMHead(config)\n\n    with open(trax_model_pkl_path, \"rb\") as f:\n        model_weights = pickle.load(f)[\"weights\"]\n\n    set_model_weights_in_torch(model_weights, model, config.hidden_size)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to 
{}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--trax_model_pkl_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained Reformer model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_trax_checkpoint_to_pytorch(args.trax_model_pkl_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_roberta_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert RoBERTa checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport pathlib\n\nimport fairseq\nimport torch\nfrom fairseq.models.roberta import RobertaModel as FairseqRobertaModel\nfrom fairseq.modules import TransformerSentenceEncoderLayer\nfrom packaging import version\n\nfrom transformers.modeling_bert import BertIntermediate, BertLayer, BertOutput, BertSelfAttention, BertSelfOutput\nfrom transformers.modeling_roberta import RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification\n\n\nif version.parse(fairseq.__version__) < version.parse(\"0.9.0\"):\n    raise Exception(\"requires fairseq >= 0.9.0\")\n\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\nSAMPLE_TEXT = \"Hello world! cécé herlolip\"\n\n\ndef convert_roberta_checkpoint_to_pytorch(\n    roberta_checkpoint_path: str, pytorch_dump_folder_path: str, classification_head: bool\n):\n    \"\"\"\n    Copy/paste/tweak roberta's weights to our BERT structure.\n    \"\"\"\n    roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)\n    roberta.eval()  # disable dropout\n    roberta_sent_encoder = roberta.model.decoder.sentence_encoder\n    config = RobertaConfig(\n        vocab_size=roberta_sent_encoder.embed_tokens.num_embeddings,\n        hidden_size=roberta.args.encoder_embed_dim,\n        num_hidden_layers=roberta.args.encoder_layers,\n        num_attention_heads=roberta.args.encoder_attention_heads,\n        intermediate_size=roberta.args.encoder_ffn_embed_dim,\n        max_position_embeddings=514,\n        type_vocab_size=1,\n        layer_norm_eps=1e-5,  # PyTorch default used in fairseq\n    )\n    if classification_head:\n        config.num_labels = roberta.args.num_classes\n    print(\"Our BERT config:\", config)\n\n    model = RobertaForSequenceClassification(config) if classification_head else RobertaForMaskedLM(config)\n    model.eval()\n\n    # Now let's copy all the weights.\n    # Embeddings\n    model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight\n    model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight\n    model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(\n        model.roberta.embeddings.token_type_embeddings.weight\n    )  # just zero them out b/c RoBERTa doesn't use them.\n    model.roberta.embeddings.LayerNorm.weight = roberta_sent_encoder.emb_layer_norm.weight\n    model.roberta.embeddings.LayerNorm.bias = roberta_sent_encoder.emb_layer_norm.bias\n\n    for i in range(config.num_hidden_layers):\n        # Encoder: start of layer\n        layer: BertLayer = model.roberta.encoder.layer[i]\n        roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i]\n\n        # self attention\n        self_attn: BertSelfAttention = layer.attention.self\n        assert (\n            
roberta_layer.self_attn.k_proj.weight.data.shape\n            == roberta_layer.self_attn.q_proj.weight.data.shape\n            == roberta_layer.self_attn.v_proj.weight.data.shape\n            == torch.Size((config.hidden_size, config.hidden_size))\n        )\n\n        self_attn.query.weight.data = roberta_layer.self_attn.q_proj.weight\n        self_attn.query.bias.data = roberta_layer.self_attn.q_proj.bias\n        self_attn.key.weight.data = roberta_layer.self_attn.k_proj.weight\n        self_attn.key.bias.data = roberta_layer.self_attn.k_proj.bias\n        self_attn.value.weight.data = roberta_layer.self_attn.v_proj.weight\n        self_attn.value.bias.data = roberta_layer.self_attn.v_proj.bias\n\n        # self-attention output\n        self_output: BertSelfOutput = layer.attention.output\n        assert self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape\n        self_output.dense.weight = roberta_layer.self_attn.out_proj.weight\n        self_output.dense.bias = roberta_layer.self_attn.out_proj.bias\n        self_output.LayerNorm.weight = roberta_layer.self_attn_layer_norm.weight\n        self_output.LayerNorm.bias = roberta_layer.self_attn_layer_norm.bias\n\n        # intermediate\n        intermediate: BertIntermediate = layer.intermediate\n        assert intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape\n        intermediate.dense.weight = roberta_layer.fc1.weight\n        intermediate.dense.bias = roberta_layer.fc1.bias\n\n        # output\n        bert_output: BertOutput = layer.output\n        assert bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape\n        bert_output.dense.weight = roberta_layer.fc2.weight\n        bert_output.dense.bias = roberta_layer.fc2.bias\n        bert_output.LayerNorm.weight = roberta_layer.final_layer_norm.weight\n        bert_output.LayerNorm.bias = roberta_layer.final_layer_norm.bias\n        # end of layer\n\n    if classification_head:\n        model.classifier.dense.weight = roberta.model.classification_heads[\"mnli\"].dense.weight\n        model.classifier.dense.bias = roberta.model.classification_heads[\"mnli\"].dense.bias\n        model.classifier.out_proj.weight = roberta.model.classification_heads[\"mnli\"].out_proj.weight\n        model.classifier.out_proj.bias = roberta.model.classification_heads[\"mnli\"].out_proj.bias\n    else:\n        # LM Head\n        model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight\n        model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias\n        model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight\n        model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias\n        model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight\n        model.lm_head.decoder.bias = roberta.model.decoder.lm_head.bias\n\n    # Let's check that we get the same results.\n    input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0)  # batch of size 1\n\n    our_output = model(input_ids)[0]\n    if classification_head:\n        their_output = roberta.model.classification_heads[\"mnli\"](roberta.extract_features(input_ids))\n    else:\n        their_output = roberta.model(input_ids)[0]\n    print(our_output.shape, their_output.shape)\n    max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()\n    print(f\"max_absolute_diff = {max_absolute_diff}\")  # ~ 1e-7\n    success = torch.allclose(our_output, their_output, atol=1e-3)\n    print(\"Do both 
models output the same tensors?\", \"🔥\" if success else \"💩\")\n    if not success:\n        raise Exception(\"Something went wRoNg\")\n\n    pathlib.Path(pytorch_dump_folder_path).mkdir(parents=True, exist_ok=True)\n    print(f\"Saving model to {pytorch_dump_folder_path}\")\n    model.save_pretrained(pytorch_dump_folder_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--roberta_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    parser.add_argument(\n        \"--classification_head\", action=\"store_true\", help=\"Whether to convert a final classification head.\"\n    )\n    args = parser.parse_args()\n    convert_roberta_checkpoint_to_pytorch(\n        args.roberta_checkpoint_path, args.pytorch_dump_folder_path, args.classification_head\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_t5_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The T5 authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert T5 checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\n\nimport torch\n\nfrom transformers import T5Config, T5Model, load_tf_weights_in_t5\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):\n    # Initialise PyTorch model\n    config = T5Config.from_json_file(config_file)\n    print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n    model = T5Model(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_t5(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    print(\"Save PyTorch model to {}\".format(pytorch_dump_path))\n    torch.save(model.state_dict(), pytorch_dump_path)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained T5 model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert Transformer XL checkpoint and datasets.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\nimport pickle\nimport sys\n\nimport torch\n\nimport transformers.tokenization_transfo_xl as data_utils\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    TransfoXLConfig,\n    TransfoXLLMHeadModel,\n    load_tf_weights_in_transfo_xl,\n)\nfrom transformers.tokenization_transfo_xl import CORPUS_NAME, VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n# We do this to be able to load python 2 datasets pickles\n# See e.g. https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory/2121918#2121918\ndata_utils.Vocab = data_utils.TransfoXLTokenizer\ndata_utils.Corpus = data_utils.TransfoXLCorpus\nsys.modules[\"data_utils\"] = data_utils\nsys.modules[\"vocabulary\"] = data_utils\n\n\ndef convert_transfo_xl_checkpoint_to_pytorch(\n    tf_checkpoint_path, transfo_xl_config_file, pytorch_dump_folder_path, transfo_xl_dataset_file\n):\n    if transfo_xl_dataset_file:\n        # Convert a pre-processed corpus (see original TensorFlow repo)\n        with open(transfo_xl_dataset_file, \"rb\") as fp:\n            corpus = pickle.load(fp, encoding=\"latin1\")\n        # Save vocabulary and dataset cache as Dictionaries (should be better than pickles for the long-term)\n        pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"pretrained_vocab_file\"]\n        print(\"Save vocabulary to {}\".format(pytorch_vocab_dump_path))\n        corpus_vocab_dict = corpus.vocab.__dict__\n        torch.save(corpus_vocab_dict, pytorch_vocab_dump_path)\n\n        corpus_dict_no_vocab = corpus.__dict__\n        corpus_dict_no_vocab.pop(\"vocab\", None)\n        pytorch_dataset_dump_path = pytorch_dump_folder_path + \"/\" + CORPUS_NAME\n        print(\"Save dataset to {}\".format(pytorch_dataset_dump_path))\n        torch.save(corpus_dict_no_vocab, pytorch_dataset_dump_path)\n\n    if tf_checkpoint_path:\n        # Convert a pre-trained TensorFlow model\n        config_path = os.path.abspath(transfo_xl_config_file)\n        tf_path = os.path.abspath(tf_checkpoint_path)\n\n        print(\"Converting Transformer XL checkpoint from {} with config at {}\".format(tf_path, config_path))\n        # Initialise PyTorch model\n        if transfo_xl_config_file == \"\":\n            config = TransfoXLConfig()\n        else:\n            config = TransfoXLConfig.from_json_file(transfo_xl_config_file)\n        print(\"Building PyTorch model from configuration: {}\".format(str(config)))\n        model = TransfoXLLMHeadModel(config)\n\n        model = load_tf_weights_in_transfo_xl(model, config, tf_path)\n        # Save pytorch-model\n        pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n        pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n 
       print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n        torch.save(model.state_dict(), pytorch_weights_dump_path)\n        print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n        with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n            f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--tf_checkpoint_path\",\n        default=\"\",\n        type=str,\n        help=\"An optional path to a TensorFlow checkpoint path to be converted.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_config_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional config json file corresponding to the pre-trained BERT model. \\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--transfo_xl_dataset_file\",\n        default=\"\",\n        type=str,\n        help=\"An optional dataset file to be converted in a vocabulary.\",\n    )\n    args = parser.parse_args()\n    convert_transfo_xl_checkpoint_to_pytorch(\n        args.tf_checkpoint_path,\n        args.transfo_xl_config_file,\n        args.pytorch_dump_folder_path,\n        args.transfo_xl_dataset_file,\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_xlm_original_pytorch_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert OpenAI GPT checkpoint.\"\"\"\n\n\nimport argparse\nimport json\nimport logging\n\nimport numpy\nimport torch\n\nfrom transformers import CONFIG_NAME, WEIGHTS_NAME\nfrom transformers.tokenization_xlm import VOCAB_FILES_NAMES\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlm_checkpoint_to_pytorch(xlm_checkpoint_path, pytorch_dump_folder_path):\n    # Load checkpoint\n    chkpt = torch.load(xlm_checkpoint_path, map_location=\"cpu\")\n\n    state_dict = chkpt[\"model\"]\n\n    # We have the base model one level deeper than the original XLM repository\n    two_levels_state_dict = {}\n    for k, v in state_dict.items():\n        if \"pred_layer\" in k:\n            two_levels_state_dict[k] = v\n        else:\n            two_levels_state_dict[\"transformer.\" + k] = v\n\n    config = chkpt[\"params\"]\n    config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.FloatTensor, numpy.ndarray)))\n\n    vocab = chkpt[\"dico_word2id\"]\n    vocab = dict((s + \"</w>\" if s.find(\"@@\") == -1 and i > 13 else s.replace(\"@@\", \"\"), i) for s, i in vocab.items())\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = pytorch_dump_folder_path + \"/\" + WEIGHTS_NAME\n    pytorch_config_dump_path = pytorch_dump_folder_path + \"/\" + CONFIG_NAME\n    pytorch_vocab_dump_path = pytorch_dump_folder_path + \"/\" + VOCAB_FILES_NAMES[\"vocab_file\"]\n\n    print(\"Save PyTorch model to {}\".format(pytorch_weights_dump_path))\n    torch.save(two_levels_state_dict, pytorch_weights_dump_path)\n\n    print(\"Save configuration file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(config, indent=2) + \"\\n\")\n\n    print(\"Save vocab file to {}\".format(pytorch_config_dump_path))\n    with open(pytorch_vocab_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(json.dumps(vocab, indent=2) + \"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--xlm_checkpoint_path\", default=None, type=str, required=True, help=\"Path the official PyTorch dump.\"\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\", default=None, type=str, required=True, help=\"Path to the output PyTorch model.\"\n    )\n    args = parser.parse_args()\n    convert_xlm_checkpoint_to_pytorch(args.xlm_checkpoint_path, args.pytorch_dump_folder_path)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/convert_xlnet_original_tf_checkpoint_to_pytorch.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Convert BERT checkpoint.\"\"\"\n\n\nimport argparse\nimport logging\nimport os\n\nimport torch\n\nfrom transformers import (\n    CONFIG_NAME,\n    WEIGHTS_NAME,\n    XLNetConfig,\n    XLNetForQuestionAnswering,\n    XLNetForSequenceClassification,\n    XLNetLMHeadModel,\n    load_tf_weights_in_xlnet,\n)\n\n\nGLUE_TASKS_NUM_LABELS = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\n\nlogging.basicConfig(level=logging.INFO)\n\n\ndef convert_xlnet_checkpoint_to_pytorch(\n    tf_checkpoint_path, bert_config_file, pytorch_dump_folder_path, finetuning_task=None\n):\n    # Initialise PyTorch model\n    config = XLNetConfig.from_json_file(bert_config_file)\n\n    finetuning_task = finetuning_task.lower() if finetuning_task is not None else \"\"\n    if finetuning_task in GLUE_TASKS_NUM_LABELS:\n        print(\"Building PyTorch XLNetForSequenceClassification model from configuration: {}\".format(str(config)))\n        config.finetuning_task = finetuning_task\n        config.num_labels = GLUE_TASKS_NUM_LABELS[finetuning_task]\n        model = XLNetForSequenceClassification(config)\n    elif \"squad\" in finetuning_task:\n        config.finetuning_task = finetuning_task\n        model = XLNetForQuestionAnswering(config)\n    else:\n        model = XLNetLMHeadModel(config)\n\n    # Load weights from tf checkpoint\n    load_tf_weights_in_xlnet(model, config, tf_checkpoint_path)\n\n    # Save pytorch-model\n    pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)\n    pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)\n    print(\"Save PyTorch model to {}\".format(os.path.abspath(pytorch_weights_dump_path)))\n    torch.save(model.state_dict(), pytorch_weights_dump_path)\n    print(\"Save configuration file to {}\".format(os.path.abspath(pytorch_config_dump_path)))\n    with open(pytorch_config_dump_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(config.to_json_string())\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    # Required parameters\n    parser.add_argument(\n        \"--tf_checkpoint_path\", default=None, type=str, required=True, help=\"Path to the TensorFlow checkpoint path.\"\n    )\n    parser.add_argument(\n        \"--xlnet_config_file\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"The config json file corresponding to the pre-trained XLNet model. 
\\n\"\n        \"This specifies the model architecture.\",\n    )\n    parser.add_argument(\n        \"--pytorch_dump_folder_path\",\n        default=None,\n        type=str,\n        required=True,\n        help=\"Path to the folder to store the PyTorch model or dataset/vocab.\",\n    )\n    parser.add_argument(\n        \"--finetuning_task\",\n        default=None,\n        type=str,\n        help=\"Name of a task on which the XLNet TensorFloaw model was fine-tuned\",\n    )\n    args = parser.parse_args()\n    print(args)\n\n    convert_xlnet_checkpoint_to_pytorch(\n        args.tf_checkpoint_path, args.xlnet_config_file, args.pytorch_dump_folder_path, args.finetuning_task\n    )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .metrics import is_sklearn_available\nfrom .processors import (\n    DataProcessor,\n    InputExample,\n    InputFeatures,\n    SingleSentenceClassificationProcessor,\n    SquadExample,\n    SquadFeatures,\n    SquadV1Processor,\n    SquadV2Processor,\n    glue_convert_examples_to_features,\n    glue_output_modes,\n    glue_processors,\n    glue_tasks_num_labels,\n    squad_convert_examples_to_features,\n    xnli_output_modes,\n    xnli_processors,\n    xnli_tasks_num_labels,\n)\n\n\nif is_sklearn_available():\n    from .metrics import glue_compute_metrics, xnli_compute_metrics\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/data_collator.py",
    "content": "from abc import ABC, abstractmethod\nfrom dataclasses import dataclass\nfrom typing import Any, Dict, List, NewType, Tuple\n\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\nimport random\nimport numpy as np\nfrom ..tokenization_utils import PreTrainedTokenizer\n\n\nclass DataCollator(ABC):\n    \"\"\"\n    A `DataCollator` is responsible for batching\n    and pre-processing samples of data as requested by the training loop.\n    \"\"\"\n\n    @abstractmethod\n    def collate_batch(self) -> Dict[str, torch.Tensor]:\n        \"\"\"\n        Take a list of samples from a Dataset and collate them into a batch.\n\n        Returns:\n            A dictionary of tensors\n        \"\"\"\n        pass\n\n\nInputDataClass = NewType(\"InputDataClass\", Any)\n\n\n@dataclass\nclass DefaultDataCollator(DataCollator):\n    \"\"\"\n    Very simple data collator that:\n    - simply collates batches of dict-like objects\n    - Performs special handling for potential keys named:\n        - `label`: handles a single value (int or float) per object\n        - `label_ids`: handles a list of values per object\n    - does not do any additional preprocessing\n\n    i.e., Property names of the input object will be used as corresponding inputs to the model.\n    See glue and ner for example of how it's useful.\n    \"\"\"\n\n    def collate_batch(self, features: List[InputDataClass]) -> Dict[str, torch.Tensor]:\n        # In this method we'll make the assumption that all `features` in the batch\n        # have the same attributes.\n        # So we will look at the first element as a proxy for what attributes exist\n        # on the whole batch.\n        first = features[0]\n\n        # Special handling for labels.\n        # Ensure that tensor is created with the correct type\n        # (it should be automatically the case, but let's make sure of it.)\n        if hasattr(first, \"label\") and first.label is not None:\n            if type(first.label) is int:\n                labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        elif hasattr(first, \"label_ids\") and first.label_ids is not None:\n            if type(first.label_ids[0]) is int:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.long)\n            else:\n                labels = torch.tensor([f.label_ids for f in features], dtype=torch.float)\n            batch = {\"labels\": labels}\n        else:\n            batch = {}\n\n        # Handling of all other possible attributes.\n        # Again, we will use the first element to figure out which key/values are not None for this model.\n        for k, v in vars(first).items():\n            if k not in (\"label\", \"label_ids\") and v is not None and not isinstance(v, str):\n                batch[k] = torch.tensor([getattr(f, k) for f in features], dtype=torch.long)\n        return batch\n\n\n@dataclass\nclass DataCollatorForLanguageModeling(DataCollator):\n    \"\"\"\n    Data collator used for language modeling.\n    - collates batches of tensors, honoring their tokenizer's pad_token\n    - preprocesses batches for masked language modeling\n    \"\"\"\n\n    tokenizer: PreTrainedTokenizer\n    mlm: bool = True\n    mlm_probability: float = 0.15\n\n    def collate_batch(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:\n        batch = 
self._tensorize_batch(examples)\n        if self.mlm:\n            inputs, labels = self.mask_tokens7(batch)\n            return {\"input_ids\": inputs, \"labels\": labels}\n        else:\n            return {\"input_ids\": batch, \"labels\": batch}\n\n    def _tensorize_batch(self, examples: List[torch.Tensor]) -> torch.Tensor:\n        length_of_first = examples[0].size(0)\n        are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)\n        if are_tensors_same_length:\n            return torch.stack(examples, dim=0)\n        else:\n            if self.tokenizer._pad_token is None:\n                raise ValueError(\n                    \"You are attempting to pad samples but the tokenizer you are using\"\n                    f\" ({self.tokenizer.__class__.__name__}) does not have one.\"\n                )\n            return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)\n\n    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        masked_indices = torch.bernoulli(probability_matrix).bool()\n        labels[~masked_indices] = -100  # We only compute loss on masked tokens\n\n        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])\n        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices\n        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)\n\n        # 10% of the time, we replace masked input tokens with random word\n        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced\n        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)\n        inputs[indices_random] = random_words[indices_random]\n\n        # The rest of the time (10% of the time) we keep the masked input tokens unchanged\n        return inputs, labels\n\n    def mask_tokens2(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            inputs[i][j] = self.tokenizer.mask_token_id\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens3(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.85:\n                                for k in range(j,min(j+5,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.7647:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.5384:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            elif random.random() > 0.42857:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens4(self, inputs: torch.Tensor) -> 
Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        inputs = inputs.numpy()\n        ids = [i for i in range(len(inputs))]\n        random.shuffle(ids)\n        inputs = inputs[ids]\n        inputs = torch.from_numpy(inputs)\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        total_token = 0\n        for i in range(len(probability_matrix)):\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n\n        cur_token = 0\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        covered = set()\n        ngramFlag = True\n        for i in range(len(probability_matrix)):\n            if cur_token > total_token * 0.03:\n                ngramFlag = False\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15) and (i,j) not in covered:\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9 and ngramFlag:\n                                for k in range(j,min(j+4,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.222 and ngramFlag:\n                                for k in range(j,min(j+3,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857 and ngramFlag:\n                                for k in range(j,min(j+2,len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i,k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i,k))\n                                     
   cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i,j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            pass\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens5(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        pvals = [0.4, 0.3, 0.2, 0.1]\n        ngrams = np.arange(1, 5, dtype=np.int64)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            choose = random.randint(0, 1)\n            if choose == 0:\n                startIndex = 0\n                endIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n            elif choose == 1:\n                startIndex = np.argwhere(inputs[i] == np.float32(2))[-1][0]\n                endIndex = np.argwhere(inputs[i] == np.float32(3))[-1][0]\n\n            valid_j = [index for index in range(startIndex, endIndex + 1)]\n\n            for j in range(len(probability_matrix[0])):\n                if cur_token < total_token * 0.15:\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        n = np.random.choice(ngrams, p=pvals)\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if j+k in valid_j:\n                                if random.random() > 0.7:\n                                    if 
random.random() > 0.2:\n                                        if probability_matrix[i][j+k] == np.float32(0.15):\n                                            inputs[i][j+k] = self.tokenizer.mask_token_id\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    elif random.random() > 0.5:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                    else:\n                                        if probability_matrix[i][j + k] == np.float32(0.15):\n                                            covered.add((i, j + k))\n                                            cur_token += 1\n\n                                else:\n                                    labels[i][j] = np.float32(-100)\n                            else:\n                                labels[i][j] = np.float32(-100)\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n    def mask_tokens6(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. 
Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            for j in range(len(probability_matrix[0])):\n                if cur_token > total_token*0.15:\n                    break\n                if probability_matrix[i][j] == np.float32(0.15):\n                    if random.random() > 0.85:\n                        if random.random() > 0.2:\n                            if random.random() > 0.9:\n                                for k in range(j, min(j + 4, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.222:\n                                for k in range(j, min(j + 3, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            elif random.random() > 0.42857:\n                                for k in range(j, min(j + 2, len(probability_matrix[0]))):\n                                    if probability_matrix[i][k] == np.float32(0.15) and (i, k) not in covered:\n                                        inputs[i][k] = self.tokenizer.mask_token_id\n                                        covered.add((i, k))\n                                        cur_token += 1\n                            else:\n                                inputs[i][j] = self.tokenizer.mask_token_id\n                                covered.add((i, j))\n                                cur_token += 1\n\n                        elif random.random() > 0.5:\n                            inputs[i][j] = random.randint(5, len(self.tokenizer) - 1)\n                            cur_token += 1\n                        else:\n                            cur_token += 1\n\n                    else:\n                        labels[i][j] = np.float32(-100)\n\n\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        
inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n\n    def mask_tokens7(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        Prepare masked tokens inputs/labels for masked language modeling with dynamic N-gram masking:\n        selected positions are extended to 1-3 token spans, and selected tokens are replaced with\n        [MASK], a random token, or kept unchanged (roughly 80% / 10% / 10%, as in the original BERT).\n        \"\"\"\n\n        if self.tokenizer.mask_token is None:\n            raise ValueError(\n                \"This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer.\"\n            )\n\n        labels = inputs.clone()\n        inputs = inputs.numpy()\n        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)\n        probability_matrix = torch.full(labels.shape, self.mlm_probability)\n        special_tokens_mask = [\n            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()\n        ]\n        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)\n        if self.tokenizer._pad_token is not None:\n            padding_mask = labels.eq(self.tokenizer.pad_token_id)\n            probability_matrix.masked_fill_(padding_mask, value=0.0)\n\n        covered = set()\n        # candidate span lengths 1..3; pvals is proportional to 1/n, so shorter spans are sampled more often\n        ngrams = np.arange(1, 3 + 1, dtype=np.int64)\n        pvals = 1. / np.arange(1, 3 + 1)\n        pvals /= pvals.sum(keepdims=True)\n\n        probability_matrix = probability_matrix.numpy()\n        labels = labels.numpy()\n        for i in range(len(probability_matrix)):\n            cur_token = 0\n            total_token = 0\n            # count the maskable (non-special, non-padding) positions in this sequence\n            for j in range(len(probability_matrix[0])):\n                if probability_matrix[i][j] == np.float32(0.15):\n                    total_token += 1\n            # mask spans until roughly 15% of the maskable positions are covered\n            for j in range(len(probability_matrix[0])):\n                if cur_token <= total_token * 0.15:\n                    n = np.random.choice(ngrams, p=pvals)\n                    if probability_matrix[i][j] == np.float32(0.15):\n                        for k in range(n):\n                            if j + k >= len(probability_matrix[0]):\n                                break\n                            if (i, j+k) in covered:\n                                continue\n                            if random.random() > 0.85:\n                                if random.random() > 0.2:\n                                    if probability_matrix[i][j+k] == np.float32(0.15):\n                                        inputs[i][j+k] = self.tokenizer.mask_token_id\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                elif random.random() > 0.5:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        inputs[i][j+k] = random.randint(5, len(self.tokenizer) - 1)\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                                else:\n                                    if probability_matrix[i][j + k] == np.float32(0.15):\n                                        covered.add((i, j + k))\n                                        cur_token += 1\n\n                            else:\n                                labels[i][j] = np.float32(-100)\n\n                    else:\n                        labels[i][j] = np.float32(-100)\n                else:\n                    labels[i][j] = np.float32(-100)\n\n        inputs = torch.from_numpy(inputs)\n        labels = torch.from_numpy(labels)\n        return inputs, labels\n\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/datasets/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import GlueDataset, GlueDataTrainingArguments\nfrom .language_modeling import LineByLineTextDataset, TextDataset\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/datasets/glue.py",
    "content": "import logging\nimport os\nimport time\nfrom dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom ...tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors\nfrom ..processors.utils import InputFeatures\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass GlueDataTrainingArguments:\n    \"\"\"\n    Arguments pertaining to what data we are going to input our model for training and eval.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    task_name: str = field(metadata={\"help\": \"The name of the task to train on: \" + \", \".join(glue_processors.keys())})\n    data_dir: str = field(\n        metadata={\"help\": \"The input data dir. Should contain the .tsv files (or other data files) for the task.\"}\n    )\n    max_seq_length: int = field(\n        default=128,\n        metadata={\n            \"help\": \"The maximum total input sequence length after tokenization. Sequences longer \"\n            \"than this will be truncated, sequences shorter will be padded.\"\n        },\n    )\n    overwrite_cache: bool = field(\n        default=False, metadata={\"help\": \"Overwrite the cached training and evaluation sets\"}\n    )\n\n    def __post_init__(self):\n        self.task_name = self.task_name.lower()\n\n\nclass Split(Enum):\n    train = \"train\"\n    dev = \"dev\"\n    test = \"test\"\n\n\nclass GlueDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    args: GlueDataTrainingArguments\n    output_mode: str\n    features: List[InputFeatures]\n\n    def __init__(\n        self,\n        args: GlueDataTrainingArguments,\n        tokenizer: PreTrainedTokenizer,\n        limit_length: Optional[int] = None,\n        mode: Union[str, Split] = Split.train,\n    ):\n        self.args = args\n        self.processor = glue_processors[args.task_name]()\n        self.output_mode = glue_output_modes[args.task_name]\n        if isinstance(mode, str):\n            try:\n                mode = Split[mode]\n            except KeyError:\n                raise KeyError(\"mode is not a valid split name\")\n        # Load data features from cache or dataset file\n        cached_features_file = os.path.join(\n            args.data_dir,\n            \"cached_{}_{}_{}_{}\".format(\n                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,\n            ),\n        )\n        label_list = self.processor.get_labels()\n        if args.task_name in [\"mnli\", \"mnli-mm\"] and tokenizer.__class__ in (\n            RobertaTokenizer,\n            RobertaTokenizerFast,\n            XLMRobertaTokenizer,\n        ):\n            # HACK(label indices are swapped in RoBERTa pretrained model)\n            label_list[1], label_list[2] = label_list[2], label_list[1]\n        self.label_list = label_list\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with 
FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not args.overwrite_cache:\n                start = time.time()\n                self.features = torch.load(cached_features_file)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n            else:\n                logger.info(f\"Creating features from dataset file at {args.data_dir}\")\n\n                if mode == Split.dev:\n                    examples = self.processor.get_dev_examples(args.data_dir)\n                elif mode == Split.test:\n                    examples = self.processor.get_test_examples(args.data_dir)\n                else:\n                    examples = self.processor.get_train_examples(args.data_dir)\n                if limit_length is not None:\n                    examples = examples[:limit_length]\n                self.features = glue_convert_examples_to_features(\n                    examples,\n                    tokenizer,\n                    max_length=args.max_seq_length,\n                    label_list=label_list,\n                    output_mode=self.output_mode,\n                )\n                start = time.time()\n                torch.save(self.features, cached_features_file)\n                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.features)\n\n    def __getitem__(self, i) -> InputFeatures:\n        return self.features[i]\n\n    def get_labels(self):\n        return self.label_list\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/datasets/language_modeling.py",
    "content": "import logging\nimport os\nimport pickle\nimport time\n\nimport torch\nfrom filelock import FileLock\nfrom torch.utils.data.dataset import Dataset\n\nfrom ...tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(\n        self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,\n    ):\n        assert os.path.isfile(file_path)\n\n        block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)\n\n        directory, filename = os.path.split(file_path)\n        cached_features_file = os.path.join(\n            directory, \"cached_lm_{}_{}_{}\".format(tokenizer.__class__.__name__, str(block_size), filename,),\n        )\n\n        # Make sure only the first process in distributed training processes the dataset,\n        # and the others will use the cache.\n        lock_path = cached_features_file + \".lock\"\n        with FileLock(lock_path):\n\n            if os.path.exists(cached_features_file) and not overwrite_cache:\n                start = time.time()\n                with open(cached_features_file, \"rb\") as handle:\n                    self.examples = pickle.load(handle)\n                logger.info(\n                    f\"Loading features from cached file {cached_features_file} [took %.3f s]\", time.time() - start\n                )\n\n            else:\n                logger.info(f\"Creating features from dataset file at {directory}\")\n\n                self.examples = []\n                with open(file_path, encoding=\"utf-8\") as f:\n                    text = f.read()\n\n                tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))\n\n                for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size\n                    self.examples.append(\n                        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])\n                    )\n                # Note that we are losing the last truncated example here for the sake of simplicity (no padding)\n                # If your dataset is small, first you should loook for a bigger one :-) and second you\n                # can change this behavior by adding (model specific) padding.\n\n                start = time.time()\n                with open(cached_features_file, \"wb\") as handle:\n                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)\n                logger.info(\n                    \"Saving features into cached file %s [took %.3f s]\", cached_features_file, time.time() - start\n                )\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n\n\nclass LineByLineTextDataset(Dataset):\n    \"\"\"\n    This will be superseded by a framework-agnostic approach\n    soon.\n    \"\"\"\n\n    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):\n        assert os.path.isfile(file_path)\n        # Here, we do not cache the features, operating under the assumption\n        # that we will soon use fast multithreaded tokenizers from the\n        # `tokenizers` repo everywhere =)\n        logger.info(\"Creating features from dataset file at %s\", file_path)\n\n        with 
open(file_path, encoding=\"utf-8\") as f:\n            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]\n\n        batch_encoding = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)\n        self.examples = batch_encoding[\"input_ids\"]\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, i) -> torch.Tensor:\n        return torch.tensor(self.examples[i], dtype=torch.long)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/metrics/__init__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\ntry:\n    from scipy.stats import pearsonr, spearmanr\n    from sklearn.metrics import matthews_corrcoef, f1_score\n\n    _has_sklearn = True\nexcept (AttributeError, ImportError):\n    _has_sklearn = False\n\n\ndef is_sklearn_available():\n    return _has_sklearn\n\n\nif _has_sklearn:\n\n    def simple_accuracy(preds, labels):\n        return (preds == labels).mean()\n\n    def acc_and_f1(preds, labels):\n        acc = simple_accuracy(preds, labels)\n        f1 = f1_score(y_true=labels, y_pred=preds)\n        return {\n            \"acc\": acc,\n            \"f1\": f1,\n            \"acc_and_f1\": (acc + f1) / 2,\n        }\n\n    def pearson_and_spearman(preds, labels):\n        pearson_corr = pearsonr(preds, labels)[0]\n        spearman_corr = spearmanr(preds, labels)[0]\n        return {\n            \"pearson\": pearson_corr,\n            \"spearmanr\": spearman_corr,\n            \"corr\": (pearson_corr + spearman_corr) / 2,\n        }\n\n    def glue_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"cola\":\n            return {\"mcc\": matthews_corrcoef(labels, preds)}\n        elif task_name == \"sst-2\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mrpc\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"sts-b\":\n            return pearson_and_spearman(preds, labels)\n        elif task_name == \"qqp\":\n            return acc_and_f1(preds, labels)\n        elif task_name == \"mnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"mnli-mm\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"qnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"rte\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"wnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        elif task_name == \"hans\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n\n    def xnli_compute_metrics(task_name, preds, labels):\n        assert len(preds) == len(labels)\n        if task_name == \"xnli\":\n            return {\"acc\": simple_accuracy(preds, labels)}\n        else:\n            raise KeyError(task_name)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/metrics/squad_metrics.py",
    "content": "\"\"\" Very heavily inspired by the official evaluation script for SQuAD version 2.0 which was\nmodified by XLNet authors to update `find_best_threshold` scripts for SQuAD V2.0\n\nIn addition to basic functionality, we also compute additional statistics and\nplot precision-recall curves if an additional na_prob.json file is provided.\nThis file is expected to map question ID's to the model's predicted probability\nthat a question is unanswerable.\n\"\"\"\n\n\nimport collections\nimport json\nimport logging\nimport math\nimport re\nimport string\n\nfrom transformers.tokenization_bert import BasicTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef normalize_answer(s):\n    \"\"\"Lower text and remove punctuation, articles and extra whitespace.\"\"\"\n\n    def remove_articles(text):\n        regex = re.compile(r\"\\b(a|an|the)\\b\", re.UNICODE)\n        return re.sub(regex, \" \", text)\n\n    def white_space_fix(text):\n        return \" \".join(text.split())\n\n    def remove_punc(text):\n        exclude = set(string.punctuation)\n        return \"\".join(ch for ch in text if ch not in exclude)\n\n    def lower(text):\n        return text.lower()\n\n    return white_space_fix(remove_articles(remove_punc(lower(s))))\n\n\ndef get_tokens(s):\n    if not s:\n        return []\n    return normalize_answer(s).split()\n\n\ndef compute_exact(a_gold, a_pred):\n    return int(normalize_answer(a_gold) == normalize_answer(a_pred))\n\n\ndef compute_f1(a_gold, a_pred):\n    gold_toks = get_tokens(a_gold)\n    pred_toks = get_tokens(a_pred)\n    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)\n    num_same = sum(common.values())\n    if len(gold_toks) == 0 or len(pred_toks) == 0:\n        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise\n        return int(gold_toks == pred_toks)\n    if num_same == 0:\n        return 0\n    precision = 1.0 * num_same / len(pred_toks)\n    recall = 1.0 * num_same / len(gold_toks)\n    f1 = (2 * precision * recall) / (precision + recall)\n    return f1\n\n\ndef get_raw_scores(examples, preds):\n    \"\"\"\n    Computes the exact and f1 scores from the examples and the model predictions\n    \"\"\"\n    exact_scores = {}\n    f1_scores = {}\n\n    for example in examples:\n        qas_id = example.qas_id\n        gold_answers = [answer[\"text\"] for answer in example.answers if normalize_answer(answer[\"text\"])]\n\n        if not gold_answers:\n            # For unanswerable questions, only correct answer is empty string\n            gold_answers = [\"\"]\n\n        if qas_id not in preds:\n            print(\"Missing prediction for %s\" % qas_id)\n            continue\n\n        prediction = preds[qas_id]\n        exact_scores[qas_id] = max(compute_exact(a, prediction) for a in gold_answers)\n        f1_scores[qas_id] = max(compute_f1(a, prediction) for a in gold_answers)\n\n    return exact_scores, f1_scores\n\n\ndef apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):\n    new_scores = {}\n    for qid, s in scores.items():\n        pred_na = na_probs[qid] > na_prob_thresh\n        if pred_na:\n            new_scores[qid] = float(not qid_to_has_ans[qid])\n        else:\n            new_scores[qid] = s\n    return new_scores\n\n\ndef make_eval_dict(exact_scores, f1_scores, qid_list=None):\n    if not qid_list:\n        total = len(exact_scores)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores.values()) / total),\n            
    (\"f1\", 100.0 * sum(f1_scores.values()) / total),\n                (\"total\", total),\n            ]\n        )\n    else:\n        total = len(qid_list)\n        return collections.OrderedDict(\n            [\n                (\"exact\", 100.0 * sum(exact_scores[k] for k in qid_list) / total),\n                (\"f1\", 100.0 * sum(f1_scores[k] for k in qid_list) / total),\n                (\"total\", total),\n            ]\n        )\n\n\ndef merge_eval(main_eval, new_eval, prefix):\n    for k in new_eval:\n        main_eval[\"%s_%s\" % (prefix, k)] = new_eval[k]\n\n\ndef find_best_thresh_v2(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for i, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n\n    has_ans_score, has_ans_cnt = 0, 0\n    for qid in qid_list:\n        if not qid_to_has_ans[qid]:\n            continue\n        has_ans_cnt += 1\n\n        if qid not in scores:\n            continue\n        has_ans_score += scores[qid]\n\n    return 100.0 * best_score / len(scores), best_thresh, 1.0 * has_ans_score / has_ans_cnt\n\n\ndef find_all_best_thresh_v2(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh, has_ans_exact = find_best_thresh_v2(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh, has_ans_f1 = find_best_thresh_v2(preds, f1_raw, na_probs, qid_to_has_ans)\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n    main_eval[\"has_ans_exact\"] = has_ans_exact\n    main_eval[\"has_ans_f1\"] = has_ans_f1\n\n\ndef find_best_thresh(preds, scores, na_probs, qid_to_has_ans):\n    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])\n    cur_score = num_no_ans\n    best_score = cur_score\n    best_thresh = 0.0\n    qid_list = sorted(na_probs, key=lambda k: na_probs[k])\n    for _, qid in enumerate(qid_list):\n        if qid not in scores:\n            continue\n        if qid_to_has_ans[qid]:\n            diff = scores[qid]\n        else:\n            if preds[qid]:\n                diff = -1\n            else:\n                diff = 0\n        cur_score += diff\n        if cur_score > best_score:\n            best_score = cur_score\n            best_thresh = na_probs[qid]\n    return 100.0 * best_score / len(scores), best_thresh\n\n\ndef find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):\n    best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)\n    best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)\n\n    main_eval[\"best_exact\"] = best_exact\n    main_eval[\"best_exact_thresh\"] = exact_thresh\n    main_eval[\"best_f1\"] = best_f1\n    main_eval[\"best_f1_thresh\"] = f1_thresh\n\n\ndef squad_evaluate(examples, preds, no_answer_probs=None, no_answer_probability_threshold=1.0):\n    qas_id_to_has_answer = {example.qas_id: 
bool(example.answers) for example in examples}\n    has_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if has_answer]\n    no_answer_qids = [qas_id for qas_id, has_answer in qas_id_to_has_answer.items() if not has_answer]\n\n    if no_answer_probs is None:\n        no_answer_probs = {k: 0.0 for k in preds}\n\n    exact, f1 = get_raw_scores(examples, preds)\n\n    exact_threshold = apply_no_ans_threshold(\n        exact, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold\n    )\n    f1_threshold = apply_no_ans_threshold(f1, no_answer_probs, qas_id_to_has_answer, no_answer_probability_threshold)\n\n    evaluation = make_eval_dict(exact_threshold, f1_threshold)\n\n    if has_answer_qids:\n        has_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=has_answer_qids)\n        merge_eval(evaluation, has_ans_eval, \"HasAns\")\n\n    if no_answer_qids:\n        no_ans_eval = make_eval_dict(exact_threshold, f1_threshold, qid_list=no_answer_qids)\n        merge_eval(evaluation, no_ans_eval, \"NoAns\")\n\n    if no_answer_probs:\n        find_all_best_thresh(evaluation, preds, exact, f1, no_answer_probs, qas_id_to_has_answer)\n\n    return evaluation\n\n\ndef get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):\n    \"\"\"Project the tokenized prediction back to the original text.\"\"\"\n\n    # When we created the data, we kept track of the alignment between original\n    # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So\n    # now `orig_text` contains the span of our original text corresponding to the\n    # span that we predicted.\n    #\n    # However, `orig_text` may contain extra characters that we don't want in\n    # our prediction.\n    #\n    # For example, let's say:\n    #   pred_text = steve smith\n    #   orig_text = Steve Smith's\n    #\n    # We don't want to return `orig_text` because it contains the extra \"'s\".\n    #\n    # We don't want to return `pred_text` because it's already been normalized\n    # (the SQuAD eval script also does punctuation stripping/lower casing but\n    # our tokenizer does additional normalization like stripping accent\n    # characters).\n    #\n    # What we really want to return is \"Steve Smith\".\n    #\n    # Therefore, we have to apply a semi-complicated alignment heuristic between\n    # `pred_text` and `orig_text` to get a character-to-character alignment. This\n    # can fail in certain cases in which case we just return `orig_text`.\n\n    def _strip_spaces(text):\n        ns_chars = []\n        ns_to_s_map = collections.OrderedDict()\n        for (i, c) in enumerate(text):\n            if c == \" \":\n                continue\n            ns_to_s_map[len(ns_chars)] = i\n            ns_chars.append(c)\n        ns_text = \"\".join(ns_chars)\n        return (ns_text, ns_to_s_map)\n\n    # We first tokenize `orig_text`, strip whitespace from the result\n    # and `pred_text`, and check if they are the same length. If they are\n    # NOT the same length, the heuristic has failed. 
If they are the same\n    # length, we assume the characters are one-to-one aligned.\n    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)\n\n    tok_text = \" \".join(tokenizer.tokenize(orig_text))\n\n    start_position = tok_text.find(pred_text)\n    if start_position == -1:\n        if verbose_logging:\n            logger.info(\"Unable to find text: '%s' in '%s'\" % (pred_text, orig_text))\n        return orig_text\n    end_position = start_position + len(pred_text) - 1\n\n    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)\n    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)\n\n    if len(orig_ns_text) != len(tok_ns_text):\n        if verbose_logging:\n            logger.info(\"Length not equal after stripping spaces: '%s' vs '%s'\", orig_ns_text, tok_ns_text)\n        return orig_text\n\n    # We then project the characters in `pred_text` back to `orig_text` using\n    # the character-to-character alignment.\n    tok_s_to_ns_map = {}\n    for (i, tok_index) in tok_ns_to_s_map.items():\n        tok_s_to_ns_map[tok_index] = i\n\n    orig_start_position = None\n    if start_position in tok_s_to_ns_map:\n        ns_start_position = tok_s_to_ns_map[start_position]\n        if ns_start_position in orig_ns_to_s_map:\n            orig_start_position = orig_ns_to_s_map[ns_start_position]\n\n    if orig_start_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map start position\")\n        return orig_text\n\n    orig_end_position = None\n    if end_position in tok_s_to_ns_map:\n        ns_end_position = tok_s_to_ns_map[end_position]\n        if ns_end_position in orig_ns_to_s_map:\n            orig_end_position = orig_ns_to_s_map[ns_end_position]\n\n    if orig_end_position is None:\n        if verbose_logging:\n            logger.info(\"Couldn't map end position\")\n        return orig_text\n\n    output_text = orig_text[orig_start_position : (orig_end_position + 1)]\n    return output_text\n\n\ndef _get_best_indexes(logits, n_best_size):\n    \"\"\"Get the n-best logits from a list.\"\"\"\n    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)\n\n    best_indexes = []\n    for i in range(len(index_and_score)):\n        if i >= n_best_size:\n            break\n        best_indexes.append(index_and_score[i][0])\n    return best_indexes\n\n\ndef _compute_softmax(scores):\n    \"\"\"Compute softmax probability over raw logits.\"\"\"\n    if not scores:\n        return []\n\n    max_score = None\n    for score in scores:\n        if max_score is None or score > max_score:\n            max_score = score\n\n    exp_scores = []\n    total_sum = 0.0\n    for score in scores:\n        x = math.exp(score - max_score)\n        exp_scores.append(x)\n        total_sum += x\n\n    probs = []\n    for score in exp_scores:\n        probs.append(score / total_sum)\n    return probs\n\n\ndef compute_predictions_logits(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    do_lower_case,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    verbose_logging,\n    version_2_with_negative,\n    null_score_diff_threshold,\n    tokenizer,\n):\n    \"\"\"Write final predictions to the json file and log-odds of null if needed.\"\"\"\n    if output_prediction_file:\n        logger.info(f\"Writing predictions to: {output_prediction_file}\")\n    if output_nbest_file:\n        logger.info(f\"Writing nbest to: {output_nbest_file}\")\n    if 
output_null_log_odds_file and version_2_with_negative:\n        logger.info(f\"Writing null_log_odds to: {output_null_log_odds_file}\")\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_logit\", \"end_logit\"]\n    )\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n        min_null_feature_index = 0  # the paragraph slice with min null score\n        null_start_logit = 0  # the start logit at the slice with min null score\n        null_end_logit = 0  # the end logit at the slice with min null score\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n            start_indexes = _get_best_indexes(result.start_logits, n_best_size)\n            end_indexes = _get_best_indexes(result.end_logits, n_best_size)\n            # if we could have irrelevant answers, get the min score of irrelevant\n            if version_2_with_negative:\n                feature_null_score = result.start_logits[0] + result.end_logits[0]\n                if feature_null_score < score_null:\n                    score_null = feature_null_score\n                    min_null_feature_index = feature_index\n                    null_start_logit = result.start_logits[0]\n                    null_end_logit = result.end_logits[0]\n            for start_index in start_indexes:\n                for end_index in end_indexes:\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= len(feature.tokens):\n                        continue\n                    if end_index >= len(feature.tokens):\n                        continue\n                    if start_index not in feature.token_to_orig_map:\n                        continue\n                    if end_index not in feature.token_to_orig_map:\n                        continue\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_logit=result.start_logits[start_index],\n                            end_logit=result.end_logits[end_index],\n                        )\n                    )\n        if version_2_with_negative:\n            prelim_predictions.append(\n                _PrelimPrediction(\n                    feature_index=min_null_feature_index,\n                    start_index=0,\n                    end_index=0,\n                    start_logit=null_start_logit,\n                    end_logit=null_end_logit,\n                )\n            )\n        prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True)\n\n        _NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n            \"NbestPrediction\", [\"text\", \"start_logit\", \"end_logit\"]\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n            if pred.start_index > 0:  # this is a non-null prediction\n                tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n                orig_doc_start = feature.token_to_orig_map[pred.start_index]\n                orig_doc_end = feature.token_to_orig_map[pred.end_index]\n                orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n\n                tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n                # tok_text = \" \".join(tok_tokens)\n                #\n                # # De-tokenize WordPieces that have been split off.\n                # tok_text = tok_text.replace(\" ##\", \"\")\n                # tok_text = tok_text.replace(\"##\", \"\")\n\n                # Clean whitespace\n                tok_text = tok_text.strip()\n                tok_text = \" \".join(tok_text.split())\n                orig_text = \" \".join(orig_tokens)\n\n                final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n                if final_text in seen_predictions:\n                    continue\n\n                seen_predictions[final_text] = True\n            else:\n                final_text = \"\"\n                seen_predictions[final_text] = True\n\n            nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit))\n        # if we didn't include the 
empty option in the n-best, include it\n        if version_2_with_negative:\n            if \"\" not in seen_predictions:\n                nbest.append(_NbestPrediction(text=\"\", start_logit=null_start_logit, end_logit=null_end_logit))\n\n            # In very rare edge cases we could only have single null prediction.\n            # So we just create a nonce prediction in this case to avoid failure.\n            if len(nbest) == 1:\n                nbest.insert(0, _NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        # In very rare edge cases we could have no valid predictions. So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"empty\", start_logit=0.0, end_logit=0.0))\n\n        assert len(nbest) >= 1\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_logit + entry.end_logit)\n            if not best_non_null_entry:\n                if entry.text:\n                    best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_logit\"] = entry.start_logit\n            output[\"end_logit\"] = entry.end_logit\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n\n        if not version_2_with_negative:\n            all_predictions[example.qas_id] = nbest_json[0][\"text\"]\n        else:\n            # predict \"\" iff the null score - the score of best non-null > threshold\n            score_diff = score_null - best_non_null_entry.start_logit - (best_non_null_entry.end_logit)\n            scores_diff_json[example.qas_id] = score_diff\n            if score_diff > null_score_diff_threshold:\n                all_predictions[example.qas_id] = \"\"\n            else:\n                all_predictions[example.qas_id] = best_non_null_entry.text\n        all_nbest_json[example.qas_id] = nbest_json\n\n    if output_prediction_file:\n        with open(output_prediction_file, \"w\") as writer:\n            writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    if output_nbest_file:\n        with open(output_nbest_file, \"w\") as writer:\n            writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if output_null_log_odds_file and version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n\n\ndef compute_predictions_log_probs(\n    all_examples,\n    all_features,\n    all_results,\n    n_best_size,\n    max_answer_length,\n    output_prediction_file,\n    output_nbest_file,\n    output_null_log_odds_file,\n    start_n_top,\n    end_n_top,\n    version_2_with_negative,\n    tokenizer,\n    verbose_logging,\n):\n    \"\"\" XLNet write prediction logic (more complex than Bert's).\n        Write final predictions to the json file and log-odds of null if needed.\n\n        Requires utils_squad_evaluate.py\n    \"\"\"\n    _PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name\n        \"PrelimPrediction\", [\"feature_index\", \"start_index\", \"end_index\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    _NbestPrediction = 
collections.namedtuple(  # pylint: disable=invalid-name\n        \"NbestPrediction\", [\"text\", \"start_log_prob\", \"end_log_prob\"]\n    )\n\n    logger.info(\"Writing predictions to: %s\", output_prediction_file)\n    # logger.info(\"Writing nbest to: %s\" % (output_nbest_file))\n\n    example_index_to_features = collections.defaultdict(list)\n    for feature in all_features:\n        example_index_to_features[feature.example_index].append(feature)\n\n    unique_id_to_result = {}\n    for result in all_results:\n        unique_id_to_result[result.unique_id] = result\n\n    all_predictions = collections.OrderedDict()\n    all_nbest_json = collections.OrderedDict()\n    scores_diff_json = collections.OrderedDict()\n\n    for (example_index, example) in enumerate(all_examples):\n        features = example_index_to_features[example_index]\n\n        prelim_predictions = []\n        # keep track of the minimum score of null start+end of position 0\n        score_null = 1000000  # large and positive\n\n        for (feature_index, feature) in enumerate(features):\n            result = unique_id_to_result[feature.unique_id]\n\n            cur_null_score = result.cls_logits\n\n            # if we could have irrelevant answers, get the min score of irrelevant\n            score_null = min(score_null, cur_null_score)\n\n            for i in range(start_n_top):\n                for j in range(end_n_top):\n                    start_log_prob = result.start_logits[i]\n                    start_index = result.start_top_index[i]\n\n                    j_index = i * end_n_top + j\n\n                    end_log_prob = result.end_logits[j_index]\n                    end_index = result.end_top_index[j_index]\n\n                    # We could hypothetically create invalid predictions, e.g., predict\n                    # that the start of the span is in the question. 
We throw out all\n                    # invalid predictions.\n                    if start_index >= feature.paragraph_len - 1:\n                        continue\n                    if end_index >= feature.paragraph_len - 1:\n                        continue\n\n                    if not feature.token_is_max_context.get(start_index, False):\n                        continue\n                    if end_index < start_index:\n                        continue\n                    length = end_index - start_index + 1\n                    if length > max_answer_length:\n                        continue\n\n                    prelim_predictions.append(\n                        _PrelimPrediction(\n                            feature_index=feature_index,\n                            start_index=start_index,\n                            end_index=end_index,\n                            start_log_prob=start_log_prob,\n                            end_log_prob=end_log_prob,\n                        )\n                    )\n\n        prelim_predictions = sorted(\n            prelim_predictions, key=lambda x: (x.start_log_prob + x.end_log_prob), reverse=True\n        )\n\n        seen_predictions = {}\n        nbest = []\n        for pred in prelim_predictions:\n            if len(nbest) >= n_best_size:\n                break\n            feature = features[pred.feature_index]\n\n            # XLNet un-tokenizer\n            # Let's keep it simple for now and see if we need all this later.\n            #\n            # tok_start_to_orig_index = feature.tok_start_to_orig_index\n            # tok_end_to_orig_index = feature.tok_end_to_orig_index\n            # start_orig_pos = tok_start_to_orig_index[pred.start_index]\n            # end_orig_pos = tok_end_to_orig_index[pred.end_index]\n            # paragraph_text = example.paragraph_text\n            # final_text = paragraph_text[start_orig_pos: end_orig_pos + 1].strip()\n\n            # Previously used Bert untokenizer\n            tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)]\n            orig_doc_start = feature.token_to_orig_map[pred.start_index]\n            orig_doc_end = feature.token_to_orig_map[pred.end_index]\n            orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)]\n            tok_text = tokenizer.convert_tokens_to_string(tok_tokens)\n\n            # Clean whitespace\n            tok_text = tok_text.strip()\n            tok_text = \" \".join(tok_text.split())\n            orig_text = \" \".join(orig_tokens)\n\n            if hasattr(tokenizer, \"do_lower_case\"):\n                do_lower_case = tokenizer.do_lower_case\n            else:\n                do_lower_case = tokenizer.do_lowercase_and_remove_accent\n\n            final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)\n\n            if final_text in seen_predictions:\n                continue\n\n            seen_predictions[final_text] = True\n\n            nbest.append(\n                _NbestPrediction(text=final_text, start_log_prob=pred.start_log_prob, end_log_prob=pred.end_log_prob)\n            )\n\n        # In very rare edge cases we could have no valid predictions. 
So we\n        # just create a nonce prediction in this case to avoid failure.\n        if not nbest:\n            nbest.append(_NbestPrediction(text=\"\", start_log_prob=-1e6, end_log_prob=-1e6))\n\n        total_scores = []\n        best_non_null_entry = None\n        for entry in nbest:\n            total_scores.append(entry.start_log_prob + entry.end_log_prob)\n            if not best_non_null_entry:\n                best_non_null_entry = entry\n\n        probs = _compute_softmax(total_scores)\n\n        nbest_json = []\n        for (i, entry) in enumerate(nbest):\n            output = collections.OrderedDict()\n            output[\"text\"] = entry.text\n            output[\"probability\"] = probs[i]\n            output[\"start_log_prob\"] = entry.start_log_prob\n            output[\"end_log_prob\"] = entry.end_log_prob\n            nbest_json.append(output)\n\n        assert len(nbest_json) >= 1\n        assert best_non_null_entry is not None\n\n        score_diff = score_null\n        scores_diff_json[example.qas_id] = score_diff\n        # note(zhiliny): always predict best_non_null_entry\n        # and the evaluation script will search for the best threshold\n        all_predictions[example.qas_id] = best_non_null_entry.text\n\n        all_nbest_json[example.qas_id] = nbest_json\n\n    with open(output_prediction_file, \"w\") as writer:\n        writer.write(json.dumps(all_predictions, indent=4) + \"\\n\")\n\n    with open(output_nbest_file, \"w\") as writer:\n        writer.write(json.dumps(all_nbest_json, indent=4) + \"\\n\")\n\n    if version_2_with_negative:\n        with open(output_null_log_odds_file, \"w\") as writer:\n            writer.write(json.dumps(scores_diff_json, indent=4) + \"\\n\")\n\n    return all_predictions\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/processors/__init__.py",
    "content": "# flake8: noqa\n# There's no way to ignore \"F401 '...' imported but unused\" warnings in this\n# module, but to preserve other warnings. So, don't check this module at all.\n\nfrom .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels\nfrom .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features\nfrom .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor\nfrom .xnli import xnli_output_modes, xnli_processors, xnli_tasks_num_labels\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/processors/glue.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" GLUE processors and helpers \"\"\"\n\nimport logging\nimport os\nfrom enum import Enum\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available\nfrom ...tokenization_utils import PreTrainedTokenizer\nfrom .utils import DataProcessor, InputExample, InputFeatures\n\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef glue_convert_examples_to_features(\n    examples: Union[List[InputExample], \"tf.data.Dataset\"],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    \"\"\"\n    Loads a data file into a list of ``InputFeatures``\n\n    Args:\n        examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.\n        tokenizer: Instance of a tokenizer that will tokenize the examples\n        max_length: Maximum example length. Defaults to the tokenizer's max_len\n        task: GLUE task\n        label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n        output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n\n    Returns:\n        If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n        containing the task-specific features. 
If the input is a list of ``InputExamples``, will return\n        a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n    \"\"\"\n    if is_tf_available() and isinstance(examples, tf.data.Dataset):\n        if task is None:\n            raise ValueError(\"When calling glue_convert_examples_to_features from TF, the task parameter is required.\")\n        return _tf_glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n    return _glue_convert_examples_to_features(\n        examples, tokenizer, max_length=max_length, task=task, label_list=label_list, output_mode=output_mode\n    )\n\n\nif is_tf_available():\n\n    def _tf_glue_convert_examples_to_features(\n        examples: tf.data.Dataset, tokenizer: PreTrainedTokenizer, task=str, max_length: Optional[int] = None,\n    ) -> tf.data.Dataset:\n        \"\"\"\n        Returns:\n            A ``tf.data.Dataset`` containing the task-specific features.\n\n        \"\"\"\n        processor = glue_processors[task]()\n        examples = [processor.tfds_map(processor.get_example_from_tensor_dict(example)) for example in examples]\n        features = glue_convert_examples_to_features(examples, tokenizer, max_length=max_length, task=task)\n\n        def gen():\n            for ex in features:\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                    },\n                    ex.label,\n                )\n\n        return tf.data.Dataset.from_generator(\n            gen,\n            ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32, \"token_type_ids\": tf.int32}, tf.int64),\n            (\n                {\n                    \"input_ids\": tf.TensorShape([None]),\n                    \"attention_mask\": tf.TensorShape([None]),\n                    \"token_type_ids\": tf.TensorShape([None]),\n                },\n                tf.TensorShape([]),\n            ),\n        )\n\n\ndef _glue_convert_examples_to_features(\n    examples: List[InputExample],\n    tokenizer: PreTrainedTokenizer,\n    max_length: Optional[int] = None,\n    task=None,\n    label_list=None,\n    output_mode=None,\n):\n    if max_length is None:\n        max_length = tokenizer.max_len\n\n    if task is not None:\n        processor = glue_processors[task]()\n        if label_list is None:\n            label_list = processor.get_labels()\n            logger.info(\"Using label list %s for task %s\" % (label_list, task))\n        if output_mode is None:\n            output_mode = glue_output_modes[task]\n            logger.info(\"Using output mode %s for task %s\" % (output_mode, task))\n\n    label_map = {label: i for i, label in enumerate(label_list)}\n\n    def label_from_example(example: InputExample) -> Union[int, float, None]:\n        if example.label is None:\n            return None\n        if output_mode == \"classification\":\n            return label_map[example.label]\n        elif output_mode == \"regression\":\n            return float(example.label)\n        raise KeyError(output_mode)\n\n    labels = [label_from_example(example) for example in examples]\n\n    batch_encoding = tokenizer.batch_encode_plus(\n        [(example.text_a, example.text_b) for example in examples], max_length=max_length, pad_to_max_length=True,\n    )\n\n    features = []\n    for i in range(len(examples)):\n        inputs = {k: 
batch_encoding[k][i] for k in batch_encoding}\n\n        feature = InputFeatures(**inputs, label=labels[i])\n        features.append(feature)\n\n    for i, example in enumerate(examples[:5]):\n        logger.info(\"*** Example ***\")\n        logger.info(\"guid: %s\" % (example.guid))\n        logger.info(\"features: %s\" % features[i])\n\n    return features\n\n\nclass OutputMode(Enum):\n    classification = \"classification\"\n    regression = \"regression\"\n\n\nclass MrpcProcessor(DataProcessor):\n    \"\"\"Processor for the MRPC data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        logger.info(\"LOOKING AT {}\".format(os.path.join(data_dir, \"train.tsv\")))\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[3]\n            text_b = line[4]\n            label = None if set_type == \"test\" else line[0]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliProcessor(DataProcessor):\n    \"\"\"Processor for the MultiNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"premise\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"hypothesis\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_matched.tsv\")), \"dev_matched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_matched.tsv\")), \"test_matched\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n   
     for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[8]\n            text_b = line[9]\n            label = None if set_type.startswith(\"test\") else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass MnliMismatchedProcessor(MnliProcessor):\n    \"\"\"Processor for the MultiNLI Mismatched data set (GLUE version).\"\"\"\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev_mismatched.tsv\")), \"dev_mismatched\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test_mismatched.tsv\")), \"test_mismatched\")\n\n\nclass ColaProcessor(DataProcessor):\n    \"\"\"Processor for the CoLA data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        if test_mode:\n            lines = lines[1:]\n        text_index = 1 if test_mode else 3\n        examples = []\n        for (i, line) in enumerate(lines):\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if test_mode else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass Sst2Processor(DataProcessor):\n    \"\"\"Processor for the SST-2 data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            None,\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return 
self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        text_index = 1 if set_type == \"test\" else 0\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, i)\n            text_a = line[text_index]\n            label = None if set_type == \"test\" else line[1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n        return examples\n\n\nclass StsbProcessor(DataProcessor):\n    \"\"\"Processor for the STS-B data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [None]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[7]\n            text_b = line[8]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QqpProcessor(DataProcessor):\n    \"\"\"Processor for the QQP data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"question2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def 
_create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        test_mode = set_type == \"test\"\n        q1_index = 1 if test_mode else 3\n        q2_index = 2 if test_mode else 4\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            try:\n                text_a = line[q1_index]\n                text_b = line[q2_index]\n                label = None if test_mode else line[5]\n            except IndexError:\n                continue\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass QnliProcessor(DataProcessor):\n    \"\"\"Processor for the QNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", \"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass RteProcessor(DataProcessor):\n    \"\"\"Processor for the RTE data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"entailment\", 
\"not_entailment\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nclass WnliProcessor(DataProcessor):\n    \"\"\"Processor for the WNLI data set (GLUE version).\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"See base class.\"\"\"\n        return InputExample(\n            tensor_dict[\"idx\"].numpy(),\n            tensor_dict[\"sentence1\"].numpy().decode(\"utf-8\"),\n            tensor_dict[\"sentence2\"].numpy().decode(\"utf-8\"),\n            str(tensor_dict[\"label\"].numpy()),\n        )\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"train.tsv\")), \"train\")\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"dev.tsv\")), \"dev\")\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        return self._create_examples(self._read_tsv(os.path.join(data_dir, \"test.tsv\")), \"test\")\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"0\", \"1\"]\n\n    def _create_examples(self, lines, set_type):\n        \"\"\"Creates examples for the training, dev and test sets.\"\"\"\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (set_type, line[0])\n            text_a = line[1]\n            text_b = line[2]\n            label = None if set_type == \"test\" else line[-1]\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n\nglue_tasks_num_labels = {\n    \"cola\": 2,\n    \"mnli\": 3,\n    \"mrpc\": 2,\n    \"sst-2\": 2,\n    \"sts-b\": 1,\n    \"qqp\": 2,\n    \"qnli\": 2,\n    \"rte\": 2,\n    \"wnli\": 2,\n}\n\nglue_processors = {\n    \"cola\": ColaProcessor,\n    \"mnli\": MnliProcessor,\n    \"mnli-mm\": MnliMismatchedProcessor,\n    \"mrpc\": MrpcProcessor,\n    \"sst-2\": Sst2Processor,\n    \"sts-b\": StsbProcessor,\n    \"qqp\": QqpProcessor,\n    \"qnli\": QnliProcessor,\n    \"rte\": RteProcessor,\n    \"wnli\": WnliProcessor,\n}\n\nglue_output_modes = {\n    \"cola\": \"classification\",\n    \"mnli\": \"classification\",\n    \"mnli-mm\": \"classification\",\n    \"mrpc\": \"classification\",\n    \"sst-2\": \"classification\",\n    \"sts-b\": \"regression\",\n    \"qqp\": \"classification\",\n    \"qnli\": \"classification\",\n    \"rte\": \"classification\",\n    \"wnli\": \"classification\",\n}\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/processors/squad.py",
    "content": "import json\nimport logging\nimport os\nfrom functools import partial\nfrom multiprocessing import Pool, cpu_count\n\nimport numpy as np\nfrom tqdm import tqdm\n\nfrom ...file_utils import is_tf_available, is_torch_available\nfrom ...tokenization_bert import whitespace_tokenize\nfrom .utils import DataProcessor\n\n\nif is_torch_available():\n    import torch\n    from torch.utils.data import TensorDataset\n\nif is_tf_available():\n    import tensorflow as tf\n\nlogger = logging.getLogger(__name__)\n\n\ndef _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):\n    \"\"\"Returns tokenized answer spans that better match the annotated answer.\"\"\"\n    tok_answer_text = \" \".join(tokenizer.tokenize(orig_answer_text))\n\n    for new_start in range(input_start, input_end + 1):\n        for new_end in range(input_end, new_start - 1, -1):\n            text_span = \" \".join(doc_tokens[new_start : (new_end + 1)])\n            if text_span == tok_answer_text:\n                return (new_start, new_end)\n\n    return (input_start, input_end)\n\n\ndef _check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span.start + doc_span.length - 1\n        if position < doc_span.start:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span.start\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _new_check_is_max_context(doc_spans, cur_span_index, position):\n    \"\"\"Check if this is the 'max context' doc span for the token.\"\"\"\n    # if len(doc_spans) == 1:\n    # return True\n    best_score = None\n    best_span_index = None\n    for (span_index, doc_span) in enumerate(doc_spans):\n        end = doc_span[\"start\"] + doc_span[\"length\"] - 1\n        if position < doc_span[\"start\"]:\n            continue\n        if position > end:\n            continue\n        num_left_context = position - doc_span[\"start\"]\n        num_right_context = end - position\n        score = min(num_left_context, num_right_context) + 0.01 * doc_span[\"length\"]\n        if best_score is None or score > best_score:\n            best_score = score\n            best_span_index = span_index\n\n    return cur_span_index == best_span_index\n\n\ndef _is_whitespace(c):\n    if c == \" \" or c == \"\\t\" or c == \"\\r\" or c == \"\\n\" or ord(c) == 0x202F:\n        return True\n    return False\n\n\ndef squad_convert_example_to_features(example, max_seq_length, doc_stride, max_query_length, is_training):\n    features = []\n    if is_training and not example.is_impossible:\n        # Get start and end position\n        start_position = example.start_position\n        end_position = example.end_position\n\n        # If the answer cannot be found in the text, then skip this example.\n        actual_text = \" \".join(example.doc_tokens[start_position : (end_position + 1)])\n        cleaned_answer_text = \" \".join(whitespace_tokenize(example.answer_text))\n        if actual_text.find(cleaned_answer_text) == -1:\n            logger.warning(\"Could not find 
answer: '%s' vs. '%s'\", actual_text, cleaned_answer_text)\n            return []\n\n    tok_to_orig_index = []\n    orig_to_tok_index = []\n    all_doc_tokens = []\n    for (i, token) in enumerate(example.doc_tokens):\n        orig_to_tok_index.append(len(all_doc_tokens))\n        sub_tokens = tokenizer.tokenize(token)\n        for sub_token in sub_tokens:\n            tok_to_orig_index.append(i)\n            all_doc_tokens.append(sub_token)\n\n    if is_training and not example.is_impossible:\n        tok_start_position = orig_to_tok_index[example.start_position]\n        if example.end_position < len(example.doc_tokens) - 1:\n            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1\n        else:\n            tok_end_position = len(all_doc_tokens) - 1\n\n        (tok_start_position, tok_end_position) = _improve_answer_span(\n            all_doc_tokens, tok_start_position, tok_end_position, tokenizer, example.answer_text\n        )\n\n    spans = []\n\n    truncated_query = tokenizer.encode(example.question_text, add_special_tokens=False, max_length=max_query_length)\n    sequence_added_tokens = (\n        tokenizer.max_len - tokenizer.max_len_single_sentence + 1\n        if \"roberta\" in str(type(tokenizer)) or \"camembert\" in str(type(tokenizer))\n        else tokenizer.max_len - tokenizer.max_len_single_sentence\n    )\n    sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair\n\n    span_doc_tokens = all_doc_tokens\n    while len(spans) * doc_stride < len(all_doc_tokens):\n\n        encoded_dict = tokenizer.encode_plus(\n            truncated_query if tokenizer.padding_side == \"right\" else span_doc_tokens,\n            span_doc_tokens if tokenizer.padding_side == \"right\" else truncated_query,\n            max_length=max_seq_length,\n            return_overflowing_tokens=True,\n            pad_to_max_length=True,\n            stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,\n            truncation_strategy=\"only_second\" if tokenizer.padding_side == \"right\" else \"only_first\",\n            return_token_type_ids=True,\n        )\n\n        paragraph_len = min(\n            len(all_doc_tokens) - len(spans) * doc_stride,\n            max_seq_length - len(truncated_query) - sequence_pair_added_tokens,\n        )\n\n        if tokenizer.pad_token_id in encoded_dict[\"input_ids\"]:\n            if tokenizer.padding_side == \"right\":\n                non_padded_ids = encoded_dict[\"input_ids\"][: encoded_dict[\"input_ids\"].index(tokenizer.pad_token_id)]\n            else:\n                last_padding_id_position = (\n                    len(encoded_dict[\"input_ids\"]) - 1 - encoded_dict[\"input_ids\"][::-1].index(tokenizer.pad_token_id)\n                )\n                non_padded_ids = encoded_dict[\"input_ids\"][last_padding_id_position + 1 :]\n\n        else:\n            non_padded_ids = encoded_dict[\"input_ids\"]\n\n        tokens = tokenizer.convert_ids_to_tokens(non_padded_ids)\n\n        token_to_orig_map = {}\n        for i in range(paragraph_len):\n            index = len(truncated_query) + sequence_added_tokens + i if tokenizer.padding_side == \"right\" else i\n            token_to_orig_map[index] = tok_to_orig_index[len(spans) * doc_stride + i]\n\n        encoded_dict[\"paragraph_len\"] = paragraph_len\n        encoded_dict[\"tokens\"] = tokens\n        encoded_dict[\"token_to_orig_map\"] = token_to_orig_map\n        
encoded_dict[\"truncated_query_with_special_tokens_length\"] = len(truncated_query) + sequence_added_tokens\n        encoded_dict[\"token_is_max_context\"] = {}\n        encoded_dict[\"start\"] = len(spans) * doc_stride\n        encoded_dict[\"length\"] = paragraph_len\n\n        spans.append(encoded_dict)\n\n        if \"overflowing_tokens\" not in encoded_dict:\n            break\n        span_doc_tokens = encoded_dict[\"overflowing_tokens\"]\n\n    for doc_span_index in range(len(spans)):\n        for j in range(spans[doc_span_index][\"paragraph_len\"]):\n            is_max_context = _new_check_is_max_context(spans, doc_span_index, doc_span_index * doc_stride + j)\n            index = (\n                j\n                if tokenizer.padding_side == \"left\"\n                else spans[doc_span_index][\"truncated_query_with_special_tokens_length\"] + j\n            )\n            spans[doc_span_index][\"token_is_max_context\"][index] = is_max_context\n\n    for span in spans:\n        # Identify the position of the CLS token\n        cls_index = span[\"input_ids\"].index(tokenizer.cls_token_id)\n\n        # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)\n        # Original TF implem also keep the classification token (set to 0)\n        p_mask = np.ones_like(span[\"token_type_ids\"])\n        if tokenizer.padding_side == \"right\":\n            p_mask[len(truncated_query) + sequence_added_tokens :] = 0\n        else:\n            p_mask[-len(span[\"tokens\"]) : -(len(truncated_query) + sequence_added_tokens)] = 0\n\n        pad_token_indices = np.where(span[\"input_ids\"] == tokenizer.pad_token_id)\n        special_token_indices = np.asarray(\n            tokenizer.get_special_tokens_mask(span[\"input_ids\"], already_has_special_tokens=True)\n        ).nonzero()\n\n        p_mask[pad_token_indices] = 1\n        p_mask[special_token_indices] = 1\n\n        # Set the cls index to 0: the CLS index can be used for impossible answers\n        p_mask[cls_index] = 0\n\n        span_is_impossible = example.is_impossible\n        start_position = 0\n        end_position = 0\n        if is_training and not span_is_impossible:\n            # For training, if our document chunk does not contain an annotation\n            # we throw it out, since there is nothing to predict.\n            doc_start = span[\"start\"]\n            doc_end = span[\"start\"] + span[\"length\"] - 1\n            out_of_span = False\n\n            if not (tok_start_position >= doc_start and tok_end_position <= doc_end):\n                out_of_span = True\n\n            if out_of_span:\n                start_position = cls_index\n                end_position = cls_index\n                span_is_impossible = True\n            else:\n                if tokenizer.padding_side == \"left\":\n                    doc_offset = 0\n                else:\n                    doc_offset = len(truncated_query) + sequence_added_tokens\n\n                start_position = tok_start_position - doc_start + doc_offset\n                end_position = tok_end_position - doc_start + doc_offset\n\n        features.append(\n            SquadFeatures(\n                span[\"input_ids\"],\n                span[\"attention_mask\"],\n                span[\"token_type_ids\"],\n                cls_index,\n                p_mask.tolist(),\n                example_index=0,  # Can not set unique_id and example_index here. 
They will be set after multiple processing.\n                unique_id=0,\n                paragraph_len=span[\"paragraph_len\"],\n                token_is_max_context=span[\"token_is_max_context\"],\n                tokens=span[\"tokens\"],\n                token_to_orig_map=span[\"token_to_orig_map\"],\n                start_position=start_position,\n                end_position=end_position,\n                is_impossible=span_is_impossible,\n                qas_id=example.qas_id,\n            )\n        )\n    return features\n\n\ndef squad_convert_example_to_features_init(tokenizer_for_convert):\n    global tokenizer\n    tokenizer = tokenizer_for_convert\n\n\ndef squad_convert_examples_to_features(\n    examples,\n    tokenizer,\n    max_seq_length,\n    doc_stride,\n    max_query_length,\n    is_training,\n    return_dataset=False,\n    threads=1,\n    tqdm_enabled=True,\n):\n    \"\"\"\n    Converts a list of examples into a list of features that can be directly given as input to a model.\n    It is model-dependant and takes advantage of many of the tokenizer's features to create the model's inputs.\n\n    Args:\n        examples: list of :class:`~transformers1.data.processors.squad.SquadExample`\n        tokenizer: an instance of a child of :class:`~transformers1.PreTrainedTokenizer`\n        max_seq_length: The maximum sequence length of the inputs.\n        doc_stride: The stride used when the context is too large and is split across several features.\n        max_query_length: The maximum length of the query.\n        is_training: whether to create features for model evaluation or model training.\n        return_dataset: Default False. Either 'pt' or 'tf'.\n            if 'pt': returns a torch.data.TensorDataset,\n            if 'tf': returns a tf.data.Dataset\n        threads: multiple processing threadsa-smi\n\n\n    Returns:\n        list of :class:`~transformers1.data.processors.squad.SquadFeatures`\n\n    Example::\n\n        processor = SquadV2Processor()\n        examples = processor.get_dev_examples(data_dir)\n\n        features = squad_convert_examples_to_features(\n            examples=examples,\n            tokenizer=tokenizer,\n            max_seq_length=args.max_seq_length,\n            doc_stride=args.doc_stride,\n            max_query_length=args.max_query_length,\n            is_training=not evaluate,\n        )\n    \"\"\"\n\n    # Defining helper methods\n    features = []\n    threads = min(threads, cpu_count())\n    with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:\n        annotate_ = partial(\n            squad_convert_example_to_features,\n            max_seq_length=max_seq_length,\n            doc_stride=doc_stride,\n            max_query_length=max_query_length,\n            is_training=is_training,\n        )\n        features = list(\n            tqdm(\n                p.imap(annotate_, examples, chunksize=32),\n                total=len(examples),\n                desc=\"convert squad examples to features\",\n                disable=not tqdm_enabled,\n            )\n        )\n    new_features = []\n    unique_id = 1000000000\n    example_index = 0\n    for example_features in tqdm(\n        features, total=len(features), desc=\"add example index and unique id\", disable=not tqdm_enabled\n    ):\n        if not example_features:\n            continue\n        for example_feature in example_features:\n            example_feature.example_index = example_index\n            example_feature.unique_id = 
unique_id\n            new_features.append(example_feature)\n            unique_id += 1\n        example_index += 1\n    features = new_features\n    del new_features\n    if return_dataset == \"pt\":\n        if not is_torch_available():\n            raise RuntimeError(\"PyTorch must be installed to return a PyTorch dataset.\")\n\n        # Convert to Tensors and build dataset\n        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n        all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n        all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)\n        all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)\n        all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)\n        all_is_impossible = torch.tensor([f.is_impossible for f in features], dtype=torch.float)\n\n        if not is_training:\n            all_feature_index = torch.arange(all_input_ids.size(0), dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids, all_attention_masks, all_token_type_ids, all_feature_index, all_cls_index, all_p_mask\n            )\n        else:\n            all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)\n            all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)\n            dataset = TensorDataset(\n                all_input_ids,\n                all_attention_masks,\n                all_token_type_ids,\n                all_start_positions,\n                all_end_positions,\n                all_cls_index,\n                all_p_mask,\n                all_is_impossible,\n            )\n\n        return features, dataset\n    elif return_dataset == \"tf\":\n        if not is_tf_available():\n            raise RuntimeError(\"TensorFlow must be installed to return a TensorFlow dataset.\")\n\n        def gen():\n            for i, ex in enumerate(features):\n                yield (\n                    {\n                        \"input_ids\": ex.input_ids,\n                        \"attention_mask\": ex.attention_mask,\n                        \"token_type_ids\": ex.token_type_ids,\n                        \"feature_index\": i,\n                        \"qas_id\": ex.qas_id,\n                    },\n                    {\n                        \"start_position\": ex.start_position,\n                        \"end_position\": ex.end_position,\n                        \"cls_index\": ex.cls_index,\n                        \"p_mask\": ex.p_mask,\n                        \"is_impossible\": ex.is_impossible,\n                    },\n                )\n\n        # Why have we split the batch into a tuple? 
PyTorch just has a list of tensors.\n        train_types = (\n            {\n                \"input_ids\": tf.int32,\n                \"attention_mask\": tf.int32,\n                \"token_type_ids\": tf.int32,\n                \"feature_index\": tf.int64,\n                \"qas_id\": tf.string,\n            },\n            {\n                \"start_position\": tf.int64,\n                \"end_position\": tf.int64,\n                \"cls_index\": tf.int64,\n                \"p_mask\": tf.int32,\n                \"is_impossible\": tf.int32,\n            },\n        )\n\n        train_shapes = (\n            {\n                \"input_ids\": tf.TensorShape([None]),\n                \"attention_mask\": tf.TensorShape([None]),\n                \"token_type_ids\": tf.TensorShape([None]),\n                \"feature_index\": tf.TensorShape([]),\n                \"qas_id\": tf.TensorShape([]),\n            },\n            {\n                \"start_position\": tf.TensorShape([]),\n                \"end_position\": tf.TensorShape([]),\n                \"cls_index\": tf.TensorShape([]),\n                \"p_mask\": tf.TensorShape([None]),\n                \"is_impossible\": tf.TensorShape([]),\n            },\n        )\n\n        return tf.data.Dataset.from_generator(gen, train_types, train_shapes)\n    else:\n        return features\n\n\nclass SquadProcessor(DataProcessor):\n    \"\"\"\n    Processor for the SQuAD data set.\n    Overriden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.\n    \"\"\"\n\n    train_file = None\n    dev_file = None\n\n    def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False):\n        if not evaluate:\n            answer = tensor_dict[\"answers\"][\"text\"][0].numpy().decode(\"utf-8\")\n            answer_start = tensor_dict[\"answers\"][\"answer_start\"][0].numpy()\n            answers = []\n        else:\n            answers = [\n                {\"answer_start\": start.numpy(), \"text\": text.numpy().decode(\"utf-8\")}\n                for start, text in zip(tensor_dict[\"answers\"][\"answer_start\"], tensor_dict[\"answers\"][\"text\"])\n            ]\n\n            answer = None\n            answer_start = None\n\n        return SquadExample(\n            qas_id=tensor_dict[\"id\"].numpy().decode(\"utf-8\"),\n            question_text=tensor_dict[\"question\"].numpy().decode(\"utf-8\"),\n            context_text=tensor_dict[\"context\"].numpy().decode(\"utf-8\"),\n            answer_text=answer,\n            start_position_character=answer_start,\n            title=tensor_dict[\"title\"].numpy().decode(\"utf-8\"),\n            answers=answers,\n        )\n\n    def get_examples_from_dataset(self, dataset, evaluate=False):\n        \"\"\"\n        Creates a list of :class:`~transformers1.data.processors.squad.SquadExample` using a TFDS dataset.\n\n        Args:\n            dataset: The tfds dataset loaded from `tensorflow_datasets.load(\"squad\")`\n            evaluate: boolean specifying if in evaluation mode or in training mode\n\n        Returns:\n            List of SquadExample\n\n        Examples::\n\n            import tensorflow_datasets as tfds\n            dataset = tfds.load(\"squad\")\n\n            training_examples = get_examples_from_dataset(dataset, evaluate=False)\n            evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)\n        \"\"\"\n\n        if evaluate:\n            dataset = dataset[\"validation\"]\n        else:\n            dataset = 
dataset[\"train\"]\n\n        examples = []\n        for tensor_dict in tqdm(dataset):\n            examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))\n\n        return examples\n\n    def get_train_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the training examples from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the training file has a different name than the original one\n                which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.train_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.train_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"train\")\n\n    def get_dev_examples(self, data_dir, filename=None):\n        \"\"\"\n        Returns the evaluation example from the data directory.\n\n        Args:\n            data_dir: Directory containing the data files used for training and evaluating.\n            filename: None by default, specify this if the evaluation file has a different name than the original one\n                which is `train-v1.1.json` and `train-v2.0.json` for squad versions 1.1 and 2.0 respectively.\n        \"\"\"\n        if data_dir is None:\n            data_dir = \"\"\n\n        if self.dev_file is None:\n            raise ValueError(\"SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor\")\n\n        with open(\n            os.path.join(data_dir, self.dev_file if filename is None else filename), \"r\", encoding=\"utf-8\"\n        ) as reader:\n            input_data = json.load(reader)[\"data\"]\n        return self._create_examples(input_data, \"dev\")\n\n    def _create_examples(self, input_data, set_type):\n        is_training = set_type == \"train\"\n        examples = []\n        for entry in tqdm(input_data):\n            title = entry[\"title\"]\n            for paragraph in entry[\"paragraphs\"]:\n                context_text = paragraph[\"context\"]\n                for qa in paragraph[\"qas\"]:\n                    qas_id = qa[\"id\"]\n                    question_text = qa[\"question\"]\n                    start_position_character = None\n                    answer_text = None\n                    answers = []\n\n                    if \"is_impossible\" in qa:\n                        is_impossible = qa[\"is_impossible\"]\n                    else:\n                        is_impossible = False\n\n                    if not is_impossible:\n                        if is_training:\n                            answer = qa[\"answers\"][0]\n                            answer_text = answer[\"text\"]\n                            start_position_character = answer[\"answer_start\"]\n                        else:\n                            answers = qa[\"answers\"]\n\n                    example = SquadExample(\n                        qas_id=qas_id,\n                        question_text=question_text,\n                        context_text=context_text,\n                        answer_text=answer_text,\n   
                     start_position_character=start_position_character,\n                        title=title,\n                        is_impossible=is_impossible,\n                        answers=answers,\n                    )\n\n                    examples.append(example)\n        return examples\n\n\nclass SquadV1Processor(SquadProcessor):\n    train_file = \"train-v1.1.json\"\n    dev_file = \"dev-v1.1.json\"\n\n\nclass SquadV2Processor(SquadProcessor):\n    train_file = \"train-v2.0.json\"\n    dev_file = \"dev-v2.0.json\"\n\n\nclass SquadExample(object):\n    \"\"\"\n    A single training/test example for the Squad dataset, as loaded from disk.\n\n    Args:\n        qas_id: The example's unique identifier\n        question_text: The question string\n        context_text: The context string\n        answer_text: The answer string\n        start_position_character: The character position of the start of the answer\n        title: The title of the example\n        answers: None by default, this is used during evaluation. Holds answers as well as their start positions.\n        is_impossible: False by default, set to True if the example has no possible answer.\n    \"\"\"\n\n    def __init__(\n        self,\n        qas_id,\n        question_text,\n        context_text,\n        answer_text,\n        start_position_character,\n        title,\n        answers=[],\n        is_impossible=False,\n    ):\n        self.qas_id = qas_id\n        self.question_text = question_text\n        self.context_text = context_text\n        self.answer_text = answer_text\n        self.title = title\n        self.is_impossible = is_impossible\n        self.answers = answers\n\n        self.start_position, self.end_position = 0, 0\n\n        doc_tokens = []\n        char_to_word_offset = []\n        prev_is_whitespace = True\n\n        # Split on whitespace so that different tokens may be attributed to their original position.\n        for c in self.context_text:\n            if _is_whitespace(c):\n                prev_is_whitespace = True\n            else:\n                if prev_is_whitespace:\n                    doc_tokens.append(c)\n                else:\n                    doc_tokens[-1] += c\n                prev_is_whitespace = False\n            char_to_word_offset.append(len(doc_tokens) - 1)\n\n        self.doc_tokens = doc_tokens\n        self.char_to_word_offset = char_to_word_offset\n\n        # Start and end positions only has a value during evaluation.\n        if start_position_character is not None and not is_impossible:\n            self.start_position = char_to_word_offset[start_position_character]\n            self.end_position = char_to_word_offset[\n                min(start_position_character + len(answer_text) - 1, len(char_to_word_offset) - 1)\n            ]\n\n\nclass SquadFeatures(object):\n    \"\"\"\n    Single squad example features to be fed to a model.\n    Those features are model-specific and can be crafted from :class:`~transformers1.data.processors.squad.SquadExample`\n    using the :method:`~transformers1.data.processors.squad.squad_convert_examples_to_features` method.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n        token_type_ids: Segment token indices to indicate first and second portions of the inputs.\n        cls_index: the index of the CLS token.\n        p_mask: Mask identifying tokens that can be answers vs. 
tokens that cannot.\n            Mask with 1 for tokens than cannot be in the answer and 0 for token that can be in an answer\n        example_index: the index of the example\n        unique_id: The unique Feature identifier\n        paragraph_len: The length of the context\n        token_is_max_context: List of booleans identifying which tokens have their maximum context in this feature object.\n            If a token does not have their maximum context in this feature object, it means that another feature object\n            has more information related to that token and should be prioritized over this feature for that token.\n        tokens: list of tokens corresponding to the input ids\n        token_to_orig_map: mapping between the tokens and the original text, needed in order to identify the answer.\n        start_position: start of the answer token index\n        end_position: end of the answer token index\n    \"\"\"\n\n    def __init__(\n        self,\n        input_ids,\n        attention_mask,\n        token_type_ids,\n        cls_index,\n        p_mask,\n        example_index,\n        unique_id,\n        paragraph_len,\n        token_is_max_context,\n        tokens,\n        token_to_orig_map,\n        start_position,\n        end_position,\n        is_impossible,\n        qas_id: str = None,\n    ):\n        self.input_ids = input_ids\n        self.attention_mask = attention_mask\n        self.token_type_ids = token_type_ids\n        self.cls_index = cls_index\n        self.p_mask = p_mask\n\n        self.example_index = example_index\n        self.unique_id = unique_id\n        self.paragraph_len = paragraph_len\n        self.token_is_max_context = token_is_max_context\n        self.tokens = tokens\n        self.token_to_orig_map = token_to_orig_map\n\n        self.start_position = start_position\n        self.end_position = end_position\n        self.is_impossible = is_impossible\n        self.qas_id = qas_id\n\n\nclass SquadResult(object):\n    \"\"\"\n    Constructs a SquadResult which can be used to evaluate a model's output on the SQuAD dataset.\n\n    Args:\n        unique_id: The unique identifier corresponding to that example.\n        start_logits: The logits corresponding to the start of the answer\n        end_logits: The logits corresponding to the end of the answer\n    \"\"\"\n\n    def __init__(self, unique_id, start_logits, end_logits, start_top_index=None, end_top_index=None, cls_logits=None):\n        self.start_logits = start_logits\n        self.end_logits = end_logits\n        self.unique_id = unique_id\n\n        if start_top_index:\n            self.start_top_index = start_top_index\n            self.end_top_index = end_top_index\n            self.cls_logits = cls_logits\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/processors/utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport csv\nimport dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass\nfrom typing import List, Optional, Union\n\nfrom ...file_utils import is_tf_available, is_torch_available\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass InputExample:\n    \"\"\"\n    A single training/test example for simple sequence classification.\n\n    Args:\n        guid: Unique id for the example.\n        text_a: string. The untokenized text of the first sequence. For single\n            sequence tasks, only this sequence must be specified.\n        text_b: (Optional) string. The untokenized text of the second sequence.\n            Only must be specified for sequence pair tasks.\n        label: (Optional) string. The label of the example. This should be\n            specified for train and dev examples, but not for test examples.\n    \"\"\"\n\n    guid: str\n    text_a: str\n    text_b: Optional[str] = None\n    label: Optional[str] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2) + \"\\n\"\n\n\n@dataclass(frozen=True)\nclass InputFeatures:\n    \"\"\"\n    A single set of features of data.\n    Property names are the same names as the corresponding inputs to a model.\n\n    Args:\n        input_ids: Indices of input sequence tokens in the vocabulary.\n        attention_mask: Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            Usually  ``1`` for tokens that are NOT MASKED, ``0`` for MASKED (padded) tokens.\n        token_type_ids: (Optional) Segment token indices to indicate first and second\n            portions of the inputs. Only some models use them.\n        label: (Optional) Label corresponding to the input. 
Int for classification problems,\n            float for regression problems.\n    \"\"\"\n\n    input_ids: List[int]\n    attention_mask: Optional[List[int]] = None\n    token_type_ids: Optional[List[int]] = None\n    label: Optional[Union[int, float]] = None\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(dataclasses.asdict(self)) + \"\\n\"\n\n\nclass DataProcessor:\n    \"\"\"Base class for data converters for sequence classification data sets.\"\"\"\n\n    def get_example_from_tensor_dict(self, tensor_dict):\n        \"\"\"Gets an example from a dict with tensorflow tensors\n        Args:\n            tensor_dict: Keys and values should match the corresponding Glue\n                tensorflow_dataset examples.\n        \"\"\"\n        raise NotImplementedError()\n\n    def get_train_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the train set.\"\"\"\n        raise NotImplementedError()\n\n    def get_dev_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the dev set.\"\"\"\n        raise NotImplementedError()\n\n    def get_test_examples(self, data_dir):\n        \"\"\"Gets a collection of `InputExample`s for the test set.\"\"\"\n        raise NotImplementedError()\n\n    def get_labels(self):\n        \"\"\"Gets the list of labels for this data set.\"\"\"\n        raise NotImplementedError()\n\n    def tfds_map(self, example):\n        \"\"\"Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are.\n        This method converts examples to the correct format.\"\"\"\n        if len(self.get_labels()) > 1:\n            example.label = self.get_labels()[int(example.label)]\n        return example\n\n    @classmethod\n    def _read_tsv(cls, input_file, quotechar=None):\n        \"\"\"Reads a tab separated value file.\"\"\"\n        with open(input_file, \"r\", encoding=\"utf-8-sig\") as f:\n            return list(csv.reader(f, delimiter=\"\\t\", quotechar=quotechar))\n\n\nclass SingleSentenceClassificationProcessor(DataProcessor):\n    \"\"\" Generic processor for a single sentence classification data set.\"\"\"\n\n    def __init__(self, labels=None, examples=None, mode=\"classification\", verbose=False):\n        self.labels = [] if labels is None else labels\n        self.examples = [] if examples is None else examples\n        self.mode = mode\n        self.verbose = verbose\n\n    def __len__(self):\n        return len(self.examples)\n\n    def __getitem__(self, idx):\n        if isinstance(idx, slice):\n            return SingleSentenceClassificationProcessor(labels=self.labels, examples=self.examples[idx])\n        return self.examples[idx]\n\n    @classmethod\n    def create_from_csv(\n        cls, file_name, split_name=\"\", column_label=0, column_text=1, column_id=None, skip_first_row=False, **kwargs\n    ):\n        processor = cls(**kwargs)\n        processor.add_examples_from_csv(\n            file_name,\n            split_name=split_name,\n            column_label=column_label,\n            column_text=column_text,\n            column_id=column_id,\n            skip_first_row=skip_first_row,\n            overwrite_labels=True,\n            overwrite_examples=True,\n        )\n        return processor\n\n    @classmethod\n    def create_from_examples(cls, texts_or_text_and_labels, labels=None, **kwargs):\n        processor = cls(**kwargs)\n        processor.add_examples(texts_or_text_and_labels, 
labels=labels)\n        return processor\n\n    def add_examples_from_csv(\n        self,\n        file_name,\n        split_name=\"\",\n        column_label=0,\n        column_text=1,\n        column_id=None,\n        skip_first_row=False,\n        overwrite_labels=False,\n        overwrite_examples=False,\n    ):\n        lines = self._read_tsv(file_name)\n        if skip_first_row:\n            lines = lines[1:]\n        texts = []\n        labels = []\n        ids = []\n        for (i, line) in enumerate(lines):\n            texts.append(line[column_text])\n            labels.append(line[column_label])\n            if column_id is not None:\n                ids.append(line[column_id])\n            else:\n                guid = \"%s-%s\" % (split_name, i) if split_name else \"%s\" % i\n                ids.append(guid)\n\n        return self.add_examples(\n            texts, labels, ids, overwrite_labels=overwrite_labels, overwrite_examples=overwrite_examples\n        )\n\n    def add_examples(\n        self, texts_or_text_and_labels, labels=None, ids=None, overwrite_labels=False, overwrite_examples=False\n    ):\n        assert labels is None or len(texts_or_text_and_labels) == len(labels)\n        assert ids is None or len(texts_or_text_and_labels) == len(ids)\n        if ids is None:\n            ids = [None] * len(texts_or_text_and_labels)\n        if labels is None:\n            labels = [None] * len(texts_or_text_and_labels)\n        examples = []\n        added_labels = set()\n        for (text_or_text_and_label, label, guid) in zip(texts_or_text_and_labels, labels, ids):\n            if isinstance(text_or_text_and_label, (tuple, list)) and label is None:\n                text, label = text_or_text_and_label\n            else:\n                text = text_or_text_and_label\n            added_labels.add(label)\n            examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))\n\n        # Update examples\n        if overwrite_examples:\n            self.examples = examples\n        else:\n            self.examples.extend(examples)\n\n        # Update labels\n        if overwrite_labels:\n            self.labels = list(added_labels)\n        else:\n            self.labels = list(set(self.labels).union(added_labels))\n\n        return self.examples\n\n    def get_features(\n        self,\n        tokenizer,\n        max_length=None,\n        pad_on_left=False,\n        pad_token=0,\n        mask_padding_with_zero=True,\n        return_tensors=None,\n    ):\n        \"\"\"\n        Convert examples in a list of ``InputFeatures``\n\n        Args:\n            tokenizer: Instance of a tokenizer that will tokenize the examples\n            max_length: Maximum example length\n            task: GLUE task\n            label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method\n            output_mode: String indicating the output mode. Either ``regression`` or ``classification``\n            pad_on_left: If set to ``True``, the examples will be padded on the left rather than on the right (default)\n            pad_token: Padding token\n            mask_padding_with_zero: If set to ``True``, the attention mask will be filled by ``1`` for actual values\n                and by ``0`` for padded values. 
If set to ``False``, inverts it (``1`` for padded values, ``0`` for\n                actual values)\n\n        Returns:\n            If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``\n            containing the task-specific features. If the input is a list of ``InputExamples``, will return\n            a list of task-specific ``InputFeatures`` which can be fed to the model.\n\n        \"\"\"\n        if max_length is None:\n            max_length = tokenizer.max_len\n\n        label_map = {label: i for i, label in enumerate(self.labels)}\n\n        all_input_ids = []\n        for (ex_index, example) in enumerate(self.examples):\n            if ex_index % 10000 == 0:\n                logger.info(\"Tokenizing example %d\", ex_index)\n\n            input_ids = tokenizer.encode(\n                example.text_a, add_special_tokens=True, max_length=min(max_length, tokenizer.max_len),\n            )\n            all_input_ids.append(input_ids)\n\n        batch_length = max(len(input_ids) for input_ids in all_input_ids)\n\n        features = []\n        for (ex_index, (input_ids, example)) in enumerate(zip(all_input_ids, self.examples)):\n            if ex_index % 10000 == 0:\n                logger.info(\"Writing example %d/%d\" % (ex_index, len(self.examples)))\n            # The mask has 1 for real tokens and 0 for padding tokens. Only real\n            # tokens are attended to.\n            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)\n\n            # Zero-pad up to the sequence length.\n            padding_length = batch_length - len(input_ids)\n            if pad_on_left:\n                input_ids = ([pad_token] * padding_length) + input_ids\n                attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask\n            else:\n                input_ids = input_ids + ([pad_token] * padding_length)\n                attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)\n\n            assert len(input_ids) == batch_length, \"Error with input length {} vs {}\".format(\n                len(input_ids), batch_length\n            )\n            assert len(attention_mask) == batch_length, \"Error with input length {} vs {}\".format(\n                len(attention_mask), batch_length\n            )\n\n            if self.mode == \"classification\":\n                label = label_map[example.label]\n            elif self.mode == \"regression\":\n                label = float(example.label)\n            else:\n                raise ValueError(self.mode)\n\n            if ex_index < 5 and self.verbose:\n                logger.info(\"*** Example ***\")\n                logger.info(\"guid: %s\" % (example.guid))\n                logger.info(\"input_ids: %s\" % \" \".join([str(x) for x in input_ids]))\n                logger.info(\"attention_mask: %s\" % \" \".join([str(x) for x in attention_mask]))\n                logger.info(\"label: %s (id = %d)\" % (example.label, label))\n\n            features.append(InputFeatures(input_ids=input_ids, attention_mask=attention_mask, label=label))\n\n        if return_tensors is None:\n            return features\n        elif return_tensors == \"tf\":\n            if not is_tf_available():\n                raise RuntimeError(\"return_tensors set to 'tf' but TensorFlow 2.0 can't be imported\")\n            import tensorflow as tf\n\n            def gen():\n                for ex in features:\n                    yield 
({\"input_ids\": ex.input_ids, \"attention_mask\": ex.attention_mask}, ex.label)\n\n            dataset = tf.data.Dataset.from_generator(\n                gen,\n                ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32}, tf.int64),\n                ({\"input_ids\": tf.TensorShape([None]), \"attention_mask\": tf.TensorShape([None])}, tf.TensorShape([])),\n            )\n            return dataset\n        elif return_tensors == \"pt\":\n            if not is_torch_available():\n                raise RuntimeError(\"return_tensors set to 'pt' but PyTorch can't be imported\")\n            import torch\n            from torch.utils.data import TensorDataset\n\n            all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)\n            all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)\n            if self.mode == \"classification\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.long)\n            elif self.mode == \"regression\":\n                all_labels = torch.tensor([f.label for f in features], dtype=torch.float)\n\n            dataset = TensorDataset(all_input_ids, all_attention_mask, all_labels)\n            return dataset\n        else:\n            raise ValueError(\"return_tensors should be one of 'tf' or 'pt'\")\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/data/processors/xnli.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" XNLI utils (dataset loading and evaluation) \"\"\"\n\n\nimport logging\nimport os\n\nfrom .utils import DataProcessor, InputExample\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass XnliProcessor(DataProcessor):\n    \"\"\"Processor for the XNLI dataset.\n    Adapted from https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/run_classifier.py#L207\"\"\"\n\n    def __init__(self, language, train_language=None):\n        self.language = language\n        self.train_language = train_language\n\n    def get_train_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lg = self.language if self.train_language is None else self.train_language\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-MT-1.0/multinli/multinli.train.{}.tsv\".format(lg)))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            guid = \"%s-%s\" % (\"train\", i)\n            text_a = line[0]\n            text_b = line[1]\n            label = \"contradiction\" if line[2] == \"contradictory\" else line[2]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_test_examples(self, data_dir):\n        \"\"\"See base class.\"\"\"\n        lines = self._read_tsv(os.path.join(data_dir, \"XNLI-1.0/xnli.test.tsv\"))\n        examples = []\n        for (i, line) in enumerate(lines):\n            if i == 0:\n                continue\n            language = line[0]\n            if language != self.language:\n                continue\n            guid = \"%s-%s\" % (\"test\", i)\n            text_a = line[6]\n            text_b = line[7]\n            label = line[1]\n            assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)\n            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))\n        return examples\n\n    def get_labels(self):\n        \"\"\"See base class.\"\"\"\n        return [\"contradiction\", \"entailment\", \"neutral\"]\n\n\nxnli_processors = {\n    \"xnli\": XnliProcessor,\n}\n\nxnli_output_modes = {\n    \"xnli\": \"classification\",\n}\n\nxnli_tasks_num_labels = {\n    \"xnli\": 3,\n}\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/file.py",
    "content": ""
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/file_utils.py",
    "content": "\"\"\"\nUtilities for working with the local dataset cache.\nThis file is adapted from the AllenNLP library at https://github.com/allenai/allennlp\nCopyright by the AllenNLP authors.\n\"\"\"\n\nimport fnmatch\nimport json\nimport logging\nimport os\nimport shutil\nimport sys\nimport tarfile\nimport tempfile\nfrom contextlib import contextmanager\nfrom functools import partial, wraps\nfrom hashlib import sha256\nfrom pathlib import Path\nfrom typing import Optional\nfrom urllib.parse import urlparse\nfrom zipfile import ZipFile, is_zipfile\n\nimport requests\nfrom filelock import FileLock\nfrom tqdm.auto import tqdm\n\nfrom . import __version__\n\n\nlogger = logging.getLogger(__name__)  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n    if USE_TORCH in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TF not in (\"1\", \"ON\", \"YES\"):\n        import torch\n\n        _torch_available = True  # pylint: disable=invalid-name\n        logger.info(\"PyTorch version {} available.\".format(torch.__version__))\n    else:\n        logger.info(\"Disabling PyTorch because USE_TF is set\")\n        _torch_available = False\nexcept ImportError:\n    _torch_available = False  # pylint: disable=invalid-name\n\ntry:\n    USE_TF = os.environ.get(\"USE_TF\", \"AUTO\").upper()\n    USE_TORCH = os.environ.get(\"USE_TORCH\", \"AUTO\").upper()\n\n    if USE_TF in (\"1\", \"ON\", \"YES\", \"AUTO\") and USE_TORCH not in (\"1\", \"ON\", \"YES\"):\n        import tensorflow as tf\n\n        assert hasattr(tf, \"__version__\") and int(tf.__version__[0]) >= 2\n        _tf_available = True  # pylint: disable=invalid-name\n        logger.info(\"TensorFlow version {} available.\".format(tf.__version__))\n    else:\n        logger.info(\"Disabling Tensorflow because USE_TORCH is set\")\n        _tf_available = False\nexcept (ImportError, AssertionError):\n    _tf_available = False  # pylint: disable=invalid-name\n\n\ntry:\n    from torch.hub import _get_torch_home\n\n    torch_cache_home = _get_torch_home()\nexcept ImportError:\n    torch_cache_home = os.path.expanduser(\n        os.getenv(\"TORCH_HOME\", os.path.join(os.getenv(\"XDG_CACHE_HOME\", \"~/.cache\"), \"torch\"))\n    )\ndefault_cache_path = os.path.join(torch_cache_home, \"transformers1\")\n\n\nPYTORCH_PRETRAINED_BERT_CACHE = os.getenv(\"PYTORCH_PRETRAINED_BERT_CACHE\", default_cache_path)\nPYTORCH_TRANSFORMERS_CACHE = os.getenv(\"PYTORCH_TRANSFORMERS_CACHE\", PYTORCH_PRETRAINED_BERT_CACHE)\nTRANSFORMERS_CACHE = os.getenv(\"TRANSFORMERS_CACHE\", PYTORCH_TRANSFORMERS_CACHE)\n\nWEIGHTS_NAME = \"pytorch_model.bin\"\nTF2_WEIGHTS_NAME = \"tf_model.h5\"\nTF_WEIGHTS_NAME = \"model.ckpt\"\nCONFIG_NAME = \"config.json\"\nMODEL_CARD_NAME = \"modelcard.json\"\n\n\nMULTIPLE_CHOICE_DUMMY_INPUTS = [[[0], [1]], [[0], [1]]]\nDUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]\nDUMMY_MASK = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]\n\nS3_BUCKET_PREFIX = \"https://s3.amazonaws.com/models.huggingface.co/bert\"\nCLOUDFRONT_DISTRIB_PREFIX = \"https://cdn.huggingface.co\"\n\n\ndef is_torch_available():\n    return _torch_available\n\n\ndef is_tf_available():\n    return _tf_available\n\n\ndef add_start_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef 
add_start_docstrings_to_callable(*docstr):\n    def docstring_decorator(fn):\n        class_name = \":class:`~transformers1.{}`\".format(fn.__qualname__.split(\".\")[0])\n        intro = \"   The {} forward method, overrides the :func:`__call__` special method.\".format(class_name)\n        note = r\"\"\"\n\n    .. note::\n        Although the recipe for forward pass needs to be defined within\n        this function, one should call the :class:`Module` instance afterwards\n        instead of this since the former takes care of running the\n        pre and post processing steps while the latter silently ignores them.\n        \"\"\"\n        fn.__doc__ = intro + note + \"\".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else \"\")\n        return fn\n\n    return docstring_decorator\n\n\ndef add_end_docstrings(*docstr):\n    def docstring_decorator(fn):\n        fn.__doc__ = fn.__doc__ + \"\".join(docstr)\n        return fn\n\n    return docstring_decorator\n\n\ndef is_remote_url(url_or_filename):\n    parsed = urlparse(url_or_filename)\n    return parsed.scheme in (\"http\", \"https\")\n\n\ndef hf_bucket_url(model_id: str, filename: str, use_cdn=True) -> str:\n    \"\"\"\n    Resolve a model identifier, and a file name, to a HF-hosted url\n    on either S3 or Cloudfront (a Content Delivery Network, or CDN).\n\n    Cloudfront is replicated over the globe so downloads are way faster\n    for the end user (and it also lowers our bandwidth costs). However, it\n    is more aggressively cached by default, so may not always reflect the\n    latest changes to the underlying file (default TTL is 24 hours).\n\n    In terms of client-side caching from this library, even though\n    Cloudfront relays the ETags from S3, using one or the other\n    (or switching from one to the other) will affect caching: cached files\n    are not shared between the two because the cached file's name contains\n    a hash of the url.\n    \"\"\"\n    endpoint = CLOUDFRONT_DISTRIB_PREFIX if use_cdn else S3_BUCKET_PREFIX\n    legacy_format = \"/\" not in model_id\n    if legacy_format:\n        return f\"{endpoint}/{model_id}-{filename}\"\n    else:\n        return f\"{endpoint}/{model_id}/{filename}\"\n\n\ndef url_to_filename(url, etag=None):\n    \"\"\"\n    Convert `url` into a hashed filename in a repeatable way.\n    If `etag` is specified, append its hash to the url's, delimited\n    by a period.\n    If the url ends with .h5 (Keras HDF5 weights) adds '.h5' to the name\n    so that TF 2.0 can identify it as a HDF5 file\n    (see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1380)\n    \"\"\"\n    url_bytes = url.encode(\"utf-8\")\n    url_hash = sha256(url_bytes)\n    filename = url_hash.hexdigest()\n\n    if etag:\n        etag_bytes = etag.encode(\"utf-8\")\n        etag_hash = sha256(etag_bytes)\n        filename += \".\" + etag_hash.hexdigest()\n\n    if url.endswith(\".h5\"):\n        filename += \".h5\"\n\n    return filename\n\n\ndef filename_to_url(filename, cache_dir=None):\n    \"\"\"\n    Return the url and etag (which may be ``None``) stored for `filename`.\n    Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    cache_path = os.path.join(cache_dir, filename)\n    if not os.path.exists(cache_path):\n        raise 
EnvironmentError(\"file {} not found\".format(cache_path))\n\n    meta_path = cache_path + \".json\"\n    if not os.path.exists(meta_path):\n        raise EnvironmentError(\"file {} not found\".format(meta_path))\n\n    with open(meta_path, encoding=\"utf-8\") as meta_file:\n        metadata = json.load(meta_file)\n    url = metadata[\"url\"]\n    etag = metadata[\"etag\"]\n\n    return url, etag\n\n\ndef cached_path(\n    url_or_filename,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    resume_download=False,\n    user_agent=None,\n    extract_compressed_file=False,\n    force_extract=False,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given something that might be a URL (or might be a local path),\n    determine which. If it's a URL, download the file and cache it, and\n    return the path to the cached file. If it's already a local path,\n    make sure the file exists and then return the path.\n    Args:\n        cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).\n        force_download: if True, re-dowload the file even if it's already cached in the cache dir.\n        resume_download: if True, resume the download if incompletly recieved file is found.\n        user_agent: Optional string or dict that will be appended to the user-agent on remote requests.\n        extract_compressed_file: if True and the path point to a zip or tar file, extract the compressed\n            file in a folder along the archive.\n        force_extract: if True when extract_compressed_file is True and the archive was already extracted,\n            re-extract the archive and overide the folder where it was extracted.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(url_or_filename, Path):\n        url_or_filename = str(url_or_filename)\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    if is_remote_url(url_or_filename):\n        # URL, so get it from the cache (downloading if necessary)\n        output_path = get_from_cache(\n            url_or_filename,\n            cache_dir=cache_dir,\n            force_download=force_download,\n            proxies=proxies,\n            resume_download=resume_download,\n            user_agent=user_agent,\n            local_files_only=local_files_only,\n        )\n    elif os.path.exists(url_or_filename):\n        # File, and it exists.\n        output_path = url_or_filename\n    elif urlparse(url_or_filename).scheme == \"\":\n        # File, but it doesn't exist.\n        raise EnvironmentError(\"file {} not found\".format(url_or_filename))\n    else:\n        # Something unknown\n        raise ValueError(\"unable to parse {} as a URL or as a local path\".format(url_or_filename))\n\n    if extract_compressed_file:\n        if not is_zipfile(output_path) and not tarfile.is_tarfile(output_path):\n            return output_path\n\n        # Path where we extract compressed archives\n        # We avoid '.' 
in dir name and add \"-extracted\" at the end: \"./model.zip\" => \"./model-zip-extracted/\"\n        output_dir, output_file = os.path.split(output_path)\n        output_extract_dir_name = output_file.replace(\".\", \"-\") + \"-extracted\"\n        output_path_extracted = os.path.join(output_dir, output_extract_dir_name)\n\n        if os.path.isdir(output_path_extracted) and os.listdir(output_path_extracted) and not force_extract:\n            return output_path_extracted\n\n        # Prevent parallel extractions\n        lock_path = output_path + \".lock\"\n        with FileLock(lock_path):\n            shutil.rmtree(output_path_extracted, ignore_errors=True)\n            os.makedirs(output_path_extracted)\n            if is_zipfile(output_path):\n                with ZipFile(output_path, \"r\") as zip_file:\n                    zip_file.extractall(output_path_extracted)\n                    zip_file.close()\n            elif tarfile.is_tarfile(output_path):\n                tar_file = tarfile.open(output_path)\n                tar_file.extractall(output_path_extracted)\n                tar_file.close()\n            else:\n                raise EnvironmentError(\"Archive format of {} could not be identified\".format(output_path))\n\n        return output_path_extracted\n\n    return output_path\n\n\ndef http_get(url, temp_file, proxies=None, resume_size=0, user_agent=None):\n    ua = \"transformers1/{}; python/{}\".format(__version__, sys.version.split()[0])\n    if is_torch_available():\n        ua += \"; torch/{}\".format(torch.__version__)\n    if is_tf_available():\n        ua += \"; tensorflow/{}\".format(tf.__version__)\n    if isinstance(user_agent, dict):\n        ua += \"; \" + \"; \".join(\"{}/{}\".format(k, v) for k, v in user_agent.items())\n    elif isinstance(user_agent, str):\n        ua += \"; \" + user_agent\n    headers = {\"user-agent\": ua}\n    if resume_size > 0:\n        headers[\"Range\"] = \"bytes=%d-\" % (resume_size,)\n    response = requests.get(url, stream=True, proxies=proxies, headers=headers)\n    if response.status_code == 416:  # Range not satisfiable\n        return\n    content_length = response.headers.get(\"Content-Length\")\n    total = resume_size + int(content_length) if content_length is not None else None\n    progress = tqdm(\n        unit=\"B\",\n        unit_scale=True,\n        total=total,\n        initial=resume_size,\n        desc=\"Downloading\",\n        disable=bool(logger.getEffectiveLevel() == logging.NOTSET),\n    )\n    for chunk in response.iter_content(chunk_size=1024):\n        if chunk:  # filter out keep-alive new chunks\n            progress.update(len(chunk))\n            temp_file.write(chunk)\n    progress.close()\n\n\ndef get_from_cache(\n    url,\n    cache_dir=None,\n    force_download=False,\n    proxies=None,\n    etag_timeout=10,\n    resume_download=False,\n    user_agent=None,\n    local_files_only=False,\n) -> Optional[str]:\n    \"\"\"\n    Given a URL, look for the corresponding file in the local cache.\n    If it's not there, download it. 
Then return the path to the cached file.\n\n    Return:\n        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).\n        Local path (string) otherwise\n    \"\"\"\n    if cache_dir is None:\n        cache_dir = TRANSFORMERS_CACHE\n    if isinstance(cache_dir, Path):\n        cache_dir = str(cache_dir)\n\n    os.makedirs(cache_dir, exist_ok=True)\n\n    etag = None\n    if not local_files_only:\n        try:\n            response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)\n            if response.status_code == 200:\n                etag = response.headers.get(\"ETag\")\n        except (EnvironmentError, requests.exceptions.Timeout):\n            # etag is already None\n            pass\n\n    filename = url_to_filename(url, etag)\n\n    # get cache path to put the file\n    cache_path = os.path.join(cache_dir, filename)\n\n    # etag is None = we don't have a connection, or url doesn't exist, or is otherwise inaccessible.\n    # try to get the last downloaded one\n    if etag is None:\n        if os.path.exists(cache_path):\n            return cache_path\n        else:\n            matching_files = [\n                file\n                for file in fnmatch.filter(os.listdir(cache_dir), filename + \".*\")\n                if not file.endswith(\".json\") and not file.endswith(\".lock\")\n            ]\n            if len(matching_files) > 0:\n                return os.path.join(cache_dir, matching_files[-1])\n            else:\n                # If files cannot be found and local_files_only=True,\n                # the models might've been found if local_files_only=False\n                # Notify the user about that\n                if local_files_only:\n                    raise ValueError(\n                        \"Cannot find the requested files in the cached path and outgoing traffic has been\"\n                        \" disabled. 
To enable model look-ups and downloads online, set 'local_files_only'\"\n                        \" to False.\"\n                    )\n                return None\n\n    # From now on, etag is not None.\n    if os.path.exists(cache_path) and not force_download:\n        return cache_path\n\n    # Prevent parallel downloads of the same file with a lock.\n    lock_path = cache_path + \".lock\"\n    with FileLock(lock_path):\n\n        # If the download just completed while the lock was activated.\n        if os.path.exists(cache_path) and not force_download:\n            # Even if returning early like here, the lock will be released.\n            return cache_path\n\n        if resume_download:\n            incomplete_path = cache_path + \".incomplete\"\n\n            @contextmanager\n            def _resumable_file_manager():\n                with open(incomplete_path, \"a+b\") as f:\n                    yield f\n\n            temp_file_manager = _resumable_file_manager\n            if os.path.exists(incomplete_path):\n                resume_size = os.stat(incomplete_path).st_size\n            else:\n                resume_size = 0\n        else:\n            temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False)\n            resume_size = 0\n\n        # Download to temporary file, then copy to cache dir once finished.\n        # Otherwise you get corrupt cache entries if the download gets interrupted.\n        with temp_file_manager() as temp_file:\n            logger.info(\"%s not found in cache or force_download set to True, downloading to %s\", url, temp_file.name)\n\n            http_get(url, temp_file, proxies=proxies, resume_size=resume_size, user_agent=user_agent)\n\n        logger.info(\"storing %s in cache at %s\", url, cache_path)\n        os.replace(temp_file.name, cache_path)\n\n        logger.info(\"creating metadata file for %s\", cache_path)\n        meta = {\"url\": url, \"etag\": etag}\n        meta_path = cache_path + \".json\"\n        with open(meta_path, \"w\") as meta_file:\n            json.dump(meta, meta_file)\n\n    return cache_path\n\n\nclass cached_property(property):\n    \"\"\"\n    Descriptor that mimics @property but caches output in member variable.\n\n    From tensorflow_datasets\n\n    Built-in in functools from Python 3.8.\n    \"\"\"\n\n    def __get__(self, obj, objtype=None):\n        # See docs.python.org/3/howto/descriptor.html#properties\n        if obj is None:\n            return self\n        if self.fget is None:\n            raise AttributeError(\"unreadable attribute\")\n        attr = \"__cached_\" + self.fget.__name__\n        cached = getattr(obj, attr, None)\n        if cached is None:\n            cached = self.fget(obj)\n            setattr(obj, attr, cached)\n        return cached\n\n\ndef torch_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_torch_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires PyTorch.\")\n\n    return wrapper\n\n\ndef tf_required(func):\n    # Chose a different decorator name than in tests so it's clear they are not the same.\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        if is_tf_available():\n            return func(*args, **kwargs)\n        else:\n            raise ImportError(f\"Method `{func.__name__}` requires TF.\")\n\n    return wrapper\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/filep.py",
    "content": "from transformers import GPT2LMHeadModel, GPT2Tokenizer\nimport torch\n\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\nmodel = GPT2LMHeadModel.from_pretrained('gpt2')\n\ngenerated = tokenizer.encode(\"The Manhattan bridge\")\ncontext = torch.tensor([generated])\npast = None\n\nfor i in range(15):\n    output, past = model(context, past=past)\n\n    distribution = output[0, :]\n\n    # Get the top 10 values' indices and cast them to a list\n    top_values = distribution[-1].topk(10).indices.tolist()\n\n    # Decode those into words\n    top_words = [tokenizer.decode([x]) for x in top_values.indices.tolist()]\n\n    # select words (only arbitrarily select the first three)\n    words = words[0:3]\n\n    # Cast them back to tokens which can be used as an added token\n    selected_tokens = [tokenizer.encode(word) for word in words]\n\n    generated += [argmax_token.tolist()]\n    context = argmax_token.unsqueeze(0)\n\n    print(tokenizer.decode([argmax_token.tolist()]))\n\nsequence = tokenizer.decode(generated)\n\nprint(sequence)"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/hf_api.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport io\nimport os\nfrom os.path import expanduser\nfrom typing import Dict, List, Optional, Tuple\n\nimport requests\nfrom tqdm import tqdm\n\n\nENDPOINT = \"https://huggingface.co\"\n\n\nclass S3Obj:\n    \"\"\"\n    Data structure that represents a file belonging to the current user.\n    \"\"\"\n\n    def __init__(self, filename: str, LastModified: str, ETag: str, Size: int, **kwargs):\n        self.filename = filename\n        self.LastModified = LastModified\n        self.ETag = ETag\n        self.Size = Size\n\n\nclass PresignedUrl:\n    def __init__(self, write: str, access: str, type: str, **kwargs):\n        self.write = write\n        self.access = access\n        self.type = type  # mime-type to send to S3.\n\n\nclass S3Object:\n    \"\"\"\n    Data structure that represents a public file accessible on our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        key: str,  # S3 object key\n        etag: str,\n        lastModified: str,\n        size: int,\n        rfilename: str,  # filename relative to config.json\n        **kwargs\n    ):\n        self.key = key\n        self.etag = etag\n        self.lastModified = lastModified\n        self.size = size\n        self.rfilename = rfilename\n\n\nclass ModelInfo:\n    \"\"\"\n    Info about a public model accessible from our S3.\n    \"\"\"\n\n    def __init__(\n        self,\n        modelId: str,  # id of model\n        key: str,  # S3 object key of config.json\n        author: Optional[str] = None,\n        downloads: Optional[int] = None,\n        tags: List[str] = [],\n        siblings: List[Dict] = [],  # list of files that constitute the model\n        **kwargs\n    ):\n        self.modelId = modelId\n        self.key = key\n        self.author = author\n        self.downloads = downloads\n        self.tags = tags\n        self.siblings = [S3Object(**x) for x in siblings]\n\n\nclass HfApi:\n    def __init__(self, endpoint=None):\n        self.endpoint = endpoint if endpoint is not None else ENDPOINT\n\n    def login(self, username: str, password: str) -> str:\n        \"\"\"\n        Call HF API to sign in a user and get a token if credentials are valid.\n\n        Outputs:\n            token if credentials are valid\n\n        Throws:\n            requests.exceptions.HTTPError if credentials are invalid\n        \"\"\"\n        path = \"{}/api/login\".format(self.endpoint)\n        r = requests.post(path, json={\"username\": username, \"password\": password})\n        r.raise_for_status()\n        d = r.json()\n        return d[\"token\"]\n\n    def whoami(self, token: str) -> Tuple[str, List[str]]:\n        \"\"\"\n        Call HF API to know \"whoami\"\n        \"\"\"\n        path = \"{}/api/whoami\".format(self.endpoint)\n        r = requests.get(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        
return d[\"user\"], d[\"orgs\"]\n\n    def logout(self, token: str) -> None:\n        \"\"\"\n        Call HF API to log out.\n        \"\"\"\n        path = \"{}/api/logout\".format(self.endpoint)\n        r = requests.post(path, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n\n    def presign(self, token: str, filename: str, organization: Optional[str] = None) -> PresignedUrl:\n        \"\"\"\n        Call HF API to get a presigned url to upload `filename` to S3.\n        \"\"\"\n        path = \"{}/api/presign\".format(self.endpoint)\n        r = requests.post(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n        d = r.json()\n        return PresignedUrl(**d)\n\n    def presign_and_upload(self, token: str, filename: str, filepath: str, organization: Optional[str] = None) -> str:\n        \"\"\"\n        Get a presigned url, then upload file to S3.\n\n        Outputs:\n            url: Read-only url for the stored file on S3.\n        \"\"\"\n        urls = self.presign(token, filename=filename, organization=organization)\n        # streaming upload:\n        # https://2.python-requests.org/en/master/user/advanced/#streaming-uploads\n        #\n        # Even though we presign with the correct content-type,\n        # the client still has to specify it when uploading the file.\n        with open(filepath, \"rb\") as f:\n            pf = TqdmProgressFileReader(f)\n            data = f if pf.total_size > 0 else \"\"\n\n            r = requests.put(urls.write, data=data, headers={\"content-type\": urls.type})\n            r.raise_for_status()\n            pf.close()\n        return urls.access\n\n    def list_objs(self, token: str, organization: Optional[str] = None) -> List[S3Obj]:\n        \"\"\"\n        Call HF API to list all stored files for user (or one of their organizations).\n        \"\"\"\n        path = \"{}/api/listObjs\".format(self.endpoint)\n        params = {\"organization\": organization} if organization is not None else None\n        r = requests.get(path, params=params, headers={\"authorization\": \"Bearer {}\".format(token)})\n        r.raise_for_status()\n        d = r.json()\n        return [S3Obj(**x) for x in d]\n\n    def delete_obj(self, token: str, filename: str, organization: Optional[str] = None):\n        \"\"\"\n        Call HF API to delete a file stored by user\n        \"\"\"\n        path = \"{}/api/deleteObj\".format(self.endpoint)\n        r = requests.delete(\n            path,\n            headers={\"authorization\": \"Bearer {}\".format(token)},\n            json={\"filename\": filename, \"organization\": organization},\n        )\n        r.raise_for_status()\n\n    def model_list(self) -> List[ModelInfo]:\n        \"\"\"\n        Get the public list of all the models on huggingface, including the community models\n        \"\"\"\n        path = \"{}/api/models\".format(self.endpoint)\n        r = requests.get(path)\n        r.raise_for_status()\n        d = r.json()\n        return [ModelInfo(**x) for x in d]\n\n\nclass TqdmProgressFileReader:\n    \"\"\"\n    Wrap an io.BufferedReader `f` (such as the output of `open(…, \"rb\")`)\n    and override `f.read()` so as to display a tqdm progress bar.\n\n    see github.com/huggingface/transformers1/pull/2078#discussion_r354739608\n    for implementation details.\n    \"\"\"\n\n    def __init__(self, 
f: io.BufferedReader):\n        self.f = f\n        self.total_size = os.fstat(f.fileno()).st_size\n        self.pbar = tqdm(total=self.total_size, leave=False)\n        self.read = f.read\n        f.read = self._read\n\n    def _read(self, n=-1):\n        self.pbar.update(n)\n        return self.read(n)\n\n    def close(self):\n        self.pbar.close()\n\n\nclass HfFolder:\n    path_token = expanduser(\"~/.huggingface/token\")\n\n    @classmethod\n    def save_token(cls, token):\n        \"\"\"\n        Save token, creating folder as needed.\n        \"\"\"\n        os.makedirs(os.path.dirname(cls.path_token), exist_ok=True)\n        with open(cls.path_token, \"w+\") as f:\n            f.write(token)\n\n    @classmethod\n    def get_token(cls):\n        \"\"\"\n        Get token or None if not existent.\n        \"\"\"\n        try:\n            with open(cls.path_token, \"r\") as f:\n                return f.read()\n        except FileNotFoundError:\n            pass\n\n    @classmethod\n    def delete_token(cls):\n        \"\"\"\n        Delete token.\n        Do not fail if token does not exist.\n        \"\"\"\n        try:\n            os.remove(cls.path_token)\n        except FileNotFoundError:\n            pass\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/hf_argparser.py",
    "content": "import dataclasses\nimport json\nimport sys\nfrom argparse import ArgumentParser\nfrom enum import Enum\nfrom pathlib import Path\nfrom typing import Any, Iterable, List, NewType, Tuple, Union\n\n\nDataClass = NewType(\"DataClass\", Any)\nDataClassType = NewType(\"DataClassType\", Any)\n\n\nclass HfArgumentParser(ArgumentParser):\n    \"\"\"\n    This subclass of `argparse.ArgumentParser` uses type hints on dataclasses\n    to generate arguments.\n\n    The class is designed to play well with the native argparse. In particular,\n    you can add more (non-dataclass backed) arguments to the parser after initialization\n    and you'll get the output back after parsing as an additional namespace.\n    \"\"\"\n\n    dataclass_types: Iterable[DataClassType]\n\n    def __init__(self, dataclass_types: Union[DataClassType, Iterable[DataClassType]], **kwargs):\n        \"\"\"\n        Args:\n            dataclass_types:\n                Dataclass type, or list of dataclass types for which we will \"fill\" instances\n                with the parsed args.\n            kwargs:\n                (Optional) Passed to `argparse.ArgumentParser()` in the regular way.\n        \"\"\"\n        super().__init__(**kwargs)\n        if dataclasses.is_dataclass(dataclass_types):\n            dataclass_types = [dataclass_types]\n        self.dataclass_types = dataclass_types\n        for dtype in self.dataclass_types:\n            self._add_dataclass_arguments(dtype)\n\n    def _add_dataclass_arguments(self, dtype: DataClassType):\n        for field in dataclasses.fields(dtype):\n            field_name = f\"--{field.name}\"\n            kwargs = field.metadata.copy()\n            # field.metadata is not used at all by Data Classes,\n            # it is provided as a third-party extension mechanism.\n            if isinstance(field.type, str):\n                raise ImportError(\n                    \"This implementation is not compatible with Postponed Evaluation of Annotations (PEP 563),\"\n                    \"which can be opted in from Python 3.7 with `from __future__ import annotations`.\"\n                    \"We will add compatibility when Python 3.9 is released.\"\n                )\n            typestring = str(field.type)\n            for prim_type in (int, float, str):\n                for collection in (List,):\n                    if typestring == f\"typing.Union[{collection[prim_type]}, NoneType]\":\n                        field.type = collection[prim_type]\n                if typestring == f\"typing.Union[{prim_type.__name__}, NoneType]\":\n                    field.type = prim_type\n\n            if isinstance(field.type, type) and issubclass(field.type, Enum):\n                kwargs[\"choices\"] = list(field.type)\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n            elif field.type is bool:\n                kwargs[\"action\"] = \"store_false\" if field.default is True else \"store_true\"\n                if field.default is True:\n                    field_name = f\"--no-{field.name}\"\n                    kwargs[\"dest\"] = field.name\n            elif hasattr(field.type, \"__origin__\") and issubclass(field.type.__origin__, List):\n                kwargs[\"nargs\"] = \"+\"\n                kwargs[\"type\"] = field.type.__args__[0]\n                assert all(\n                    x == kwargs[\"type\"] for x in field.type.__args__\n                ), \"{} 
cannot be a List of mixed types\".format(field.name)\n                if field.default_factory is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default_factory()\n            else:\n                kwargs[\"type\"] = field.type\n                if field.default is not dataclasses.MISSING:\n                    kwargs[\"default\"] = field.default\n                else:\n                    kwargs[\"required\"] = True\n            self.add_argument(field_name, **kwargs)\n\n    def parse_args_into_dataclasses(\n        self, args=None, return_remaining_strings=False, look_for_args_file=True\n    ) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Parse command-line args into instances of the specified dataclass types.\n\n        This relies on argparse's `ArgumentParser.parse_known_args`.\n        See the doc at:\n        docs.python.org/3.7/library/argparse.html#argparse.ArgumentParser.parse_args\n\n        Args:\n            args:\n                List of strings to parse. The default is taken from sys.argv.\n                (same as argparse.ArgumentParser)\n            return_remaining_strings:\n                If true, also return a list of remaining argument strings.\n            look_for_args_file:\n                If true, will look for a \".args\" file with the same base name\n                as the entry point script for this process, and will append its\n                potential content to the command line args.\n\n        Returns:\n            Tuple consisting of:\n                - the dataclass instances in the same order as they\n                  were passed to the initializer.abspath\n                - if applicable, an additional namespace for more\n                  (non-dataclass backed) arguments added to the parser\n                  after initialization.\n                - The potential list of remaining argument strings.\n                  (same as argparse.ArgumentParser.parse_known_args)\n        \"\"\"\n        if look_for_args_file and len(sys.argv):\n            args_file = Path(sys.argv[0]).with_suffix(\".args\")\n            if args_file.exists():\n                fargs = args_file.read_text().split()\n                args = fargs + args if args is not None else fargs + sys.argv[1:]\n                # in case of duplicate arguments the first one has precedence\n                # so we append rather than prepend.\n        namespace, remaining_args = self.parse_known_args(args=args)\n        outputs = []\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in vars(namespace).items() if k in keys}\n            for k in keys:\n                delattr(namespace, k)\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        if len(namespace.__dict__) > 0:\n            # additional namespace.\n            outputs.append(namespace)\n        if return_remaining_strings:\n            return (*outputs, remaining_args)\n        else:\n            if remaining_args:\n                raise ValueError(f\"Some specified arguments are not used by the HfArgumentParser: {remaining_args}\")\n\n            return (*outputs,)\n\n    def parse_json_file(self, json_file: str) -> Tuple[DataClass, ...]:\n        \"\"\"\n        Alternative helper method that does not use `argparse` at all,\n        instead loading a json file and populating the dataclass types.\n        \"\"\"\n        data = json.loads(Path(json_file).read_text())\n        outputs = 
[]\n        for dtype in self.dataclass_types:\n            keys = {f.name for f in dataclasses.fields(dtype)}\n            inputs = {k: v for k, v in data.items() if k in keys}\n            obj = dtype(**inputs)\n            outputs.append(obj)\n        return (*outputs,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modelcard.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Configuration base class and utilities.\"\"\"\n\n\nimport copy\nimport json\nimport logging\nimport os\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP\nfrom .file_utils import (\n    CONFIG_NAME,\n    MODEL_CARD_NAME,\n    TF2_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModelCard:\n    r\"\"\" Structured Model Card class.\n        Store model card as well as methods for loading/downloading/saving model cards.\n\n        Please read the following paper for details and explanation on the sections:\n            \"Model Cards for Model Reporting\"\n                by Margaret Mitchell, Simone Wu,\n                Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer,\n                Inioluwa Deborah Raji and Timnit Gebru for the proposal behind model cards.\n            Link: https://arxiv.org/abs/1810.03993\n\n        Note:\n            A model card can be loaded and saved to disk.\n\n        Parameters:\n    \"\"\"\n\n    def __init__(self, **kwargs):\n        # Recomended attributes from https://arxiv.org/abs/1810.03993 (see papers)\n        self.model_details = kwargs.pop(\"model_details\", {})\n        self.intended_use = kwargs.pop(\"intended_use\", {})\n        self.factors = kwargs.pop(\"factors\", {})\n        self.metrics = kwargs.pop(\"metrics\", {})\n        self.evaluation_data = kwargs.pop(\"evaluation_data\", {})\n        self.training_data = kwargs.pop(\"training_data\", {})\n        self.quantitative_analyses = kwargs.pop(\"quantitative_analyses\", {})\n        self.ethical_considerations = kwargs.pop(\"ethical_considerations\", {})\n        self.caveats_and_recommendations = kwargs.pop(\"caveats_and_recommendations\", {})\n\n        # Open additional attributes\n        for key, value in kwargs.items():\n            try:\n                setattr(self, key, value)\n            except AttributeError as err:\n                logger.error(\"Can't set {} with value {} for {}\".format(key, value, self))\n                raise err\n\n    def save_pretrained(self, save_directory_or_file):\n        \"\"\" Save a model card object to the directory or file `save_directory_or_file`.\n        \"\"\"\n        if os.path.isdir(save_directory_or_file):\n            # If we save using the predefined names, we can load using `from_pretrained`\n            output_model_card_file = os.path.join(save_directory_or_file, MODEL_CARD_NAME)\n        else:\n            output_model_card_file = save_directory_or_file\n\n        self.to_json_file(output_model_card_file)\n        logger.info(\"Model card saved in {}\".format(output_model_card_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):\n        r\"\"\" Instantiate a :class:`~transformers1.ModelCard` from a pre-trained 
model model card.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model card to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model card that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing a model card file saved using the :func:`~transformers1.ModelCard.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - a path or url to a saved model card JSON `file`, e.g.: ``./my_model_directory/modelcard.json``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                card should be cached if the standard cache should not be used.\n\n            kwargs: (`optional`) dict: key/value pairs with which to update the ModelCard object after loading.\n\n                - The values in kwargs of any keys which are model card attributes will be used to override the loaded values.\n                - Behavior concerning key/value pairs whose keys are *not* model card attributes is controlled by the `return_unused_kwargs` keyword parameter.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            find_from_standard_name: (`optional`) boolean, default True:\n                If the pretrained_model_name_or_path ends with our standard model or config filenames, replace them with our standard modelcard filename.\n                Can be used to directly feed a model/config url and access the colocated modelcard.\n\n            return_unused_kwargs: (`optional`) bool:\n\n                - If False, then this function returns just the final model card object.\n                - If True, then this functions returns a tuple `(model card, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not model card attributes: ie the part of kwargs which has not been used to update `ModelCard` and is otherwise ignored.\n\n        Examples::\n\n            modelcard = ModelCard.from_pretrained('bert-base-uncased')    # Download model card from S3 and cache.\n            modelcard = ModelCard.from_pretrained('./test/saved_model/')  # E.g. model card was saved using `save_pretrained('./test/saved_model/')`\n            modelcard = ModelCard.from_pretrained('./test/saved_model/modelcard.json')\n            modelcard = ModelCard.from_pretrained('bert-base-uncased', output_attention=True, foo=False)\n\n        \"\"\"\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        proxies = kwargs.pop(\"proxies\", None)\n        find_from_standard_name = kwargs.pop(\"find_from_standard_name\", True)\n        return_unused_kwargs = kwargs.pop(\"return_unused_kwargs\", False)\n\n        if pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            # For simplicity we use the same pretrained url than the configuration files\n            # but with a different suffix (modelcard.json). 
This suffix is replaced below.\n            model_card_file = ALL_PRETRAINED_CONFIG_ARCHIVE_MAP[pretrained_model_name_or_path]\n        elif os.path.isdir(pretrained_model_name_or_path):\n            model_card_file = os.path.join(pretrained_model_name_or_path, MODEL_CARD_NAME)\n        elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n            model_card_file = pretrained_model_name_or_path\n        else:\n            model_card_file = hf_bucket_url(pretrained_model_name_or_path, filename=MODEL_CARD_NAME, use_cdn=False)\n\n        if find_from_standard_name or pretrained_model_name_or_path in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            model_card_file = model_card_file.replace(CONFIG_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(WEIGHTS_NAME, MODEL_CARD_NAME)\n            model_card_file = model_card_file.replace(TF2_WEIGHTS_NAME, MODEL_CARD_NAME)\n\n        try:\n            # Load from URL or cache if already cached\n            resolved_model_card_file = cached_path(\n                model_card_file, cache_dir=cache_dir, force_download=True, proxies=proxies, resume_download=False\n            )\n            if resolved_model_card_file is None:\n                raise EnvironmentError\n            if resolved_model_card_file == model_card_file:\n                logger.info(\"loading model card file {}\".format(model_card_file))\n            else:\n                logger.info(\n                    \"loading model card file {} from cache at {}\".format(model_card_file, resolved_model_card_file)\n                )\n            # Load model card\n            modelcard = cls.from_json_file(resolved_model_card_file)\n\n        except (EnvironmentError, json.JSONDecodeError):\n            # We fall back on creating an empty model card\n            modelcard = cls()\n\n        # Update model card with kwargs if needed\n        to_remove = []\n        for key, value in kwargs.items():\n            if hasattr(modelcard, key):\n                setattr(modelcard, key, value)\n                to_remove.append(key)\n        for key in to_remove:\n            kwargs.pop(key, None)\n\n        logger.info(\"Model card: %s\", str(modelcard))\n        if return_unused_kwargs:\n            return modelcard, kwargs\n        else:\n            return modelcard\n\n    @classmethod\n    def from_dict(cls, json_object):\n        \"\"\"Constructs a `ModelCard` from a Python dictionary of parameters.\"\"\"\n        return cls(**json_object)\n\n    @classmethod\n    def from_json_file(cls, json_file):\n        \"\"\"Constructs a `ModelCard` from a json file of parameters.\"\"\"\n        with open(json_file, \"r\", encoding=\"utf-8\") as reader:\n            text = reader.read()\n        dict_obj = json.loads(text)\n        return cls(**dict_obj)\n\n    def __eq__(self, other):\n        return self.__dict__ == other.__dict__\n\n    def __repr__(self):\n        return str(self.to_json_string())\n\n    def to_dict(self):\n        \"\"\"Serializes this instance to a Python dictionary.\"\"\"\n        output = copy.deepcopy(self.__dict__)\n        return output\n\n    def to_json_string(self):\n        \"\"\"Serializes this instance to a JSON string.\"\"\"\n        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + \"\\n\"\n\n    def to_json_file(self, json_file_path):\n        \"\"\" Save this instance to a json file.\"\"\"\n        with open(json_file_path, \"w\", encoding=\"utf-8\") as writer:\n            
writer.write(self.to_json_string())\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch ALBERT model. \"\"\"\n\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import ACT2FN, BertEmbeddings, BertSelfAttention, prune_linear_layer\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\ndef load_tf_weights_in_albert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        print(name)\n\n    for name, array in zip(names, arrays):\n        original_name = name\n\n        # If saved from the TF HUB module\n        name = name.replace(\"module/\", \"\")\n\n        # Renaming and simplifying\n        name = name.replace(\"ffn_1\", \"ffn\")\n        name = name.replace(\"bert/\", \"albert/\")\n        name = name.replace(\"attention_1\", \"attention\")\n        name = name.replace(\"transform/\", \"\")\n        name = name.replace(\"LayerNorm_1\", \"full_layer_layer_norm\")\n        name = name.replace(\"LayerNorm\", \"attention/LayerNorm\")\n        name = name.replace(\"transformer/\", \"\")\n\n        # The feed forward layer had an 'intermediate' step which has been abstracted away\n        name = name.replace(\"intermediate/dense/\", \"\")\n        name = name.replace(\"ffn/intermediate/output/dense/\", \"ffn_output/\")\n\n        # ALBERT attention was split between self and output which have been abstracted away\n        name = name.replace(\"/output/\", \"/\")\n        name = name.replace(\"/self/\", \"/\")\n\n        # The pooler is a linear layer\n        name = name.replace(\"pooler/dense\", \"pooler\")\n\n        # The classifier was simplified to predictions from cls/predictions\n        name = name.replace(\"cls/predictions\", \"predictions\")\n        name = name.replace(\"predictions/attention\", \"predictions\")\n\n        # Naming was changed to be more explicit\n        name = name.replace(\"embeddings/attention\", \"embeddings\")\n        name = name.replace(\"inner_group_\", \"albert_layers/\")\n        name = name.replace(\"group_\", \"albert_layer_groups/\")\n\n        # Classifier\n        if len(name.split(\"/\")) == 1 and (\"output_bias\" in name or \"output_weights\" in name):\n            name = \"classifier/\" + name\n\n        # No ALBERT model currently handles the next sentence prediction task\n        if \"seq_relationship\" in name:\n            name = name.replace(\"seq_relationship/output_\", \"sop_classifier/classifier/\")\n            name = name.replace(\"weights\", \"weight\")\n\n        name = name.split(\"/\")\n\n        # Ignore the gradients applied by the LAMB/ADAM optimizers.\n        if (\n            \"adam_m\" in name\n            or \"adam_v\" in name\n            or \"AdamWeightDecayOptimizer\" in name\n            or \"AdamWeightDecayOptimizer_1\" in name\n            or \"global_step\" in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            
elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        print(\"Initialize PyTorch weight {} from {}\".format(name, original_name))\n        pointer.data = torch.from_numpy(array)\n\n    return model\n\n\nclass AlbertEmbeddings(BertEmbeddings):\n    \"\"\"\n    Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n        self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass AlbertAttention(BertSelfAttention):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.output_attentions = config.output_attentions\n        self.num_attention_heads = config.num_attention_heads\n        self.hidden_size = config.hidden_size\n        self.attention_head_size = config.hidden_size // config.num_attention_heads\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.num_attention_heads, self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.query = prune_linear_layer(self.query, index)\n        self.key = prune_linear_layer(self.key, index)\n        self.value = prune_linear_layer(self.value, index)\n        self.dense = prune_linear_layer(self.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.num_attention_heads = self.num_attention_heads - len(heads)\n        self.all_head_size = self.attention_head_size * 
self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input_ids, attention_mask=None, head_mask=None):\n        mixed_query_layer = self.query(input_ids)\n        mixed_key_layer = self.key(input_ids)\n        mixed_value_layer = self.value(input_ids)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n\n        # Should find a better way to do this\n        w = (\n            self.dense.weight.t()\n            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)\n            .to(context_layer.dtype)\n        )\n        b = self.dense.bias.to(context_layer.dtype)\n\n        projected_context_layer = torch.einsum(\"bfnd,ndh->bfh\", context_layer, w) + b\n        projected_context_layer_dropout = self.dropout(projected_context_layer)\n        layernormed_context_layer = self.LayerNorm(input_ids + projected_context_layer_dropout)\n        return (layernormed_context_layer, attention_probs) if self.output_attentions else (layernormed_context_layer,)\n\n\nclass AlbertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.full_layer_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.attention = AlbertAttention(config)\n        self.ffn = nn.Linear(config.hidden_size, config.intermediate_size)\n        self.ffn_output = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        attention_output = self.attention(hidden_states, attention_mask, head_mask)\n        ffn_output = self.ffn(attention_output[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_output[0])\n\n        return (hidden_states,) + attention_output[1:]  # add attentions if we output them\n\n\nclass AlbertLayerGroup(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = 
nn.ModuleList([AlbertLayer(config) for _ in range(config.inner_group_num)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer(hidden_states, attention_mask, head_mask[layer_index])\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        return outputs  # last-layer hidden state, (layer hidden states), (layer attentions)\n\n\nclass AlbertTransformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = nn.Linear(config.embedding_size, config.hidden_size)\n        self.albert_layer_groups = nn.ModuleList([AlbertLayerGroup(config) for _ in range(config.num_hidden_groups)])\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None):\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                hidden_states,\n                attention_mask,\n                head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass AlbertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare ALBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertModel(AlbertPreTrainedModel):\n\n    config_class = AlbertConfig\n    load_tf_weights = load_tf_weights_in_albert\n    base_model_prefix = \"albert\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.config = config\n        self.embeddings = AlbertEmbeddings(config)\n        self.encoder = AlbertTransformer(config)\n        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)\n        self.pooler_activation = nn.Tanh()\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.embeddings.word_embeddings\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.embeddings.word_embeddings = new_embeddings\n        return self.embeddings.word_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            ALBERT has a different architecture in that its layers are shared across groups, which then has inner groups.\n            If an ALBERT model has 12 hidden layers and 2 hidden groups, with two inner groups, there\n            is a total of 4 different layers.\n\n            These layers are flattened: the indices [0,1] correspond to the two inner groups of the first hidden layer,\n            while [2,3] correspond to the two inner groups of the second hidden layer.\n\n            Any layer with in index other than [0,1,2,3] will result in an error.\n            See base class PreTrainedModel for more information about head pruning\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            group_idx = int(layer / self.config.inner_group_num)\n            inner_group_idx = int(layer - group_idx * self.config.inner_group_num)\n            self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` 
comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertModel, AlbertTokenizer\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertModel.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids, position_ids=position_ids, 
token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(embedding_output, extended_attention_mask, head_mask=head_mask)\n\n        sequence_output = encoder_outputs[0]\n\n        pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))\n\n        outputs = (sequence_output, pooled_output) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `sentence order prediction (classification)` head. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForPreTraining(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n        self.sop_classifier = AlbertSOPHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        sentence_order_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates original order (sequence A, then sequence B),\n            ``1`` indicates switched order (sequence B, then sequence A).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForPreTraining\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForPreTraining.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, sop_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output)\n\n        outputs = (prediction_scores, sop_scores,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and sentence_order_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            sentence_order_loss = loss_fct(sop_scores.view(-1, 2), sentence_order_label.view(-1))\n            total_loss = masked_lm_loss + sentence_order_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), prediction_scores, 
sop_scores, (hidden_states), (attentions)\n\n\nclass AlbertMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = nn.LayerNorm(config.embedding_size)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n        self.decoder = nn.Linear(config.embedding_size, config.vocab_size)\n        self.activation = ACT2FN[config.hidden_act]\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n\n        prediction_scores = hidden_states\n\n        return prediction_scores\n\n\nclass AlbertSOPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, pooled_output):\n        dropout_pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\n    \"Albert Model with a `language modeling` head on top.\", ALBERT_START_DOCSTRING,\n)\nclass AlbertForMaskedLM(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.albert = AlbertModel(config)\n        self.predictions = AlbertMLMHead(config)\n\n        self.init_weights()\n        self.tie_weights()\n\n    def tie_weights(self):\n        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)\n\n    def get_output_embeddings(self):\n        return self.predictions.decoder\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with\n            labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the 
embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Example::\n\n        from transformers1 import AlbertTokenizer, AlbertForMaskedLM\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_outputs = outputs[0]\n\n        prediction_scores = self.predictions(sequence_outputs)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForSequenceClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.classifier_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification (or regression if config.num_labels==1) loss.\n        logits ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import AlbertTokenizer, AlbertForSequenceClassification\n            import torch\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n            labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, labels=labels)\n            loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = 
self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForTokenClassification(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import AlbertTokenizer, AlbertForTokenClassification\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForTokenClassification.from_pretrained('albert-base-v2')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass AlbertForQuestionAnswering(AlbertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.albert = AlbertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        
start_scores ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-start scores (before SoftMax).\n        end_scores: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import AlbertTokenizer, AlbertForQuestionAnswering\n        import torch\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_dict = tokenizer.encode_plus(question, text, return_tensors='pt')\n        start_scores, end_scores = model(**input_dict)\n\n        \"\"\"\n\n        outputs = self.albert(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    EncoderDecoderConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_albert import (\n    AlbertForMaskedLM,\n    AlbertForPreTraining,\n    AlbertForQuestionAnswering,\n    AlbertForSequenceClassification,\n    AlbertForTokenClassification,\n    AlbertModel,\n)\nfrom .modeling_bart import BartForConditionalGeneration, BartForSequenceClassification, BartModel\nfrom .modeling_bert import (\n    BertForMaskedLM,\n    BertForMultipleChoice,\n    BertForPreTraining,\n    BertForQuestionAnswering,\n    BertForSequenceClassification,\n    BertForTokenClassification,\n    BertModel,\n)\nfrom .modeling_camembert import (\n    CamembertForMaskedLM,\n    CamembertForMultipleChoice,\n    CamembertForSequenceClassification,\n    CamembertForTokenClassification,\n    CamembertModel,\n)\nfrom .modeling_ctrl import CTRLLMHeadModel, CTRLModel\nfrom .modeling_distilbert import (\n    DistilBertForMaskedLM,\n    DistilBertForQuestionAnswering,\n    DistilBertForSequenceClassification,\n    DistilBertForTokenClassification,\n    DistilBertModel,\n)\nfrom .modeling_electra import (\n    ElectraForMaskedLM,\n    ElectraForPreTraining,\n    ElectraForSequenceClassification,\n    ElectraForTokenClassification,\n    ElectraModel,\n)\nfrom .modeling_encoder_decoder import EncoderDecoderModel\nfrom .modeling_flaubert import (\n    FlaubertForQuestionAnsweringSimple,\n    FlaubertForSequenceClassification,\n    FlaubertModel,\n    FlaubertWithLMHeadModel,\n)\nfrom .modeling_gpt2 import GPT2LMHeadModel, GPT2Model\nfrom .modeling_longformer import (\n    LongformerForMaskedLM,\n    LongformerForMultipleChoice,\n    LongformerForQuestionAnswering,\n    LongformerForSequenceClassification,\n    LongformerForTokenClassification,\n    LongformerModel,\n)\nfrom .modeling_marian import MarianMTModel\nfrom .modeling_openai import OpenAIGPTLMHeadModel, OpenAIGPTModel\nfrom .modeling_reformer import ReformerModel, ReformerModelWithLMHead\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\nfrom .modeling_t5 import T5ForConditionalGeneration, T5Model\nfrom .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel\nfrom .modeling_xlm import (\n    
XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMForTokenClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n)\nfrom .modeling_xlm_roberta import (\n    XLMRobertaForMaskedLM,\n    XLMRobertaForMultipleChoice,\n    XLMRobertaForSequenceClassification,\n    XLMRobertaForTokenClassification,\n    XLMRobertaModel,\n)\nfrom .modeling_xlnet import (\n    XLNetForMultipleChoice,\n    XLNetForQuestionAnsweringSimple,\n    XLNetForSequenceClassification,\n    XLNetForTokenClassification,\n    XLNetLMHeadModel,\n    XLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nMODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, T5Model),\n        (DistilBertConfig, DistilBertModel),\n        (AlbertConfig, AlbertModel),\n        (CamembertConfig, CamembertModel),\n        (XLMRobertaConfig, XLMRobertaModel),\n        (BartConfig, BartModel),\n        (LongformerConfig, LongformerModel),\n        (RobertaConfig, RobertaModel),\n        (BertConfig, BertModel),\n        (OpenAIGPTConfig, OpenAIGPTModel),\n        (GPT2Config, GPT2Model),\n        (TransfoXLConfig, TransfoXLModel),\n        (XLNetConfig, XLNetModel),\n        (FlaubertConfig, FlaubertModel),\n        (XLMConfig, XLMModel),\n        (CTRLConfig, CTRLModel),\n        (ElectraConfig, ElectraModel),\n        (ReformerConfig, ReformerModel),\n    ]\n)\n\nMODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForPreTraining),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForPreTraining),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForPreTraining),\n    ]\n)\n\nMODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, T5ForConditionalGeneration),\n        (DistilBertConfig, DistilBertForMaskedLM),\n        (AlbertConfig, AlbertForMaskedLM),\n        (CamembertConfig, CamembertForMaskedLM),\n        (XLMRobertaConfig, XLMRobertaForMaskedLM),\n        (MarianConfig, MarianMTModel),\n        (BartConfig, BartForConditionalGeneration),\n        (LongformerConfig, LongformerForMaskedLM),\n        (RobertaConfig, RobertaForMaskedLM),\n        (BertConfig, BertForMaskedLM),\n        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),\n        (GPT2Config, GPT2LMHeadModel),\n        (TransfoXLConfig, TransfoXLLMHeadModel),\n        (XLNetConfig, XLNetLMHeadModel),\n        (FlaubertConfig, FlaubertWithLMHeadModel),\n        (XLMConfig, XLMWithLMHeadModel),\n        (CTRLConfig, CTRLLMHeadModel),\n        (ElectraConfig, ElectraForMaskedLM),\n        (EncoderDecoderConfig, EncoderDecoderModel),\n        (ReformerConfig, ReformerModelWithLMHead),\n    ]\n)\n\nMODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForSequenceClassification),\n        (AlbertConfig, AlbertForSequenceClassification),\n        (CamembertConfig, CamembertForSequenceClassification),\n        (XLMRobertaConfig, 
XLMRobertaForSequenceClassification),\n        (BartConfig, BartForSequenceClassification),\n        (LongformerConfig, LongformerForSequenceClassification),\n        (RobertaConfig, RobertaForSequenceClassification),\n        (BertConfig, BertForSequenceClassification),\n        (XLNetConfig, XLNetForSequenceClassification),\n        (FlaubertConfig, FlaubertForSequenceClassification),\n        (XLMConfig, XLMForSequenceClassification),\n        (ElectraConfig, ElectraForSequenceClassification),\n    ]\n)\n\nMODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForQuestionAnswering),\n        (AlbertConfig, AlbertForQuestionAnswering),\n        (LongformerConfig, LongformerForQuestionAnswering),\n        (RobertaConfig, RobertaForQuestionAnswering),\n        (BertConfig, BertForQuestionAnswering),\n        (XLNetConfig, XLNetForQuestionAnsweringSimple),\n        (FlaubertConfig, FlaubertForQuestionAnsweringSimple),\n        (XLMConfig, XLMForQuestionAnsweringSimple),\n    ]\n)\n\nMODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, DistilBertForTokenClassification),\n        (CamembertConfig, CamembertForTokenClassification),\n        (XLMConfig, XLMForTokenClassification),\n        (XLMRobertaConfig, XLMRobertaForTokenClassification),\n        (LongformerConfig, LongformerForTokenClassification),\n        (RobertaConfig, RobertaForTokenClassification),\n        (BertConfig, BertForTokenClassification),\n        (XLNetConfig, XLNetForTokenClassification),\n        (AlbertConfig, AlbertForTokenClassification),\n        (ElectraConfig, ElectraForTokenClassification),\n    ]\n)\n\n\nMODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [\n        (CamembertConfig, CamembertForMultipleChoice),\n        (XLMRobertaConfig, XLMRobertaForMultipleChoice),\n        (LongformerConfig, LongformerForMultipleChoice),\n        (RobertaConfig, RobertaForMultipleChoice),\n        (BertConfig, BertForMultipleChoice),\n        (XLNetConfig, XLNetForMultipleChoice),\n    ]\n)\n\n\nclass AutoModel:\n    r\"\"\"\n        :class:`~transformers1.AutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        or the `AutoModel.from_config(config)` class methods.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModel is designed to be instantiated \"\n            \"using the `AutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerModel` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaModel` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModel` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraModel` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5Model` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertModel` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertModel` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertModel` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaModel` (XLM-RoBERTa model)\n            - `longformer` :class:`~transformers1.LongformerModel` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaModel` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertModel` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2Model` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLModel` (Salesforce CTRL  model)\n            - `flaubert`: :class:`~transformers1.FlaubertModel` (Flaubert  model)\n            - `electra`: :class:`~transformers1.ElectraModel` (Electra  model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForPreTraining:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForPreTraining is designed to be instantiated \"\n            \"using the `AutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForPreTraining` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelWithLMHead:\n    r\"\"\"\n        :class:`~transformers1.AutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelWithLMHead is designed to be instantiated \"\n            \"using the `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n                - isInstance of `longformer` configuration class: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForMaskedLM` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.T5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.DistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForMaskedLM` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForMaskedLM` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForMaskedLM` (XLM-RoBERTa model)\n            - `longformer`: :class:`~transformers1.LongformerForMaskedLM` (Longformer model)\n            - `roberta`: :class:`~transformers1.RobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForMaskedLM` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.OpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.GPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.XLNetLMHeadModel` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.CTRLLMHeadModel` (Salesforce CTRL model)\n            - `flaubert`: :class:`~transformers1.FlaubertWithLMHeadModel` (Flaubert model)\n            - `electra`: :class:`~transformers1.ElectraForMaskedLM` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelWithLMHead.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass AutoModelForSequenceClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForSequenceClassification` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForSequenceClassification` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForSequenceClassification` (ALBERT model)\n            - `camembert`: :class:`~transformers1.CamembertForSequenceClassification` (CamemBERT model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)\n            - `roberta`: :class:`~transformers1.RobertaForSequenceClassification` (RoBERTa model)\n            - `bert`: :class:`~transformers1.BertForSequenceClassification` (Bert model)\n            - `xlnet`: 
:class:`~transformers1.XLNetForSequenceClassification` (XLNet model)\n            - `flaubert`: :class:`~transformers1.FlaubertForSequenceClassification` (Flaubert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForSequenceClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForQuestionAnswering:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `AutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModelForQuestionAnswering` (Bert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n                - isInstance of `flaubert` configuration class: :class:`~transformers1.FlaubertForQuestionAnswering` (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForQuestionAnswering` (DistilBERT model)\n            - `albert`: :class:`~transformers1.AlbertForQuestionAnswering` (ALBERT model)\n            - `bert`: :class:`~transformers1.BertForQuestionAnswering` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForQuestionAnswering` (XLNet model)\n            - `xlm`: :class:`~transformers1.XLMForQuestionAnswering` (XLM model)\n            - `flaubert`: :class:`~transformers1.FlaubertForQuestionAnswering` (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a 
`directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n            model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForQuestionAnswering.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForTokenClassification:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForTokenClassification` is a generic model class\n        that will be instantiated as one of the token classification model classes of the library\n        when created with the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.DistilBertModelForTokenClassification` (DistilBERT model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n                - isInstance of `xlm roberta` configuration class: :class:`~transformers1.XLMRobertaModelForTokenClassification` (XLMRoberta model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.BertModelForTokenClassification` (Bert model)\n                - isInstance of `albert` configuration class: :class:`~transformers1.AlbertForTokenClassification` (AlBert model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.XLNetModelForTokenClassification` (XLNet model)\n                - isInstance of `camembert` configuration class: :class:`~transformers1.CamembertModelForTokenClassification` (Camembert model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.RobertaModelForTokenClassification` (Roberta model)\n                - isInstance of `electra` configuration class: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: :class:`~transformers1.DistilBertForTokenClassification` (DistilBERT model)\n            - `xlm`: :class:`~transformers1.XLMForTokenClassification` (XLM model)\n            - `xlm-roberta`: :class:`~transformers1.XLMRobertaForTokenClassification` (XLM-RoBERTa?Para model)\n            - `camembert`: :class:`~transformers1.CamembertForTokenClassification` (Camembert model)\n            - `bert`: :class:`~transformers1.BertForTokenClassification` (Bert model)\n            - `xlnet`: :class:`~transformers1.XLNetForTokenClassification` (XLNet model)\n            - `roberta`: :class:`~transformers1.RobertaForTokenClassification` (Roberta model)\n      
      - `electra`: :class:`~transformers1.ElectraForTokenClassification` (Electra model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                These arguments will be passed to the configuration and the model.\n\n        Examples::\n\n      
      model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = AutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = AutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass AutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.AutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `AutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: 
{}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BART model, ported from the fairseq repo.\"\"\"\nimport logging\nimport math\nimport random\nfrom typing import Dict, List, Optional, Tuple\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor, nn\n\nfrom .activations import ACT2FN\nfrom .configuration_bart import BartConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\n\nBART_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n    \"facebook/mbart-large-en-ro\",\n    # See all BART models at https://huggingface.co/models?filter=bart\n]\n\n\nBART_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matters related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BartConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\n\"\"\"\nBART_GENERATION_EXAMPLE = r\"\"\"\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForConditionalGeneration, BartConfig\n        # see ``examples/summarization/bart/evaluate_cnn.py`` for a longer example\n        model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')\n        tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')\n        ARTICLE_TO_SUMMARIZE = \"My friends are cool but they eat too many carbs.\"\n        inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')\n        # Generate Summary\n        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)\n        print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])\n\n\"\"\"\n\nBART_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n               Indices of input sequence tokens in the vocabulary. 
Use BartTokenizer.encode to produce them.\n            Padding will be ignored by default should you provide it.\n            Indices can be obtained using :class:`transformers1.BartTokenizer.encode(text)`.\n        attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices in input_ids.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            If you want to change padding behavior, you should read :func:`~transformers1.modeling_bart._prepare_decoder_inputs` and modify.\n            See diagram 1 in the paper for more info on the default strategy\n\"\"\"\n\n\ndef invert_mask(attention_mask):\n    assert attention_mask.dim() == 2\n    return attention_mask.eq(0)\n\n\ndef _prepare_bart_decoder_inputs(\n    config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32\n):\n    \"\"\"Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if\n    none are provided. This mimics the default behavior in fairseq. 
To override it pass in masks.\n    Note: this is not called during generation\n    \"\"\"\n    pad_token_id = config.pad_token_id\n    if decoder_input_ids is None:\n        decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)\n    bsz, tgt_len = decoder_input_ids.size()\n    if decoder_padding_mask is None:\n        decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)\n    else:\n        decoder_padding_mask = invert_mask(decoder_padding_mask)\n    causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(\n        dtype=causal_mask_dtype, device=decoder_input_ids.device\n    )\n    return decoder_input_ids, decoder_padding_mask, causal_mask\n\n\nclass PretrainedBartModel(PreTrainedModel):\n    config_class = BartConfig\n    base_model_prefix = \"model\"\n\n    def _init_weights(self, module):\n        std = self.config.init_std\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, SinusoidalPositionalEmbedding):\n            pass\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=std)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n\n    @property\n    def dummy_inputs(self):\n        pad_token = self.config.pad_token_id\n        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)\n        dummy_inputs = {\n            \"attention_mask\": input_ids.ne(pad_token),\n            \"input_ids\": input_ids,\n        }\n        return dummy_inputs\n\n\ndef _make_linear_from_emb(emb):\n    vocab_size, emb_size = emb.weight.shape\n    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)\n    lin_layer.weight.data = emb.weight.data\n    return lin_layer\n\n\n# Helper Functions, mostly for making masks\ndef _check_shapes(shape_1, shape2):\n    if shape_1 != shape2:\n        raise AssertionError(\"shape mismatch: {} != {}\".format(shape_1, shape2))\n\n\ndef shift_tokens_right(input_ids, pad_token_id):\n    \"\"\"Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).\"\"\"\n    prev_output_tokens = input_ids.clone()\n    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)\n    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()\n    prev_output_tokens[:, 1:] = input_ids[:, :-1]\n    return prev_output_tokens\n\n\ndef make_padding_mask(input_ids, padding_idx=1):\n    \"\"\"True for pad tokens\"\"\"\n    padding_mask = input_ids.eq(padding_idx)\n    if not padding_mask.any():\n        padding_mask = None\n    return padding_mask\n\n\n# Helper Modules\n\n\nclass EncoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.normalize_before = config.normalize_before\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)\n        
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(self, x, encoder_padding_mask):\n        \"\"\"\n        Args:\n            x (Tensor): input to the layer of shape `(seq_len, batch, embed_dim)`\n            encoder_padding_mask (ByteTensor): binary ByteTensor of shape\n                `(batch, src_len)` where padding elements are indicated by ``1``.\n            for t_tgt, t_src is excluded (or masked out), =0 means it is\n            included in attention\n\n        Returns:\n            encoded output of shape `(seq_len, batch, embed_dim)`\n        \"\"\"\n        residual = x\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        x, attn_weights = self.self_attn(\n            query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return x, attn_weights\n\n\nclass BartEncoder(nn.Module):\n    \"\"\"\n    Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer\n    is a :class:`EncoderLayer`.\n\n    Args:\n        config: BartConfig\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens):\n        super().__init__()\n\n        self.dropout = config.dropout\n        self.layerdrop = config.encoder_layerdrop\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        embed_dim = embed_tokens.embedding_dim\n        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_source_positions = config.max_position_embeddings\n\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, embed_dim, self.padding_idx,\n            )\n        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.encoder_layers)])\n        self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity()\n        # mbart has one extra layer_norm\n        self.layer_norm = LayerNorm(config.d_model) if config.normalize_before else None\n\n    def forward(\n        self, input_ids, attention_mask=None,\n    ):\n        \"\"\"\n        Args:\n            input_ids (LongTensor): tokens in the source language of shape\n                `(batch, src_len)`\n            attention_mask (torch.LongTensor): indicating which indices are padding tokens.\n        Returns:\n            Tuple comprised of:\n                - **x** (Tensor): the last encoder layer's output of\n                  shape 
`(src_len, batch, embed_dim)`\n                - **encoder_states** (List[Tensor]): all intermediate\n                  hidden states of shape `(src_len, batch, embed_dim)`.\n                  Only populated if *self.output_hidden_states:* is True.\n                - **all_attentions** (List[Tensor]): Attention weights for each layer.\n                During training might not be of length n_layers because of layer dropout.\n        \"\"\"\n        # check attention mask and invert\n        if attention_mask is not None:\n            attention_mask = invert_mask(attention_mask)\n\n        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale\n        embed_pos = self.embed_positions(input_ids)\n        x = inputs_embeds + embed_pos\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # B x T x C -> T x B x C\n        x = x.transpose(0, 1)\n\n        encoder_states, all_attentions = [], []\n        for encoder_layer in self.layers:\n            if self.output_hidden_states:\n                encoder_states.append(x)\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):  # skip the layer\n                attn = None\n            else:\n                x, attn = encoder_layer(x, attention_mask)\n\n            if self.output_attentions:\n                all_attentions.append(attn)\n\n        if self.layer_norm:\n            x = self.layer_norm(x)\n        if self.output_hidden_states:\n            encoder_states.append(x)\n\n        # T x B x C -> B x T x C\n        encoder_states = [hidden_state.transpose(0, 1) for hidden_state in encoder_states]\n        x = x.transpose(0, 1)\n\n        return x, encoder_states, all_attentions\n\n\nclass DecoderLayer(nn.Module):\n    def __init__(self, config: BartConfig):\n        super().__init__()\n        self.embed_dim = config.d_model\n        self.output_attentions = config.output_attentions\n        self.self_attn = SelfAttention(\n            embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,\n        )\n        self.dropout = config.dropout\n        self.activation_fn = ACT2FN[config.activation_function]\n        self.activation_dropout = config.activation_dropout\n        self.normalize_before = config.normalize_before\n\n        self.self_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.encoder_attn = SelfAttention(\n            self.embed_dim,\n            config.decoder_attention_heads,\n            dropout=config.attention_dropout,\n            encoder_decoder_attention=True,\n        )\n        self.encoder_attn_layer_norm = LayerNorm(self.embed_dim)\n        self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)\n        self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)\n        self.final_layer_norm = LayerNorm(self.embed_dim)\n\n    def forward(\n        self,\n        x,\n        encoder_hidden_states,\n        encoder_attn_mask=None,\n        layer_state=None,\n        causal_mask=None,\n        decoder_padding_mask=None,\n    ):\n        residual = x\n\n        if layer_state is None:\n            layer_state = {}\n        if self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n        # Self Attention\n\n        x, self_attn_weights = self.self_attn(\n            query=x,\n            key=x,\n            
layer_state=layer_state,  # adds keys to layer state\n            key_padding_mask=decoder_padding_mask,\n            attn_mask=causal_mask,\n            need_weights=self.output_attentions,\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.self_attn_layer_norm(x)\n\n        # Cross attention\n        residual = x\n        assert self.encoder_attn.cache_key != self.self_attn.cache_key\n        if self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n        x, _ = self.encoder_attn(\n            query=x,\n            key=encoder_hidden_states,\n            key_padding_mask=encoder_attn_mask,\n            layer_state=layer_state,  # mutates layer state\n        )\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.encoder_attn_layer_norm(x)\n\n        # Fully Connected\n        residual = x\n        if self.normalize_before:\n            x = self.final_layer_norm(x)\n        x = self.activation_fn(self.fc1(x))\n        x = F.dropout(x, p=self.activation_dropout, training=self.training)\n        x = self.fc2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        x = residual + x\n        if not self.normalize_before:\n            x = self.final_layer_norm(x)\n        return (\n            x,\n            self_attn_weights,\n            layer_state,\n        )  # just self_attn weights for now, following t5, layer_state = cache for decoding\n\n\nclass BartDecoder(nn.Module):\n    \"\"\"\n    Transformer decoder consisting of *config.decoder_layers* layers. Each layer\n    is a :class:`DecoderLayer`.\n    Args:\n        config: BartConfig\n        embed_tokens (torch.nn.Embedding): output embedding\n    \"\"\"\n\n    def __init__(self, config: BartConfig, embed_tokens: nn.Embedding):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.dropout = config.dropout\n        self.layerdrop = config.decoder_layerdrop\n        self.padding_idx = embed_tokens.padding_idx\n        self.max_target_positions = config.max_position_embeddings\n        self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0\n        self.embed_tokens = embed_tokens\n        if config.static_position_embeddings:\n            self.embed_positions = SinusoidalPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, config.pad_token_id\n            )\n        else:\n            self.embed_positions = LearnedPositionalEmbedding(\n                config.max_position_embeddings, config.d_model, self.padding_idx,\n            )\n        self.layers = nn.ModuleList(\n            [DecoderLayer(config) for _ in range(config.decoder_layers)]\n        )  # type: List[DecoderLayer]\n        self.layernorm_embedding = LayerNorm(config.d_model) if config.normalize_embedding else nn.Identity()\n        self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else None\n\n    def forward(\n        self,\n        input_ids,\n        encoder_hidden_states,\n        encoder_padding_mask,\n        decoder_padding_mask,\n        decoder_causal_mask,\n        decoder_cached_states=None,\n        use_cache=False,\n        **unused\n    ):\n        \"\"\"\n        Includes several features from \"Jointly Learning to 
Align and\n        Translate with Transformer Models\" (Garg et al., EMNLP 2019).\n\n        Args:\n            input_ids (LongTensor): previous decoder outputs of shape\n                `(batch, tgt_len)`, for teacher forcing\n            encoder_hidden_states: output from the encoder, used for\n                encoder-side attention\n            encoder_padding_mask: for ignoring pad tokens\n            decoder_cached_states (dict or None): dictionary used for storing state during generation\n\n        Returns:\n            tuple:\n                - the decoder's features of shape `(batch, tgt_len, embed_dim)`\n                - hidden states\n                - attentions\n        \"\"\"\n        # check attention mask and invert\n        if encoder_padding_mask is not None:\n            encoder_padding_mask = invert_mask(encoder_padding_mask)\n\n        # embed positions\n        positions = self.embed_positions(input_ids, use_cache=use_cache)\n\n        if use_cache:\n            input_ids = input_ids[:, -1:]\n            positions = positions[:, -1:]  # happens after we embed them\n            # assert input_ids.ne(self.padding_idx).any()\n\n        x = self.embed_tokens(input_ids) * self.embed_scale\n        x += positions\n        x = self.layernorm_embedding(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n\n        # Convert to Bart output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        # decoder layers\n        all_hidden_states = ()\n        all_self_attns = ()\n        next_decoder_cache = []\n        for idx, decoder_layer in enumerate(self.layers):\n            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)\n            if self.output_hidden_states:\n                all_hidden_states += (x,)\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            layer_state = decoder_cached_states[idx] if decoder_cached_states is not None else None\n\n            x, layer_self_attn, layer_past = decoder_layer(\n                x,\n                encoder_hidden_states,\n                encoder_attn_mask=encoder_padding_mask,\n                decoder_padding_mask=decoder_padding_mask,\n                layer_state=layer_state,\n                causal_mask=decoder_causal_mask,\n            )\n\n            if use_cache:\n                next_decoder_cache.append(layer_past.copy())\n\n            if self.layer_norm and (idx == len(self.layers) - 1):  # last layer of mbart\n                x = self.layer_norm(x)\n            if self.output_attentions:\n                all_self_attns += (layer_self_attn,)\n\n        # Convert to standard output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)\n        all_hidden_states = [hidden_state.transpose(0, 1) for hidden_state in all_hidden_states]\n        x = x.transpose(0, 1)\n        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)\n\n        if use_cache:\n            next_cache = ((encoder_hidden_states, encoder_padding_mask), next_decoder_cache)\n        else:\n            next_cache = None\n        return x, next_cache, all_hidden_states, list(all_self_attns)\n\n\ndef _reorder_buffer(attn_cache, new_order):\n    for k, input_buffer_k in attn_cache.items():\n        if input_buffer_k is not None:\n            attn_cache[k] = 
input_buffer_k.index_select(0, new_order)\n    return attn_cache\n\n\nclass SelfAttention(nn.Module):\n    \"\"\"Multi-headed attention from 'Attention Is All You Need' paper\"\"\"\n\n    def __init__(\n        self,\n        embed_dim,\n        num_heads,\n        dropout=0.0,\n        bias=True,\n        encoder_decoder_attention=False,  # otherwise self_attention\n    ):\n        super().__init__()\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.dropout = dropout\n        self.head_dim = embed_dim // num_heads\n        assert self.head_dim * num_heads == self.embed_dim, \"embed_dim must be divisible by num_heads\"\n        self.scaling = self.head_dim ** -0.5\n\n        self.encoder_decoder_attention = encoder_decoder_attention\n        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)\n        self.cache_key = \"encoder_decoder\" if self.encoder_decoder_attention else \"self\"\n\n    def _shape(self, tensor, dim_0, bsz):\n        return tensor.contiguous().view(dim_0, bsz * self.num_heads, self.head_dim).transpose(0, 1)\n\n    def forward(\n        self,\n        query,\n        key: Optional[Tensor],\n        key_padding_mask: Optional[Tensor] = None,\n        layer_state: Optional[Dict[str, Optional[Tensor]]] = None,\n        attn_mask: Optional[Tensor] = None,\n        need_weights=False,\n    ) -> Tuple[Tensor, Optional[Tensor]]:\n        \"\"\"Input shape: Time(SeqLen) x Batch x Channel\"\"\"\n        static_kv: bool = self.encoder_decoder_attention\n        tgt_len, bsz, embed_dim = query.size()\n        assert embed_dim == self.embed_dim\n        assert list(query.size()) == [tgt_len, bsz, embed_dim]\n        # get here for encoder decoder cause of static_kv\n        if layer_state is not None:  # reuse k,v and encoder_padding_mask\n            saved_state = layer_state.get(self.cache_key, {})\n            if \"prev_key\" in saved_state:\n                # previous time steps are cached - no need to recompute key and value if they are static\n                if static_kv:\n                    key = None\n        else:\n            saved_state = None\n            layer_state = {}\n\n        q = self.q_proj(query) * self.scaling\n        if static_kv:\n            if key is None:\n                k = v = None\n            else:\n                k = self.k_proj(key)\n                v = self.v_proj(key)\n        else:\n            k = self.k_proj(query)\n            v = self.v_proj(query)\n\n        q = self._shape(q, tgt_len, bsz)\n        if k is not None:\n            k = self._shape(k, -1, bsz)\n        if v is not None:\n            v = self._shape(v, -1, bsz)\n\n        if saved_state is not None:\n            k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)\n\n        # Update cache\n        layer_state[self.cache_key] = {\n            \"prev_key\": k.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_value\": v.view(bsz, self.num_heads, -1, self.head_dim),\n            \"prev_key_padding_mask\": key_padding_mask if not static_kv else None,\n        }\n\n        assert k is not None\n        src_len = k.size(1)\n        attn_weights = torch.bmm(q, k.transpose(1, 2))\n        assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)\n\n        if attn_mask 
is not None:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_mask\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n\n        # This is part of a workaround to get around fork/join parallelism not supporting Optional types.\n        if key_padding_mask is not None and key_padding_mask.dim() == 0:\n            key_padding_mask = None\n        assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)\n\n        if key_padding_mask is not None:  # don't attend to padding symbols\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n            reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)\n            attn_weights = attn_weights.masked_fill(reshaped, float(\"-inf\"))\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n        attn_weights = F.softmax(attn_weights, dim=-1)\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)\n\n        assert v is not None\n        attn_output = torch.bmm(attn_probs, v)\n        assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)\n        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)\n        attn_output = self.out_proj(attn_output)\n        if need_weights:\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)\n        else:\n            attn_weights = None\n        return attn_output, attn_weights\n\n    def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):\n        # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)\n        if \"prev_key\" in saved_state:\n            _prev_key = saved_state[\"prev_key\"]\n            assert _prev_key is not None\n            prev_key = _prev_key.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                k = prev_key\n            else:\n                assert k is not None\n                k = torch.cat([prev_key, k], dim=1)\n        if \"prev_value\" in saved_state:\n            _prev_value = saved_state[\"prev_value\"]\n            assert _prev_value is not None\n            prev_value = _prev_value.view(bsz * self.num_heads, -1, self.head_dim)\n            if static_kv:\n                v = prev_value\n            else:\n                assert v is not None\n                v = torch.cat([prev_value, v], dim=1)\n        assert k is not None and v is not None\n        prev_key_padding_mask: Optional[Tensor] = saved_state.get(\"prev_key_padding_mask\", None)\n        key_padding_mask = self._cat_prev_key_padding_mask(\n            key_padding_mask, prev_key_padding_mask, bsz, k.size(1), static_kv\n        )\n        return k, v, key_padding_mask\n\n    @staticmethod\n    def _cat_prev_key_padding_mask(\n        key_padding_mask: Optional[Tensor],\n        prev_key_padding_mask: Optional[Tensor],\n        batch_size: int,\n        src_len: int,\n        static_kv: bool,\n    ) -> Optional[Tensor]:\n        # saved key padding masks have shape (bsz, seq_len)\n        if prev_key_padding_mask is not None:\n            if static_kv:\n                new_key_padding_mask = prev_key_padding_mask\n            else:\n                new_key_padding_mask = torch.cat([prev_key_padding_mask, key_padding_mask], dim=1)\n\n        elif key_padding_mask is not None:\n            filler = torch.zeros(\n                batch_size,\n                src_len - 
key_padding_mask.size(1),\n                dtype=key_padding_mask.dtype,\n                device=key_padding_mask.device,\n            )\n            new_key_padding_mask = torch.cat([filler, key_padding_mask], dim=1)\n        else:\n            new_key_padding_mask = prev_key_padding_mask\n        return new_key_padding_mask\n\n\nclass BartClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    # This can trivially be shared with RobertaClassificationHead\n\n    def __init__(\n        self, input_dim, inner_dim, num_classes, pooler_dropout,\n    ):\n        super().__init__()\n        self.dense = nn.Linear(input_dim, inner_dim)\n        self.dropout = nn.Dropout(p=pooler_dropout)\n        self.out_proj = nn.Linear(inner_dim, num_classes)\n\n    def forward(self, x):\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\nclass LearnedPositionalEmbedding(nn.Embedding):\n    \"\"\"\n    This module learns positional embeddings up to a fixed maximum size.\n    Padding ids are ignored by either offsetting based on padding_idx\n    or by setting padding_idx to None and ensuring that the appropriate\n    position ids are passed to the forward function.\n    \"\"\"\n\n    def __init__(\n        self, num_embeddings: int, embedding_dim: int, padding_idx: int,\n    ):\n        # if padding_idx is specified then offset the embedding ids by\n        # this index and adjust num_embeddings appropriately\n        assert padding_idx is not None\n        num_embeddings += padding_idx + 1  # WHY?\n        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)\n\n    def forward(self, input, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        if use_cache:  # the position is our current step in the decoded sequence\n            pos = int(self.padding_idx + input.size(1))\n            positions = input.data.new(1, 1).fill_(pos)\n        else:\n            positions = create_position_ids_from_input_ids(input, self.padding_idx)\n        return super().forward(positions)\n\n\ndef LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True):\n    if torch.cuda.is_available():\n        try:\n            from apex.normalization import FusedLayerNorm\n\n            return FusedLayerNorm(normalized_shape, eps, elementwise_affine)\n        except ImportError:\n            pass\n    return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)\n\n\ndef fill_with_neg_inf(t):\n    \"\"\"FP16-compatible function that fills a input_ids with -inf.\"\"\"\n    return t.float().fill_(float(\"-inf\")).type_as(t)\n\n\ndef _filter_out_falsey_values(tup) -> Tuple:\n    \"\"\"Remove entries that are None or [] from an iterable.\"\"\"\n    return tuple(x for x in tup if isinstance(x, torch.Tensor) or x)\n\n\n# Public API\ndef _get_shape(t):\n    return getattr(t, \"shape\", None)\n\n\n@add_start_docstrings(\n    \"The bare BART Model outputting raw hidden-states without any specific head on top.\", BART_START_DOCSTRING,\n)\nclass BartModel(PretrainedBartModel):\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        padding_idx, vocab_size = config.pad_token_id, config.vocab_size\n        self.shared = nn.Embedding(vocab_size, config.d_model, 
padding_idx)\n\n        self.encoder = BartEncoder(config, self.shared)\n        self.decoder = BartDecoder(config, self.shared)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        decoder_input_ids=None,\n        encoder_outputs: Optional[Tuple] = None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        use_cache=False,\n    ):\n\n        # make masks if user doesn't supply\n        if not use_cache:\n            decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(\n                self.config,\n                input_ids,\n                decoder_input_ids=decoder_input_ids,\n                decoder_padding_mask=decoder_attention_mask,\n                causal_mask_dtype=self.shared.weight.dtype,\n            )\n        else:\n            decoder_padding_mask, causal_mask = None, None\n\n        assert decoder_input_ids is not None\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)\n        assert isinstance(encoder_outputs, tuple)\n        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            encoder_outputs[0],\n            attention_mask,\n            decoder_padding_mask,\n            decoder_causal_mask=causal_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        # Attention and hidden_states will be [] or None if they aren't needed\n        decoder_outputs: Tuple = _filter_out_falsey_values(decoder_outputs)\n        assert isinstance(decoder_outputs[0], torch.Tensor)\n        encoder_outputs: Tuple = _filter_out_falsey_values(encoder_outputs)\n        return decoder_outputs + encoder_outputs\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, value):\n        self.shared = value\n        self.encoder.embed_tokens = self.shared\n        self.decoder.embed_tokens = self.shared\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"The BART Model with a language modeling head. 
Can be used for summarization.\",\n    BART_START_DOCSTRING + BART_GENERATION_EXAMPLE,\n)\nclass BartForConditionalGeneration(PretrainedBartModel):\n    base_model_prefix = \"model\"\n\n    def __init__(self, config: BartConfig):\n        super().__init__(config)\n        base_model = BartModel(config)\n        self.model = base_model\n        self.register_buffer(\"final_logits_bias\", torch.zeros((1, self.model.shared.num_embeddings)))\n\n    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:\n        old_num_tokens = self.model.shared.num_embeddings\n        new_embeddings = super().resize_token_embeddings(new_num_tokens)\n        self.model.shared = new_embeddings\n        self._resize_final_logits_bias(new_num_tokens, old_num_tokens)\n        return new_embeddings\n\n    def _resize_final_logits_bias(self, new_num_tokens: int, old_num_tokens: int) -> None:\n        if new_num_tokens <= old_num_tokens:\n            new_bias = self.final_logits_bias[:, :new_num_tokens]\n        else:\n            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)\n            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)\n        self.register_buffer(\"final_logits_bias\", new_bias)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_cached_states=None,\n        lm_labels=None,\n        use_cache=False,\n        **unused\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring).\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens\n            with labels\n            in ``[0, ..., config.vocab_size]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            
heads.\n\n    Examples::\n\n            # Mask filling only works for bart-large\n            from transformers1 import BartTokenizer, BartForConditionalGeneration\n            tokenizer = BartTokenizer.from_pretrained('bart-large')\n            TXT = \"My friends are <mask> but they eat too many carbs.\"\n            model = BartForConditionalGeneration.from_pretrained('bart-large')\n            input_ids = tokenizer.batch_encode_plus([TXT], return_tensors='pt')['input_ids']\n            logits = model(input_ids)[0]\n            masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()\n            probs = logits[0, masked_index].softmax(dim=0)\n            values, predictions = probs.topk(5)\n            tokenizer.decode(predictions).split()\n            # ['good', 'great', 'all', 'really', 'very']\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            encoder_outputs=encoder_outputs,\n            decoder_attention_mask=decoder_attention_mask,\n            decoder_cached_states=decoder_cached_states,\n            use_cache=use_cache,\n        )\n        lm_logits = F.linear(outputs[0], self.model.shared.weight, bias=self.final_logits_bias)\n        outputs = (lm_logits,) + outputs[1:]  # Add cache, hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # TODO(SS): do we need to ignore pad tokens in lm_labels?\n            masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs\n\n    def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step, decoder_cached_states are empty\n        if not past[1]:\n            encoder_outputs, decoder_cached_states = past, None\n        else:\n            encoder_outputs, decoder_cached_states = past\n        return {\n            \"input_ids\": None,  # encoder_outputs is defined. 
input_ids not needed\n            \"encoder_outputs\": encoder_outputs,\n            \"decoder_cached_states\": decoder_cached_states,\n            \"decoder_input_ids\": decoder_input_ids,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,  # change this to avoid caching (presumably for debugging)\n        }\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        if cur_len == 1:\n            self._force_token_ids_generation(logits, self.config.bos_token_id)\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n\n    def _force_token_ids_generation(self, scores, token_ids) -> None:\n        \"\"\"force one of token_ids to be generated by setting prob of all other tokens to 0\"\"\"\n        if isinstance(token_ids, int):\n            token_ids = [token_ids]\n        all_but_token_ids_mask = torch.tensor(\n            [x for x in range(self.config.vocab_size) if x not in token_ids],\n            dtype=torch.long,\n            device=next(self.parameters()).device,\n        )\n        assert len(scores.shape) == 2, \"scores should be of rank 2 with shape: [batch_size, vocab_size]\"\n        scores[:, all_but_token_ids_mask] = -float(\"inf\")\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        ((enc_out, enc_mask), decoder_cached_states) = past\n        reordered_past = []\n        for layer_past in decoder_cached_states:\n            # get the correct batch idx from decoder layer's batch dim for cross and self-attn\n            layer_past_new = {\n                attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items()\n            }\n            reordered_past.append(layer_past_new)\n\n        new_enc_out = enc_out if enc_out is None else enc_out.index_select(0, beam_idx)\n        new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select(0, beam_idx)\n\n        past = ((new_enc_out, new_enc_mask), reordered_past)\n        return past\n\n    def get_encoder(self):\n        return self.model.encoder\n\n    def get_output_embeddings(self):\n        return _make_linear_from_emb(self.model.shared)  # make it on the fly\n\n\n@add_start_docstrings(\n    \"\"\"Bart model with a sequence classification/head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BART_START_DOCSTRING,\n)\nclass BartForSequenceClassification(PretrainedBartModel):\n    def __init__(self, config: BartConfig, **kwargs):\n        super().__init__(config, **kwargs)\n        self.model = BartModel(config)\n        self.classification_head = BartClassificationHead(\n            config.d_model, config.d_model, config.num_labels, config.classif_dropout,\n        )\n        self.model._init_weights(self.classification_head.dense)\n        self.model._init_weights(self.classification_head.out_proj)\n\n    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BartConfig`) and inputs:\n            loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n                Classification loss (cross entropy)\n            logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n                Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n                Attentions weights after the attention softmax, used to compute the weighted average in the\n                self-attention\n                heads.\n\n    Examples::\n\n        from transformers1 import BartTokenizer, BartForSequenceClassification\n        import torch\n\n        tokenizer = BartTokenizer.from_pretrained('bart-large')\n        model = BartForSequenceClassification.from_pretrained('bart-large')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\",\n        add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.model(\n            input_ids,\n            attention_mask=attention_mask,\n            decoder_input_ids=decoder_input_ids,\n            decoder_attention_mask=decoder_attention_mask,\n            encoder_outputs=encoder_outputs,\n        )\n        x = outputs[0]  # last hidden state\n        eos_mask = input_ids.eq(self.config.eos_token_id)\n        if 
len(torch.unique(eos_mask.sum(1))) > 1:\n            raise ValueError(\"All examples must have the same number of <eos> tokens.\")\n        sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]\n        logits = self.classification_head(sentence_representation)\n        # Prepend logits\n        outputs = (logits,) + outputs[1:]  # Add hidden states and attention if they are here\n        if labels is not None:  # prepend loss to output,\n            loss = F.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\nclass SinusoidalPositionalEmbedding(nn.Embedding):\n    \"\"\"This module produces sinusoidal positional embeddings of any length.\"\"\"\n\n    def __init__(self, num_positions, embedding_dim, padding_idx=None):\n        super().__init__(num_positions, embedding_dim)\n        if embedding_dim % 2 != 0:\n            raise NotImplementedError(f\"odd embedding_dim {embedding_dim} not supported\")\n        self.weight = self._init_weight(self.weight)\n\n    @staticmethod\n    def _init_weight(out: nn.Parameter):\n        \"\"\"Identical to the XLM create_sinusoidal_embeddings except features are not interleaved.\n            The cos features are in the 2nd half of the vector. [dim // 2:]\n        \"\"\"\n        n_pos, dim = out.shape\n        position_enc = np.array(\n            [[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)]\n        )\n        out[:, 0 : dim // 2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # This line breaks for odd n_pos\n        out[:, dim // 2 :] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n        out.detach_()\n        out.requires_grad = False\n        return out\n\n    @torch.no_grad()\n    def forward(self, input_ids, use_cache=False):\n        \"\"\"Input is expected to be of size [bsz x seqlen].\"\"\"\n        bsz, seq_len = input_ids.shape[:2]\n        if use_cache:\n            positions = input_ids.data.new(1, 1).fill_(seq_len - 1)  # called before slicing\n        else:\n            # starts at 0, ends at 1-seq_len\n            positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)\n        return super().forward(positions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_beam_search.py",
    "content": "# coding=utf-8\n# Copyright (c) 2019 Yang Liu\n\n# Permission is hereby granted, free of charge, to any person obtaining a copy\n# of this software and associated documentation files (the \"Software\"), to deal\n# in the Software without restriction, including without limitation the rights\n# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n# copies of the Software, and to permit persons to whom the Software is\n# furnished to do so, subject to the following conditions:\n\n# The above copyright notice and this permission notice shall be included in all\n# copies or substantial portions of the Software.\n\n# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n# SOFTWARE.\n\"\"\"\nA general wrapper around models with LM heads to generate sequences\nusing beam search.\n\"\"\"\nimport torch\nfrom torch import nn\n\n\nclass TransformerBeamSearch(nn.Module):\n    def __init__(\n        self,\n        model,\n        tokenizer,\n        batch_size,\n        beam_size,\n        min_length,\n        max_length,\n        alpha=0,\n        block_repeating_trigram=True,\n    ):\n        \"\"\"\n        Attributes:\n            mask_word_id: token id that corresponds to the mask\n        \"\"\"\n        super(TransformerBeamSearch, self).__init__()\n        self.model = model\n        self.tokenizer = tokenizer\n\n        self.start_token_id = tokenizer.start_token_id\n        self.end_token_id = tokenizer.end_token_id\n        self.pad_token_id = tokenizer.pad_token_id\n\n        self.beam_size = beam_size\n        self.min_length = min_length\n        self.max_length = max_length\n\n        self.block_repeating_trigram = block_repeating_trigram\n        self.apply_length_penalty = False if alpha == 0 else True\n        self.alpha = alpha\n\n        # State of the beam\n        self.hypotheses = [[] for _ in range(batch_size)]\n        self.batch_offset = torch.arange(batch_size, dtype=torch.long)\n        self.beam_offset = torch.arange(\n            0, batch_size * self.beam_size, step=self.beam_size, dtype=torch.long\n        )\n        self.growing_beam = torch.full(\n            (batch_size * self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        self.topk_log_probabilities = torch.tensor(\n            [0.0] + [float(\"-inf\")] * (self.beam_size - 1), dtype=torch.float\n        ).repeat(batch_size)\n        self.results = {\n            \"prediction\": [[] for _ in batch_size],\n            \"scores\": [[] for _ in batch_size],\n        }\n        self._step = 0\n        self.is_done = False\n\n    def step(self, log_probabilities):\n        \"\"\" Grows the beam by one step. 
\"\"\"\n        self._step += 1\n\n        # The batch size changes as some beams finish so we define _B\n        vocab_size = log_probabilities.size(-1)\n        _B = log_probabilities.size(0) // self.beam_size\n\n        # Multiply each beam probability with the probability of the\n        # next token (conditioned on the words in the beam).\n        log_probabilities += self.topk_log_probabilities.view(-1, 1)\n\n        self.enforce_min_length(log_probabilities)\n        if self.block_repeating_trigram:\n            self.remove_repeating_trigrams(log_probabilities, _B)\n\n        # Find the `beam_size` (previous_beam + token) combinations with\n        # the highest score\n        topk_log_probabilities, topk_ids = log_probabilities.topk(\n            log_probabilities.view(_B, self.beam_size * vocab_size),\n            self.beam_size,\n            dim=1,\n        )\n\n        # Apply the length penalty. The +1 accounts for the [EOS] token\n        # that will be added if the beam ends.\n        topk_scores = topk_log_probabilities / self.length_penalty()\n\n        # Retrieve the corresponding respective beam and token id\n        # topk_token_ids[i] will be added to topk_beam_ids[i]\n        topk_beam_ids = topk_ids.div(vocab_size)\n        topk_token_ids = topk_ids.fmod(vocab_size)\n\n        # Retrieve the row index of the surviving beams in the original\n        # view of the log_probabilities tensor\n        surviving_beams_rows = (topk_beam_ids + self.beam_offset[:_B].view(-1, 1)).view(\n            -1\n        )\n\n        # Append the last predictions\n        self.growing_beam = torch.cat(\n            [\n                self.growing_beam.index_select(0, surviving_beams_rows),\n                topk_token_ids.view(-1, 1),\n            ],\n            1,\n        )\n\n        # Check if any of the beam searches has ended during this\n        # growth step. 
Also if top beam (most probable) has ended\n        # for one element of the batch.\n        is_finished = topk_token_ids.eq(self.end_token_id)\n        self.enforce_max_length(is_finished)\n        is_top_beam_finished = is_finished[:, 0].eq(1)\n\n        # Save the finished searches\n        if is_finished.any():\n            predictions = self.growing_beam.view(\n                -1, self.beam_size, self.growing_beam.size(1)\n            )\n            for i in range(is_finished.size(0)):\n                if is_top_beam_finished[i]:\n                    is_finished[i].fill_(1)\n                finished_hyp = is_finished[i].nonzero().view(-1)\n\n                # Store finished hypotheses for this batch.\n                b = self.batch_offset[i]\n                for j in finished_hyp:\n                    self.hypotheses[b].append((topk_scores[i, j], predictions[i, j, :]))\n\n                # If the batch reached the end, save the best hypotheses\n                # in terms of length-penalized score.\n                if is_top_beam_finished[i]:\n                    best_hyp = sorted(\n                        self.hypotheses[b], key=lambda x: x[0], reverse=True\n                    )\n                    best_score, best_prediction = best_hyp[0]\n                    self.results[\"scores\"][b].append(best_score)\n                    self.results[\"predictions\"][b].append(best_prediction)\n\n            non_finished = is_top_beam_finished.eq(0).nonzero().view(-1)\n            if len(non_finished) == 0:\n                self.is_done = True\n\n            # Remove finished batches for the next step.\n            topk_log_probabilities = topk_log_probabilities.index_select(\n                0, non_finished\n            )\n            self.batch_offset = self.batch_offset.index_select(0, non_finished)\n            self.growing_beam = predictions.index_select(0, non_finished).view(\n                -1, self.growing_beam.size(-1)\n            )\n\n            surviving_beams_rows = surviving_beams_rows.index_select(0, non_finished)\n\n        return surviving_beams_rows\n\n    def forward(self, encoder_input_ids, **kwargs):\n        # keyword arguments come in 3 flavors: encoder-specific (prefixed by\n        # `encoder_`), decoder-specific (prefixed by `decoder_`) and those\n        # that apply to the model as whole.\n        # We let the specific kwargs override the common ones in case of conflict.\n        kwargs_encoder = {\n            argument[len(\"encoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"encoder_\")\n        }\n        kwargs_decoder = {\n            argument[len(\"decoder_\"):]: value\n            for argument, value in kwargs.items()\n            if argument.startswith(\"decoder_\")\n        }\n        kwargs_common = {\n            argument: value\n            for argument, value in kwargs.items()\n            if not (argument.startswith(\"encoder_\") or argument.startswith(\"decoder_\"))\n        }\n        kwargs_decoder = dict(kwargs_common, **kwargs_decoder)\n        kwargs_encoder = dict(kwargs_common, **kwargs_encoder)\n\n        # forward pass on the encoder\n        encoder_outputs = self.model.encoder.forward(encoder_input_ids, **kwargs_encoder)\n        kwargs_decoder[\"encoder_hidden_states\"] = tile(\n            encoder_outputs, self.beam_size, dim=0\n        )\n\n        # grow the beam by generating sequences in an autoregressive way\n        self.growing_beam = torch.full(\n            (self.batch_size * 
self.beam_size, 1), self.start_token_id, dtype=torch.long\n        )\n        for step in range(self.max_length):\n            decoder_input = self.growing_beam[:, -1]\n            outputs = self.model.decoder(decoder_input, **kwargs_decoder)\n            log_probabilities = torch.nn.functional.log_softmax(outputs[1], dim=-1)\n            surviving_beams_rows = self.step(log_probabilities)\n            if self.is_done:\n                break\n\n            kwargs_decoder[\"encoder_hidden_states\"] = kwargs_decoder[\n                \"encoder_hidden_states\"\n            ].index_select(0, surviving_beams_rows)\n\n        return self.results\n\n    def remove_repeating_trigrams(self, log_probabilities, _B):\n        if self._step + 1 > 3:\n            for i in range(_B * self.beam_size):\n                tokens = [t for t in self.growing_beam[i]]\n                trigrams = [(tokens[j - 1], tokens[j], tokens[j + 1]) for j in range(1, len(tokens) - 1)]\n                last_trigram = tuple(trigrams[-1])\n                if last_trigram in trigrams[:-1]:\n                    log_probabilities[i] = -1e20\n\n    def enforce_min_length(self, log_probabilities):\n        if self._step < self.min_length:\n            log_probabilities[:, self.end_token_id] = -1e20\n\n    def enforce_max_length(self, is_finished):\n        if self._step + 1 == self.max_length:\n            is_finished.fill_(1)\n\n    def length_penalty(self):\n        return ((5.0 + (self._step + 1)) / 6.0) ** self.alpha\n\n\ndef tile(x, count, dim=0):\n    \"\"\"\n    Tiles `x` along dimension `dim` `count` times.\n\n    Example:\n        >> ex = torch.tensor([[1,2],[3,4]])\n        >> tile(ex, 2, 0)\n        torch.Tensor([[1,2],[1,2],[3,4],[3,4]])\n    \"\"\"\n    perm = list(range(len(x.size())))\n    if dim != 0:\n        perm[0], perm[dim] = perm[dim], perm[0]\n        x = x.permute(perm).contiguous()\n    out_size = list(x.size())\n    out_size[0] *= count\n    batch = x.size(0)\n    x = (\n        x.view(batch, -1)\n        .transpose(0, 1)\n        .repeat(count, 1)\n        .transpose(0, 1)\n        .contiguous()\n        .view(*out_size)\n    )\n    if dim != 0:\n        x = x.permute(perm).contiguous()\n    return x\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch BERT model. \"\"\"\n\n\nimport logging\nimport math\nimport os\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import gelu, gelu_new, swish\nfrom .configuration_bert import BertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"bert-base-german-dbmdz-cased\",\n    \"bert-base-german-dbmdz-uncased\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef load_tf_weights_in_bert(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n\n    for name, array in zip(names, arrays):\n        name = name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            continue\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"output_weights\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"squad\":\n                pointer = getattr(pointer, \"classifier\")\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if m_name[-11:] == \"_embeddings\":\n            pointer = getattr(pointer, \"weight\")\n        elif m_name == \"kernel\":\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\"gelu\": gelu, \"relu\": torch.nn.functional.relu, \"swish\": swish, \"gelu_new\": gelu_new, \"mish\": mish}\n\n\nBertLayerNorm = torch.nn.LayerNorm\n\n\nclass BertEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any 
TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n\n        seq_length = input_shape[1]\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nclass BertSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, \"embedding_size\"):\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = nn.Linear(config.hidden_size, self.all_head_size)\n        self.key = nn.Linear(config.hidden_size, self.all_head_size)\n        self.value = nn.Linear(config.hidden_size, self.all_head_size)\n\n        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x):\n        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        x = x.view(*new_x_shape)\n        return x.permute(0, 2, 1, 3)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        mixed_query_layer = self.query(hidden_states)\n\n        # If this is instantiated as a cross-attention module, the keys\n        # and values come from an encoder; the attention mask needs to be\n        # such that the encoder's padding tokens are not attended to.\n        if encoder_hidden_states is not None:\n            mixed_key_layer = self.key(encoder_hidden_states)\n            mixed_value_layer = self.value(encoder_hidden_states)\n            attention_mask = encoder_attention_mask\n        else:\n            mixed_key_layer = self.key(hidden_states)\n            mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer)\n        key_layer = self.transpose_for_scores(mixed_key_layer)\n        value_layer = 
self.transpose_for_scores(mixed_value_layer)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = nn.Softmax(dim=-1)(attention_scores)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = torch.matmul(attention_probs, value_layer)\n\n        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        context_layer = context_layer.view(*new_context_layer_shape)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass BertSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.self = BertSelfAttention(config)\n        self.output = BertSelfOutput(config)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)\n        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n\n        # Prune linear layers\n        self.self.query = prune_linear_layer(self.self.query, index)\n        self.self.key = prune_linear_layer(self.self.key, index)\n        self.self.value = prune_linear_layer(self.self.value, index)\n        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)\n\n        # Update hyper params and store pruned heads\n        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)\n        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        
head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_outputs = self.self(\n            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n        )\n        attention_output = self.output(self_outputs[0], hidden_states)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass BertIntermediate(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass BertOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n\n    def forward(self, hidden_states, input_tensor):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass BertLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.attention = BertAttention(config)\n        self.is_decoder = config.is_decoder\n        if self.is_decoder:\n            self.crossattention = BertAttention(config)\n        self.intermediate = BertIntermediate(config)\n        self.output = BertOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)\n        attention_output = self_attention_outputs[0]\n        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            cross_attention_outputs = self.crossattention(\n                attention_output, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask\n            )\n            attention_output = cross_attention_outputs[0]\n            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights\n\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.output(intermediate_output, attention_output)\n        outputs = (layer_output,) + outputs\n        return outputs\n\n\nclass BertEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        
all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask\n            )\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass BertPooler(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.activation = nn.Tanh()\n\n    def forward(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        pooled_output = self.activation(pooled_output)\n        return pooled_output\n\n\nclass BertPredictionHeadTransform(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass BertLMPredictionHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.transform = BertPredictionHeadTransform(config)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass BertOnlyMLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n\n    def forward(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass BertOnlyNSPHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, pooled_output):\n        
seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\nclass BertPreTrainingHeads(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.predictions = BertLMPredictionHead(config)\n        self.seq_relationship = nn.Linear(config.hidden_size, 2)\n\n    def forward(self, sequence_output, pooled_output):\n        prediction_scores = self.predictions(sequence_output)\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return prediction_scores, seq_relationship_score\n\n\nclass BertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    load_tf_weights = load_tf_weights_in_bert\n    base_model_prefix = \"bert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, BertLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass BertModel(BertPreTrainedModel):\n    \"\"\"\n\n    The model can behave as an encoder (with only self-attention) as well\n    as a decoder, in which case a layer of cross-attention is added between\n    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,\n    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.\n\n    To behave as an decoder the model needs to be initialized with the\n    :obj:`is_decoder` argument of the configuration set to :obj:`True`; an\n    :obj:`encoder_hidden_states` is expected as an input to the forward pass.\n\n    .. 
_`Attention is all you need`:\n        https://arxiv.org/abs/1706.03762\n\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n\n        self.embeddings = BertEmbeddings(config)\n        self.encoder = BertEncoder(config)\n        self.pooler = BertPooler(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`torch.FloatTensor`: of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during pre-training.\n\n            This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertModel, BertTokenizer\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertModel.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)\n\n        # If a 2D ou 3D attention mask is provided for the cross-attention\n        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n 
       # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and\n    a `next sentence prediction (classification)` head. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForPreTraining(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertPreTrainingHeads(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. 
Input should be a sequence pair (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False\n            continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForPreTraining\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForPreTraining.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        if masked_lm_labels is not None and next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            total_loss = masked_lm_loss + next_sentence_loss\n            outputs = (total_loss,) + outputs\n\n        return outputs  # 
(loss), prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. \"\"\", BERT_START_DOCSTRING)\nclass BertForMaskedLM(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyMLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.cls.predictions.decoder\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the left-to-right language modeling loss (next word prediction).\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        ltr_lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):\n                Next token prediction loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n     
       from transformers1 import BertTokenizer, BertForMaskedLM\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertForMaskedLM.from_pretrained('bert-base-uncased')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n        )\n\n        sequence_output = outputs[0]\n        prediction_scores = self.cls(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        # Although this may seem awkward, BertForMaskedLM supports two scenarios:\n        # 1. If a tensor that contains the indices of masked labels is provided,\n        #    the cross-entropy is the MLM cross-entropy that measures the likelihood\n        #    of predictions for masked words.\n        # 2. If `lm_labels` is provided we are in a causal scenario where we\n        #    try to predict the next token for each input in the decoder.\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()  # -100 index = padding token\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        if lm_labels is not None:\n            # we are doing next-token prediction; shift prediction scores and input ids by one\n            prediction_scores = prediction_scores[:, :-1, :].contiguous()\n            lm_labels = lm_labels[:, 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            ltr_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), lm_labels.view(-1))\n            outputs = (ltr_lm_loss,) + outputs\n\n        return outputs  # (ltr_lm_loss), (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_kwargs):\n        input_shape = input_ids.shape\n        effective_batch_size = input_shape[0]\n\n        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly\n        if attention_mask is None:\n            attention_mask = input_ids.new_ones(input_shape)\n\n        # if model is does not use a causal mask then add a dummy token\n        if self.config.is_decoder is False:\n            assert self.config.pad_token_id is not None, \"The PAD token should be defined for generation\"\n            attention_mask = torch.cat(\n                [attention_mask, attention_mask.new_zeros((attention_mask.shape[0], 1))], dim=-1\n            )\n\n            dummy_token = torch.full(\n                (effective_batch_size, 1), self.config.pad_token_id, dtype=torch.long, device=input_ids.device\n            )\n            input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        return {\"input_ids\": input_ids, \"attention_mask\": attention_mask}\n\n\n@add_start_docstrings(\n    
\"\"\"Bert Model with a `next sentence prediction (classification)` head on top. \"\"\", BERT_START_DOCSTRING,\n)\nclass BertForNextSentencePrediction(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.cls = BertOnlyNSPHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        next_sentence_label=None,\n    ):\n        r\"\"\"\n        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates sequence B is a continuation of sequence A,\n            ``1`` indicates sequence B is a random sequence.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):\n            Next sequence prediction (classification) loss.\n        seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForNextSentencePrediction\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')\n\n        loss, logits = model(**encoding, next_sentence_label=torch.LongTensor([1]))\n        assert logits[0, 0] < logits[0, 1] # next sentence was random\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            
head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        seq_relationship_score = self.cls(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n        if next_sentence_label is not None:\n            loss_fct = CrossEntropyLoss()\n            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))\n            outputs = (next_sentence_loss,) + outputs\n\n        return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForSequenceClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import 
torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForMultipleChoice(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForMultipleChoice\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        labels = torch.tensor(0) # choice0 is correct (according to Wikipedia ;))\n\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='pt', pad_to_max_length=True)\n        outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1\n\n        # the linear classifier still needs to be trained\n        loss, logits = outputs[:2]\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        input_ids = input_ids.view(-1, input_ids.size(-1))\n        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForTokenClassification(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForTokenClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForTokenClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # 
Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear\n    layers on top of the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    BERT_START_DOCSTRING,\n)\nclass BertForQuestionAnswering(BertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.bert = BertModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForQuestionAnswering\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n\n        outputs = self.bert(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2019 Inria, Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch CamemBERT model. \"\"\"\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForQuestionAnswering,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nCAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"camembert-base\",\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the\n            configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD\n    (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits` \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass CamembertForQuestionAnswering(RobertaForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nCTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model_size)\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size, dtype):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(\n        torch.arange(position, dtype=dtype).unsqueeze(1),\n        torch.arange(d_model_size, dtype=dtype).unsqueeze(0),\n        d_model_size,\n    )\n\n    sines = torch.sin(angle_rads[:, 0::2])\n    cosines = torch.cos(angle_rads[:, 1::2])\n\n    pos_encoding = torch.cat([sines, cosines], dim=-1)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = torch.matmul(q, k.permute(0, 1, 3, 2))\n\n    dk = k.shape[-1]\n    scaled_attention_logits = matmul_qk / np.sqrt(dk)\n\n    if mask is not None:\n        nd, ns = scaled_attention_logits.size(-2), scaled_attention_logits.size(-1)\n        scaled_attention_logits += mask[ns - nd : ns, :ns] * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = torch.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass MultiHeadAttention(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, output_attentions=False):\n        super().__init__()\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wk = torch.nn.Linear(d_model_size, d_model_size)\n        self.Wv = torch.nn.Linear(d_model_size, d_model_size)\n\n        self.dense = torch.nn.Linear(d_model_size, d_model_size)\n\n    def split_into_heads(self, x, batch_size):\n        x = x.reshape(batch_size, -1, self.num_heads, self.depth)\n        return x.permute([0, 2, 1, 3])\n\n    def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        batch_size = q.shape[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0], layer_past[1]\n            k = torch.cat((past_key, k), dim=-2)\n            v = torch.cat((past_value, v), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((k, v))\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = output[0].permute([0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff):\n    return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff), torch.nn.ReLU(), torch.nn.Linear(dff, d_model_size))\n\n\nclass EncoderLayer(torch.nn.Module):\n    def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):\n        super().__init__()\n\n        self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff)\n\n        self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n        self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)\n\n        self.dropout1 = torch.nn.Dropout(rate)\n        self.dropout2 = torch.nn.Dropout(rate)\n\n    def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            normed,\n            normed,\n            normed,\n            mask,\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\nclass CTRLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            
module.weight.data.fill_(1.0)\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should be passed as input_ids.\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The input_ids which have their past given to this model should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)\n\n        self.w = nn.Embedding(config.vocab_size, config.n_embd)\n\n        self.dropout = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList(\n            [\n                EncoderLayer(config.n_embd, config.n_head, config.dff, config.resid_pdrop, config.output_attentions)\n                for _ in range(config.n_layer)\n            ]\n        )\n        self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def set_input_embeddings(self, new_embeddings):\n        self.w = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to 
speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import CTRLTokenizer, CTRLModel\n        import torch\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLModel.from_pretrained('ctrl')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these 
entirely.\n            attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n            token_type_embeds = self.w(token_type_ids)\n            token_type_embeds *= np.sqrt(self.d_model_size)\n        else:\n            token_type_embeds = 0\n        position_ids = position_ids.view(-1, input_shape[-1])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids)\n        # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded\n        seq_len = input_shape[-1]\n        mask = torch.triu(torch.ones(seq_len + past_length, seq_len + past_length), 1).to(inputs_embeds.device)\n\n        inputs_embeds *= np.sqrt(self.d_model_size)\n\n        pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states)\n\n        output_shape = input_shape + (inputs_embeds.size(-1),)\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n            outputs = h(\n                hidden_states,\n                mask,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = hidden_states.view(*output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass CTRLLMHeadModel(CTRLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = CTRLModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import CTRLTokenizer, CTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = CTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch DistilBERT model\n    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)\n    and in part from HuggingFace PyTorch version of Google AI Bert model (https://github.com/google-research/bert)\n\"\"\"\n\n\nimport copy\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n\nDISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-german-cased\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\nclass Embeddings(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)\n        if config.sinusoidal_pos_embds:\n            create_sinusoidal_embeddings(\n                n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight\n            )\n\n        self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(self, input_ids):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: torch.tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: torch.tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        seq_length = input_ids.size(1)\n        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)  # (max_seq_length)\n        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (bs, max_seq_length)\n\n        word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, 
dim)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings)  # (bs, max_seq_length, dim)\n        return embeddings\n\n\nclass MultiHeadSelfAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = nn.Dropout(p=config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n        self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, query, key, value, mask, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        query: torch.tensor(bs, seq_length, dim)\n        key: torch.tensor(bs, seq_length, dim)\n        value: torch.tensor(bs, seq_length, dim)\n        mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: torch.tensor(bs, seq_length, dim)\n            Contextualized layer. 
Optional: only if `output_attentions=True`\n        \"\"\"\n        bs, q_length, dim = query.size()\n        k_length = key.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshp = (bs, 1, 1, k_length)\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)\n        mask = (mask == 0).view(mask_reshp).expand_as(scores)  # (bs, n_heads, q_length, k_length)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, q_length, k_length)\n\n        weights = nn.Softmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)\n        weights = self.dropout(weights)  # (bs, n_heads, q_length, k_length)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, weights)\n        else:\n            return (context,)\n\n\nclass FFN(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = nn.Dropout(p=config.dropout)\n        self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)\n        self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = gelu if config.activation == \"gelu\" else nn.ReLU()\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x)\n        return x\n\n\nclass TransformerBlock(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = MultiHeadSelfAttention(config)\n        self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n        self.ffn = FFN(config)\n        self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n        attn_mask: torch.tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: torch.tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: torch.tensor(bs, seq_length, dim)\n            The output of the transformer block 
contextualization.\n        \"\"\"\n        # Self-Attention\n        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask, head_mask=head_mask)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass Transformer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        layer = TransformerBlock(config)\n        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])\n\n    def forward(self, x, attn_mask=None, head_mask=None):\n        \"\"\"\n        Parameters\n        ----------\n        x: torch.tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: torch.tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: torch.tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module(x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i])\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass DistilBertPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained 
models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    load_tf_weights = None\n    base_model_prefix = \"distilbert\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, nn.Embedding):\n            if module.weight.requires_grad:\n                module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        if isinstance(module, nn.Linear):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.DistilBertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputting raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertModel(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = Embeddings(config)  # Embeddings\n        self.transformer = Transformer(config)  # Encoder\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings.word_embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.transformer.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertModel\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertModel.from_pretrained('distilbert-base-cased')\n\n        
input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)  # (bs, seq_length)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)\n        hidden_state = tfmr_output[0]\n        output = (hidden_state,) + tfmr_output[1:]\n\n        return output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForMaskedLM(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.distilbert = DistilBertModel(config)\n        self.vocab_transform = nn.Linear(config.dim, config.dim)\n        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)\n        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)\n\n        self.init_weights()\n\n        self.mlm_loss_fct = nn.CrossEntropyLoss()\n\n    def get_output_embeddings(self):\n        return self.vocab_projector\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, masked_lm_labels=None):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForMaskedLM\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        dlbrt_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = dlbrt_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = gelu(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)\n\n        outputs = (prediction_logits,) + dlbrt_output[1:]\n        if masked_lm_labels is not None:\n            mlm_loss = self.mlm_loss_fct(\n                prediction_logits.view(-1, prediction_logits.size(-1)), masked_lm_labels.view(-1)\n            )\n            outputs = (mlm_loss,) + outputs\n\n        return outputs  # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForSequenceClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.pre_classifier = nn.Linear(config.dim, config.dim)\n        self.classifier = nn.Linear(config.dim, config.num_labels)\n        self.dropout = nn.Dropout(config.seq_classif_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForSequenceClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = 
nn.ReLU()(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output)  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        if labels is not None:\n            if self.num_labels == 1:\n                loss_fct = nn.MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = nn.CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForQuestionAnswering(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.distilbert = DistilBertModel(config)\n        self.qa_outputs = nn.Linear(config.dim, config.num_labels)\n        assert config.num_labels == 2\n        self.dropout = nn.Dropout(config.qa_dropout)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus 
the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForQuestionAnswering\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss, start_scores, end_scores = outputs[:3]\n\n        \"\"\"\n        distilbert_output = self.distilbert(\n            input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n\n        hidden_states = self.dropout(hidden_states)  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)  # (bs, max_query_len)\n        end_logits = end_logits.squeeze(-1)  # (bs, max_query_len)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass DistilBertForTokenClassification(DistilBertPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.distilbert = DistilBertModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.DistilBertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import DistilBertTokenizer, DistilBertForTokenClassification\n        import torch\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.distilbert(\n            input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = 
torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_electra.py",
    "content": "import logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .activations import get_activation\nfrom .configuration_electra import ElectraConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertEncoder, BertLayerNorm, BertPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\ndef load_tf_weights_in_electra(model, config, tf_checkpoint_path, discriminator_or_generator=\"discriminator\"):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array)\n    for name, array in zip(names, arrays):\n        original_name: str = name\n\n        try:\n            if isinstance(model, ElectraForMaskedLM):\n                name = name.replace(\"electra/embeddings/\", \"generator/embeddings/\")\n\n            if discriminator_or_generator == \"generator\":\n                name = name.replace(\"electra/\", \"discriminator/\")\n                name = name.replace(\"generator/\", \"electra/\")\n\n            name = name.replace(\"dense_1\", \"dense_prediction\")\n            name = name.replace(\"generator_predictions/output_bias\", \"generator_lm_head/bias\")\n\n            name = name.split(\"/\")\n            # print(original_name, name)\n            # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n            # which are not required for using pretrained model\n            if any(n in [\"global_step\", \"temperature\"] for n in name):\n                logger.info(\"Skipping {}\".format(original_name))\n                continue\n            pointer = model\n            for m_name in name:\n                if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                    scope_names = re.split(r\"_(\\d+)\", m_name)\n                else:\n                    scope_names = [m_name]\n                if scope_names[0] == \"kernel\" or scope_names[0] == \"gamma\":\n                    pointer = getattr(pointer, \"weight\")\n                elif scope_names[0] == \"output_bias\" or scope_names[0] == \"beta\":\n                    pointer = getattr(pointer, \"bias\")\n                elif scope_names[0] == \"output_weights\":\n                    pointer = getattr(pointer, \"weight\")\n                elif 
scope_names[0] == \"squad\":\n                    pointer = getattr(pointer, \"classifier\")\n                else:\n                    pointer = getattr(pointer, scope_names[0])\n                if len(scope_names) >= 2:\n                    num = int(scope_names[1])\n                    pointer = pointer[num]\n            if m_name.endswith(\"_embeddings\"):\n                pointer = getattr(pointer, \"weight\")\n            elif m_name == \"kernel\":\n                array = np.transpose(array)\n            try:\n                assert pointer.shape == array.shape, original_name\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            print(\"Initialize PyTorch weight {}\".format(name), original_name)\n            pointer.data = torch.from_numpy(array)\n        except AttributeError as e:\n            print(\"Skipping {}\".format(original_name), name, e)\n            continue\n    return model\n\n\nclass ElectraEmbeddings(BertEmbeddings):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)\n        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = BertLayerNorm(config.embedding_size, eps=config.layer_norm_eps)\n\n\nclass ElectraDiscriminatorPredictions(nn.Module):\n    \"\"\"Prediction module for the discriminator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dense_prediction = nn.Linear(config.hidden_size, 1)\n        self.config = config\n\n    def forward(self, discriminator_hidden_states, attention_mask):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = get_activation(self.config.hidden_act)(hidden_states)\n        logits = self.dense_prediction(hidden_states).squeeze()\n\n        return logits\n\n\nclass ElectraGeneratorPredictions(nn.Module):\n    \"\"\"Prediction module for the generator, made up of two dense layers.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n\n        self.LayerNorm = BertLayerNorm(config.embedding_size)\n        self.dense = nn.Linear(config.hidden_size, config.embedding_size)\n\n    def forward(self, generator_hidden_states):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = get_activation(\"gelu\")(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass ElectraPreTrainedModel(BertPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ElectraConfig\n    load_tf_weights = load_tf_weights_in_electra\n    base_model_prefix = \"electra\"\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular 
PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Sequence of hidden-states at the output of the last layer of the encoder. 
Used in the cross-attention\n            if the model is configured as a decoder.\n        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraModel(ElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.embeddings = ElectraEmbeddings(config)\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = nn.Linear(config.embedding_size, config.hidden_size)\n\n        self.encoder = BertEncoder(config)\n        self.config = config\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import ElectraModel, ElectraTokenizer\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraModel.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        hidden_states = self.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states)\n\n        hidden_states = self.encoder(hidden_states, attention_mask=extended_attention_mask, head_mask=head_mask)\n\n        return hidden_states\n\n\nclass ElectraClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = get_activation(\"gelu\")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"ELECTRA Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForSequenceClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.electra = ElectraModel(config)\n        self.classifier = ElectraClassificationHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import BertTokenizer, BertForSequenceClassification\n        import torch\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n\n        sequence_output = discriminator_hidden_states[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + discriminator_hidden_states[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n         
       #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a binary classification head on top as used during pre-training for identifying generated\n    tokens.\n\n    It is recommended to load the discriminator checkpoint into that model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForPreTraining(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.discriminator_predictions = ElectraDiscriminatorPredictions(config)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)\n            Indices should be in ``[0, 1]``.\n            ``0`` indicates the token is an original token,\n            ``1`` indicates the token was replaced.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Total loss of the ELECTRA objective.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`)\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForPreTraining\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        prediction_scores, seq_relationship_scores = 
outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.BCEWithLogitsLoss()\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1, discriminator_sequence_output.shape[1]) == 1\n                active_logits = logits.view(-1, discriminator_sequence_output.shape[1])[active_loss]\n                active_labels = labels[active_loss]\n                loss = loss_fct(active_logits, active_labels.float())\n            else:\n                loss = loss_fct(logits.view(-1, discriminator_sequence_output.shape[1]), labels.float())\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a language modeling head on top.\n\n    Even though both the discriminator and generator may be loaded into this model, the generator is\n    the only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForMaskedLM(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.generator_predictions = ElectraGeneratorPredictions(config)\n\n        self.generator_lm_head = nn.Linear(config.embedding_size, config.vocab_size)\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape 
:obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n        Examples::\n\n            from transformers1 import ElectraTokenizer, ElectraForMaskedLM\n            import torch\n\n            tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n            model = ElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids, masked_lm_labels=input_ids)\n\n            loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        generator_sequence_output = generator_hidden_states[0]\n\n        prediction_scores = self.generator_predictions(generator_sequence_output)\n        prediction_scores = self.generator_lm_head(prediction_scores)\n\n        output = (prediction_scores,)\n\n        # Masked language modeling softmax layer\n        if masked_lm_labels is not None:\n            loss_fct = nn.CrossEntropyLoss()  # -100 index = padding token\n            loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            output = (loss,) + output\n\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\n    Electra model with a token classification head on top.\n\n    Both the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass ElectraForTokenClassification(ElectraPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.electra = ElectraModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape 
:obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ElectraTokenizer, ElectraForTokenClassification\n        import torch\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = ElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n\n        output = (logits,)\n\n        if labels is not None:\n            loss_fct = nn.CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.config.num_labels)[active_loss]\n                active_labels = labels.view(-1)[active_loss]\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))\n\n            output = (loss,) + output\n\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\nimport logging\nfrom typing import Optional\n\nfrom .configuration_encoder_decoder import EncoderDecoderConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass EncoderDecoderModel(PreTrainedModel):\n    r\"\"\"\n        :class:`~transformers1.EncoderDecoder` is a generic model class that will be\n        instantiated as a transformer architecture with one of the base model\n        classes of the library as encoder and another one as\n        decoder when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method for the encoder and `AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` class method for the decoder.\n    \"\"\"\n    config_class = EncoderDecoderConfig\n    base_model_prefix = \"encoder_decoder\"\n\n    def __init__(\n        self,\n        config: Optional[PretrainedConfig] = None,\n        encoder: Optional[PreTrainedModel] = None,\n        decoder: Optional[PreTrainedModel] = None,\n    ):\n        assert config is not None or (\n            encoder is not None and decoder is not None\n        ), \"Either a configuration or an Encoder and a decoder has to be provided\"\n        if config is None:\n            config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config)\n        else:\n            assert isinstance(config, self.config_class), \"config: {} has to be of type {}\".format(\n                config, self.config_class\n            )\n        # initialize with config\n        super().__init__(config)\n\n        if encoder is None:\n            from transformers import AutoModel\n\n            encoder = AutoModel.from_config(config.encoder)\n\n        if decoder is None:\n            from transformers import AutoModelWithLMHead\n\n            decoder = AutoModelWithLMHead.from_config(config.decoder)\n\n        self.encoder = encoder\n        self.decoder = decoder\n        assert (\n            self.encoder.get_output_embeddings() is None\n        ), \"The encoder {} should not have a LM Head. 
Please use a model without LM Head\"\n\n    def tie_weights(self):\n        # for now no weights tying in encoder-decoder\n        pass\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def get_input_embeddings(self):\n        return self.encoder.get_input_embeddings()\n\n    def get_output_embeddings(self):\n        return self.decoder.get_output_embeddings()\n\n    @classmethod\n    def from_encoder_decoder_pretrained(\n        cls,\n        encoder_pretrained_model_name_or_path: str = None,\n        decoder_pretrained_model_name_or_path: str = None,\n        *model_args,\n        **kwargs\n    ) -> PreTrainedModel:\n        r\"\"\" Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints.\n\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated).\n        To train the model, you need to first set it back in training mode with `model.train()`.\n\n        Params:\n            encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the encoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/encoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):\n                information necessary to initiate the decoder. Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/decoder``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments.\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n        Examples::\n\n            from transformers1 import EncoderDecoder\n\n            model = EncoderDecoder.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n        \"\"\"\n\n        kwargs_encoder = {\n            argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")\n        }\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        # Load and initialize the encoder and decoder\n        # The distinction between encoder and decoder at the model level is made\n        # by the value of the flag `is_decoder` that we need to set correctly.\n        encoder = kwargs_encoder.pop(\"model\", None)\n        if encoder is None:\n            assert (\n                encoder_pretrained_model_name_or_path is not None\n            ), \"If `model` is not defined as an argument, a `encoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModel\n\n            encoder = AutoModel.from_pretrained(encoder_pretrained_model_name_or_path, *model_args, **kwargs_encoder)\n        encoder.config.is_decoder = False\n\n        decoder = kwargs_decoder.pop(\"model\", None)\n        if decoder is None:\n            assert (\n                decoder_pretrained_model_name_or_path is not None\n            ), \"If `decoder_model` is not defined as an argument, a `decoder_pretrained_model_name_or_path` has to be defined\"\n            from .modeling_auto import AutoModelWithLMHead\n\n            if \"config\" not in kwargs_decoder:\n                from transformers import AutoConfig\n\n                decoder_config = AutoConfig.from_pretrained(decoder_pretrained_model_name_or_path)\n                if decoder_config.is_decoder is False:\n                    logger.info(\n                        f\"Initializing {decoder_pretrained_model_name_or_path} as a decoder model. Cross attention layers are added to {decoder_pretrained_model_name_or_path} and randomly initialized if {decoder_pretrained_model_name_or_path}'s architecture allows for cross attention layers.\"\n                    )\n                    decoder_config.is_decoder = True\n\n                kwargs_decoder[\"config\"] = decoder_config\n\n            if kwargs_decoder[\"config\"].is_decoder is False:\n                logger.warning(\n                    f\"Decoder model {decoder_pretrained_model_name_or_path} is not initialized as a decoder. 
In order to initialize {decoder_pretrained_model_name_or_path} as a decoder, make sure that the attribute `is_decoder` of `decoder_config` passed to `.from_encoder_decoder_pretrained(...)` is set to `True` or do not pass a `decoder_config` to `.from_encoder_decoder_pretrained(...)`\"\n                )\n\n            decoder = AutoModelWithLMHead.from_pretrained(decoder_pretrained_model_name_or_path, **kwargs_decoder)\n\n        return cls(encoder=encoder, decoder=decoder)\n\n    def forward(\n        self,\n        input_ids=None,\n        inputs_embeds=None,\n        attention_mask=None,\n        head_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_head_mask=None,\n        decoder_inputs_embeds=None,\n        masked_lm_labels=None,\n        lm_labels=None,\n        **kwargs,\n    ):\n\n        \"\"\"\n        Args:\n            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n                Indices of input sequence tokens in the vocabulary for the encoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Mask to avoid performing attention on padding token indices for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the encoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n                Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n                `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n                Used in the cross-attention of the decoder.\n            decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n                Provide for sequence to sequence training to the decoder.\n                Indices can be obtained using :class:`transformers1.PretrainedTokenizer`.\n                See :func:`transformers1.PreTrainedTokenizer.encode` and\n                :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for 
details.\n            decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n                Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n            decoder_head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n                Mask to nullify selected heads of the self-attention modules for the decoder.\n                Mask values selected in ``[0, 1]``:\n                ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n            decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n                Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n                This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n                than the model's internal embedding lookup matrix.\n            masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the masked language modeling loss for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the left-to-right language modeling loss (next word prediction) for the decoder.\n                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n                in ``[0, ..., config.vocab_size]``\n            kwargs: (`optional`) Remaining dictionary of keyword arguments. 
Keyword arguments come in two flavors:\n                - Without a prefix which will be input as `**encoder_kwargs` for the encoder forward function.\n                - With a `decoder_` prefix which will be input as `**decoder_kwargs` for the decoder forward function.\n\n        Examples::\n\n            from transformers1 import EncoderDecoderModel, BertTokenizer\n            import torch\n\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert\n\n            # forward\n            input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n\n            # training\n            loss, outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)[:2]\n\n            # generation\n            generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)\n\n        \"\"\"\n\n        kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith(\"decoder_\")}\n\n        kwargs_decoder = {\n            argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")\n        }\n\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids,\n                attention_mask=attention_mask,\n                inputs_embeds=inputs_embeds,\n                head_mask=head_mask,\n                **kwargs_encoder,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            inputs_embeds=decoder_inputs_embeds,\n            attention_mask=decoder_attention_mask,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=decoder_head_mask,\n            lm_labels=lm_labels,\n            masked_lm_labels=masked_lm_labels,\n            **kwargs_decoder,\n        )\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if type(past) is tuple:\n            encoder_outputs = past\n        else:\n            encoder_outputs = (past,)\n\n        decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids)\n\n        return {\n            \"attention_mask\": attention_mask,\n            \"decoder_attention_mask\": decoder_inputs[\"attention_mask\"],\n            \"decoder_input_ids\": decoder_inputs[\"input_ids\"],\n            \"encoder_outputs\": encoder_outputs,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # as a default encoder-decoder models do not re-order the past.\n        # TODO(PVP): might have to be updated, e.g. if GPT2 is to be used as a decoder\n        return past\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Flaubert model, based on XLM. \"\"\"\n\n\nimport logging\nimport random\n\nimport torch\nfrom torch.nn import functional as F\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_xlm import (\n    XLMForQuestionAnswering,\n    XLMForQuestionAnsweringSimple,\n    XLMForSequenceClassification,\n    XLMModel,\n    XLMWithLMHeadModel,\n    get_masks,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nFLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"flaubert/flaubert_small_cased\",\n    \"flaubert/flaubert_base_uncased\",\n    \"flaubert/flaubert_base_cased\",\n    \"flaubert/flaubert_large_cased\",\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertModel(XLMModel):\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    @add_start_docstrings_to_callable(FLAUBERT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            
of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import FlaubertTokenizer, FlaubertModel\n        import torch\n\n        tokenizer = FlaubertTokenizer.from_pretrained('flaubert-base-cased')\n        model = FlaubertModel.from_pretrained('flaubert-base-cased')\n        input_ids = torch.tensor(tokenizer.encode(\"Le chat mange une pomme.\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # removed: src_enc=None, src_len=None\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + 
self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.config.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if self.training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](tensor_normalized, attn_mask, cache=cache, head_mask=head_mask[i])\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = F.dropout(attn, p=self.dropout, training=self.training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertWithLMHeadModel(XLMWithLMHeadModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMWithLMHeadModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForSequenceClassification(XLMForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnsweringSimple`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass FlaubertForQuestionAnswering(XLMForQuestionAnswering):\n    \"\"\"\n    This class overrides :class:`~transformers1.XLMForQuestionAnswering`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = FlaubertConfig\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = FlaubertModel(config)\n        self.init_weights()\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT-2 model.\"\"\"\n\n\nimport logging\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import ACT2FN\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nGPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import re\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(gpt2_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    arrays = []\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        arrays.append(array.squeeze())\n\n    for name, array in zip(names, arrays):\n        name = name[6:]  # skip \"model/\"\n        name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"w\" or scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"wpe\" or scope_names[0] == \"wte\":\n                pointer = getattr(pointer, scope_names[0])\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        
pointer.data = torch.from_numpy(array)\n    return model\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\n            \"bias\", torch.tril(torch.ones((n_ctx, n_ctx), dtype=torch.uint8)).view(1, 1, n_ctx, n_ctx)\n        )\n        self.register_buffer(\"masked_bias\", torch.tensor(-1e4))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads  # Convert to set and emove already pruned heads\n        for head in heads:\n            # Compute how many pruned heads are before the head and move the index accordingly\n            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / (float(v.size(-1)) ** 0.5)\n        nd, ns = w.size(-2), w.size(-1)\n        mask = self.bias[:, :, ns - nd : ns, :ns]\n        w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)  # (batch, head, head_features, seq_length)\n        else:\n            return x.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)\n\n    def forward(self, x, layer_past=None, attention_mask=None, 
head_mask=None, use_cache=False):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = layer_past[0].transpose(-2, -1), layer_past[1]  # transpose back cf below\n            key = torch.cat((past_key, key), dim=-1)\n            value = torch.cat((past_value, value), dim=-2)\n\n        if use_cache is True:\n            present = torch.stack((key.transpose(-2, -1), value))  # transpose to have same shapes for stacking\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT2FN[config.activation_function]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n\n    def forward(self, x, layer_past=None, attention_mask=None, head_mask=None, use_cache=False):\n        output_attn = self.attn(\n            self.ln_1(x),\n            layer_past=layer_past,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n\n        x = x + a\n        m = self.mlp(self.ln_2(x))\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\nclass GPT2PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    load_tf_weights = load_tf_weights_in_gpt2\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nGPT2_START_DOCSTRING = 
r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The `input_ids` which have their past given to this model should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`, defaults to :obj:`None`):\n            `input_ids_length` = `sequence_length if `past` is None else 1\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            If `past` is used, optionally only the last `inputs_embeds` have to be input (see `past`).\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and can be used to speed up decoding (see `past`). Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2Model(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.wte = nn.Embedding(config.vocab_size, config.n_embd)\n        self.wpe = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n        self.ln_f = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def set_input_embeddings(self, new_embeddings):\n        self.wte = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            If `past` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when 
``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import GPT2Tokenizer, GPT2Model\n        import torch\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2Model.from_pretrained('gpt2')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n            batch_size = input_ids.shape[0]\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n            batch_size = inputs_embeds.shape[0]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, input_shape[-1])\n        if position_ids is not None:\n            position_ids = position_ids.view(-1, input_shape[-1])\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = past[0][0].size(-2)\n        if position_ids is None:\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            assert batch_size > 0, \"batch_size has to be defined and > 0\"\n            attention_mask = attention_mask.view(batch_size, -1)\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the 
raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids)\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_embeds = self.wte(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n            outputs = block(\n                hidden_states,\n                layer_past=layer_past,\n                attention_mask=attention_mask,\n                head_mask=head_mask[i],\n                use_cache=use_cache,\n            )\n\n            hidden_states, present = outputs[:2]\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = hidden_states.view(*output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]\n            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (presents), (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2LMHeadModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            input_ids = input_ids[:, -1].unsqueeze(-1)\n\n        return {\"input_ids\": input_ids, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = 
torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass GPT2DoubleHeadsModel(GPT2PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        config.num_labels = 1\n        self.transformer = GPT2Model(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. 
you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import GPT2Tokenizer, GPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = GPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = 
[tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2\n        mc_token_ids = torch.tensor([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            past=past,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch Longformer model. \"\"\"\n\nimport logging\nimport math\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .configuration_longformer import LongformerConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertPreTrainedModel\nfrom .modeling_roberta import RobertaLMHead, RobertaModel\n\n\nlogger = logging.getLogger(__name__)\n\nLONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n    # See all Longformer models at https://huggingface.co/models?filter=longformer\n]\n\n\ndef _get_question_end_index(input_ids, sep_token_id):\n    \"\"\"\n        Computes the index of the first occurance of `sep_token_id`.\n    \"\"\"\n\n    sep_token_indices = (input_ids == sep_token_id).nonzero()\n    batch_size = input_ids.shape[0]\n\n    assert sep_token_indices.shape[1] == 2, \"`input_ids` should have two dimensions\"\n    assert (\n        sep_token_indices.shape[0] == 3 * batch_size\n    ), f\"There should be exactly three separator tokens: {sep_token_id} in every sample for questions answering. 
You might also consider to set `global_attention_mask` manually in the forward function to avoid this error.\"\n\n    return sep_token_indices.view(batch_size, 3, 2)[:, 0, 1]\n\n\ndef _compute_global_attention_mask(input_ids, sep_token_id, before_sep_token=True):\n    \"\"\"\n        Computes global attention mask by putting attention on all tokens\n        before `sep_token_id` if `before_sep_token is True` else after\n        `sep_token_id`.\n    \"\"\"\n\n    question_end_index = _get_question_end_index(input_ids, sep_token_id)\n    question_end_index = question_end_index.unsqueeze(dim=1)  # size: batch_size x 1\n    # bool attention mask with True in locations of global attention\n    attention_mask = torch.arange(input_ids.shape[1], device=input_ids.device)\n    if before_sep_token is True:\n        attention_mask = (attention_mask.expand_as(input_ids) < question_end_index).to(torch.uint8)\n    else:\n        # last token is separation token and should not be counted and in the middle are two separation tokens\n        attention_mask = (attention_mask.expand_as(input_ids) > (question_end_index + 1)).to(torch.uint8) * (\n            attention_mask.expand_as(input_ids) < input_ids.shape[-1]\n        ).to(torch.uint8)\n\n    return attention_mask\n\n\nclass LongformerSelfAttention(nn.Module):\n    def __init__(self, config, layer_id):\n        super().__init__()\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n        self.num_heads = config.num_attention_heads\n        self.head_dim = int(config.hidden_size / config.num_attention_heads)\n        self.embed_dim = config.hidden_size\n\n        self.query = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value = nn.Linear(config.hidden_size, self.embed_dim)\n\n        # separate projection layers for tokens with global attention\n        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)\n        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)\n\n        self.dropout = config.attention_probs_dropout_prob\n\n        self.layer_id = layer_id\n        attention_window = config.attention_window[self.layer_id]\n        assert (\n            attention_window % 2 == 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}\"\n        assert (\n            attention_window > 0\n        ), f\"`attention_window` for layer {self.layer_id} has to be positive. 
Given {attention_window}\"\n\n        self.one_sided_attention_window_size = attention_window // 2\n\n    @staticmethod\n    def _skew(x, direction):\n        \"\"\"Convert diagonals into columns (or columns into diagonals depending on `direction`\"\"\"\n        x_padded = F.pad(x, direction)  # padding value is not important because it will be overwritten\n        x_padded = x_padded.view(*x_padded.size()[:-2], x_padded.size(-1), x_padded.size(-2))\n        return x_padded\n\n    @staticmethod\n    def _skew2(x):\n        \"\"\"shift every row 1 step to right converting columns into diagonals\"\"\"\n        # X = B x C x M x L\n        B, C, M, L = x.size()\n        x = F.pad(x, (0, M + 1))  # B x C x M x (L+M+1). Padding value is not important because it'll be overwritten\n        x = x.view(B, C, -1)  # B x C x ML+MM+M\n        x = x[:, :, :-M]  # B x C x ML+MM\n        x = x.view(B, C, M, M + L)  # B x C, M x L+M\n        x = x[:, :, :, :-1]\n        return x\n\n    @staticmethod\n    def _chunk(x, w):\n        \"\"\"convert into overlapping chunkings. Chunk size = 2w, overlap size = w\"\"\"\n\n        # non-overlapping chunks of size = 2w\n        x = x.view(x.size(0), x.size(1) // (w * 2), w * 2, x.size(2))\n\n        # use `as_strided` to make the chunks overlap with an overlap size = w\n        chunk_size = list(x.size())\n        chunk_size[1] = chunk_size[1] * 2 - 1\n\n        chunk_stride = list(x.stride())\n        chunk_stride[1] = chunk_stride[1] // 2\n        return x.as_strided(size=chunk_size, stride=chunk_stride)\n\n    def _mask_invalid_locations(self, input_tensor, w) -> torch.Tensor:\n        affected_seqlen = w\n        beginning_mask_2d = input_tensor.new_ones(w, w + 1).tril().flip(dims=[0])\n        beginning_mask = beginning_mask_2d[None, :, None, :]\n        ending_mask = beginning_mask.flip(dims=(1, 3))\n        seqlen = input_tensor.size(1)\n        beginning_input = input_tensor[:, :affected_seqlen, :, : w + 1]\n        beginning_mask = beginning_mask[:, :seqlen].expand(beginning_input.size())\n        beginning_input.masked_fill_(beginning_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n        ending_input = input_tensor[:, -affected_seqlen:, :, -(w + 1) :]\n        ending_mask = ending_mask[:, -seqlen:].expand(ending_input.size())\n        ending_input.masked_fill_(ending_mask == 1, -float(\"inf\"))  # `== 1` converts to bool or uint8\n\n    def _sliding_chunks_matmul_qk(self, q: torch.Tensor, k: torch.Tensor, w: int):\n        \"\"\"Matrix multiplicatio of query x key tensors using with a sliding window attention pattern.\n        This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)\n        with an overlap of size w\"\"\"\n        batch_size, seqlen, num_heads, head_dim = q.size()\n        assert seqlen % (w * 2) == 0, f\"Sequence length should be multiple of {w * 2}. 
Given {seqlen}\"\n        assert q.size() == k.size()\n\n        chunks_count = seqlen // w - 1\n\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2\n        q = q.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n        k = k.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        chunk_q = self._chunk(q, w)\n        chunk_k = self._chunk(k, w)\n\n        # matrix multipication\n        # bcxd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcyd: batch_size * num_heads x chunks x 2w x head_dim\n        # bcxy: batch_size * num_heads x chunks x 2w x 2w\n        chunk_attn = torch.einsum(\"bcxd,bcyd->bcxy\", (chunk_q, chunk_k))  # multiply\n\n        # convert diagonals into columns\n        diagonal_chunk_attn = self._skew(chunk_attn, direction=(0, 0, 0, 1))\n\n        # allocate space for the overall attention matrix where the chunks are compined. The last dimension\n        # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to\n        # w previous words). The following column is attention score from each word to itself, then\n        # followed by w columns for the upper triangle.\n\n        diagonal_attn = diagonal_chunk_attn.new_empty((batch_size * num_heads, chunks_count + 1, w, w * 2 + 1))\n\n        # copy parts from diagonal_chunk_attn into the compined matrix of attentions\n        # - copying the main diagonal and the upper triangle\n        diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, : w + 1]\n        diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, : w + 1]\n        # - copying the lower triangle\n        diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, -(w + 1) : -1, w + 1 :]\n        diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, : w - 1, 1 - w :]\n\n        # separate batch_size and num_heads dimensions again\n        diagonal_attn = diagonal_attn.view(batch_size, num_heads, seqlen, 2 * w + 1).transpose(2, 1)\n\n        self._mask_invalid_locations(diagonal_attn, w)\n        return diagonal_attn\n\n    def _sliding_chunks_matmul_pv(self, prob: torch.Tensor, v: torch.Tensor, w: int):\n        \"\"\"Same as _sliding_chunks_matmul_qk but for prob and value tensors. 
It is expecting the same output\n        format from _sliding_chunks_matmul_qk\"\"\"\n        batch_size, seqlen, num_heads, head_dim = v.size()\n        assert seqlen % (w * 2) == 0\n        assert prob.size()[:3] == v.size()[:3]\n        assert prob.size(3) == 2 * w + 1\n        chunks_count = seqlen // w - 1\n        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size 2w\n        chunk_prob = prob.transpose(1, 2).reshape(batch_size * num_heads, seqlen // w, w, 2 * w + 1)\n\n        # group batch_size and num_heads dimensions into one\n        v = v.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)\n\n        # pad seqlen with w at the beginning of the sequence and another w at the end\n        padded_v = F.pad(v, (0, 0, w, w), value=-1)\n\n        # chunk padded_v into chunks of size 3w and an overlap of size w\n        chunk_v_size = (batch_size * num_heads, chunks_count + 1, 3 * w, head_dim)\n        chunk_v_stride = padded_v.stride()\n        chunk_v_stride = chunk_v_stride[0], w * chunk_v_stride[1], chunk_v_stride[1], chunk_v_stride[2]\n        chunk_v = padded_v.as_strided(size=chunk_v_size, stride=chunk_v_stride)\n\n        skewed_prob = self._skew2(chunk_prob)\n\n        context = torch.einsum(\"bcwd,bcdh->bcwh\", (skewed_prob, chunk_v))\n        return context.view(batch_size, num_heads, seqlen, head_dim).transpose(1, 2)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n        \"\"\"\n        LongformerSelfAttention expects `len(hidden_states)` to be multiple of `attention_window`.\n        Padding to `attention_window` happens in LongformerModel.forward to avoid redoing the padding on each layer.\n\n        The `attention_mask` is changed in `BertModel.forward` from 0, 1, 2 to\n            -ve: no attention\n              0: local attention\n            +ve: global attention\n\n        `encoder_hidden_states` and `encoder_attention_mask` are not supported and should be None\n        \"\"\"\n        # TODO: add support for `encoder_hidden_states` and `encoder_attention_mask`\n        assert encoder_hidden_states is None, \"`encoder_hidden_states` is not supported and should be None\"\n        assert encoder_attention_mask is None, \"`encoder_attention_mask` is not supported and shiould be None\"\n\n        if attention_mask is not None:\n            attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)\n            key_padding_mask = attention_mask < 0\n            extra_attention_mask = attention_mask > 0\n            remove_from_windowed_attention_mask = attention_mask != 0\n\n            num_extra_indices_per_batch = extra_attention_mask.long().sum(dim=1)\n            max_num_extra_indices_per_batch = num_extra_indices_per_batch.max()\n            if max_num_extra_indices_per_batch <= 0:\n                extra_attention_mask = None\n            else:\n                # To support the case of variable number of global attention in the rows of a batch,\n                # we use the following three selection masks to select global attention embeddings\n                # in a 3d tensor and pad it to `max_num_extra_indices_per_batch`\n                # 1) selecting embeddings that correspond to global attention\n                extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)\n                zero_to_max_range = torch.arange(\n                   
 0, max_num_extra_indices_per_batch, device=num_extra_indices_per_batch.device\n                )\n                # mask indicating which values are actually going to be padding\n                selection_padding_mask = zero_to_max_range < num_extra_indices_per_batch.unsqueeze(dim=-1)\n                # 2) location of the non-padding values in the selected global attention\n                selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)\n                # 3) location of the padding values in the selected global attention\n                selection_padding_mask_zeros = (selection_padding_mask == 0).nonzero(as_tuple=True)\n        else:\n            remove_from_windowed_attention_mask = None\n            extra_attention_mask = None\n            key_padding_mask = None\n\n        hidden_states = hidden_states.transpose(0, 1)\n        seqlen, batch_size, embed_dim = hidden_states.size()\n        assert embed_dim == self.embed_dim\n        q = self.query(hidden_states)\n        k = self.key(hidden_states)\n        v = self.value(hidden_states)\n        q /= math.sqrt(self.head_dim)\n\n        q = q.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        k = k.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        # attn_weights = (batch_size, seqlen, num_heads, window*2+1)\n        attn_weights = self._sliding_chunks_matmul_qk(q, k, self.one_sided_attention_window_size)\n        self._mask_invalid_locations(attn_weights, self.one_sided_attention_window_size)\n        if remove_from_windowed_attention_mask is not None:\n            # This implementation is fast and takes very little memory because num_heads x hidden_size = 1\n            # from (batch_size x seqlen) to (batch_size x seqlen x num_heads x hidden_size)\n            remove_from_windowed_attention_mask = remove_from_windowed_attention_mask.unsqueeze(dim=-1).unsqueeze(\n                dim=-1\n            )\n            # cast to fp32/fp16 then replace 1's with -inf\n            float_mask = remove_from_windowed_attention_mask.type_as(q).masked_fill(\n                remove_from_windowed_attention_mask, -10000.0\n            )\n            ones = float_mask.new_ones(size=float_mask.size())  # tensor of ones\n            # diagonal mask with zeros everywhere and -inf inplace of padding\n            d_mask = self._sliding_chunks_matmul_qk(ones, float_mask, self.one_sided_attention_window_size)\n            attn_weights += d_mask\n        assert list(attn_weights.size()) == [\n            batch_size,\n            seqlen,\n            self.num_heads,\n            self.one_sided_attention_window_size * 2 + 1,\n        ]\n\n        # the extra attention\n        if extra_attention_mask is not None:\n            selected_k = k.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_k[selection_padding_mask_nonzeros] = k[extra_attention_mask_nonzeros]\n            # (batch_size, seqlen, num_heads, max_num_extra_indices_per_batch)\n            selected_attn_weights = torch.einsum(\"blhd,bshd->blhs\", (q, selected_k))\n            selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000\n            # concat to attn_weights\n            # (batch_size, seqlen, num_heads, extra attention count + 2*window+1)\n            attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)\n\n        attn_weights_fp32 = F.softmax(attn_weights, dim=-1, 
dtype=torch.float32)  # use fp32 for numerical stability\n        attn_weights = attn_weights_fp32.type_as(attn_weights)\n\n        if key_padding_mask is not None:\n            # softmax sometimes inserts NaN if all positions are masked, replace them with 0\n            attn_weights = torch.masked_fill(attn_weights, key_padding_mask.unsqueeze(-1).unsqueeze(-1), 0.0)\n\n        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)\n        v = v.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)\n        attn = None\n        if extra_attention_mask is not None:\n            selected_attn_probs = attn_probs.narrow(-1, 0, max_num_extra_indices_per_batch)\n            selected_v = v.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)\n            selected_v[selection_padding_mask_nonzeros] = v[extra_attention_mask_nonzeros]\n            # use `matmul` because `einsum` crashes sometimes with fp16\n            # attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v))\n            attn = torch.matmul(selected_attn_probs.transpose(1, 2), selected_v.transpose(1, 2)).transpose(1, 2)\n            attn_probs = attn_probs.narrow(\n                -1, max_num_extra_indices_per_batch, attn_probs.size(-1) - max_num_extra_indices_per_batch\n            ).contiguous()\n        if attn is None:\n            attn = self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n        else:\n            attn += self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)\n\n        assert attn.size() == (batch_size, seqlen, self.num_heads, self.head_dim), \"Unexpected size\"\n        attn = attn.transpose(0, 1).reshape(seqlen, batch_size, embed_dim).contiguous()\n\n        # For this case, we'll just recompute the attention for these indices\n        # and overwrite the attn tensor.\n        # TODO: remove the redundant computation\n        if extra_attention_mask is not None:\n            selected_hidden_states = hidden_states.new_zeros(max_num_extra_indices_per_batch, batch_size, embed_dim)\n            selected_hidden_states[selection_padding_mask_nonzeros[::-1]] = hidden_states[\n                extra_attention_mask_nonzeros[::-1]\n            ]\n\n            q = self.query_global(selected_hidden_states)\n            k = self.key_global(hidden_states)\n            v = self.value_global(hidden_states)\n            q /= math.sqrt(self.head_dim)\n\n            q = (\n                q.contiguous()\n                .view(max_num_extra_indices_per_batch, batch_size * self.num_heads, self.head_dim)\n                .transpose(0, 1)\n            )  # (batch_size * self.num_heads, max_num_extra_indices_per_batch, head_dim)\n            k = (\n                k.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            v = (\n                v.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)\n            )  # batch_size * self.num_heads, seqlen, head_dim)\n            attn_weights = torch.bmm(q, k.transpose(1, 2))\n            assert list(attn_weights.size()) == [batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen]\n\n            attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights[selection_padding_mask_zeros[0], :, selection_padding_mask_zeros[1], :] 
= -10000.0\n            if key_padding_mask is not None:\n                attn_weights = attn_weights.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -10000.0,)\n            attn_weights = attn_weights.view(batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            attn_weights_float = F.softmax(\n                attn_weights, dim=-1, dtype=torch.float32\n            )  # use fp32 for numerical stability\n            attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)\n            selected_attn = torch.bmm(attn_probs, v)\n            assert list(selected_attn.size()) == [\n                batch_size * self.num_heads,\n                max_num_extra_indices_per_batch,\n                self.head_dim,\n            ]\n\n            selected_attn_4d = selected_attn.view(\n                batch_size, self.num_heads, max_num_extra_indices_per_batch, self.head_dim\n            )\n            nonzero_selected_attn = selected_attn_4d[\n                selection_padding_mask_nonzeros[0], :, selection_padding_mask_nonzeros[1]\n            ]\n            attn[extra_attention_mask_nonzeros[::-1]] = nonzero_selected_attn.view(\n                len(selection_padding_mask_nonzeros[0]), -1\n            )\n\n        context_layer = attn.transpose(0, 1)\n        if self.output_attentions:\n            if extra_attention_mask is not None:\n                # With global attention, return global attention probabilities only\n                # batch_size x num_heads x max_num_global_attention_tokens x sequence_length\n                # which is the attention weights from tokens with global attention to all tokens\n                # It doesn't not return local attention\n                # In case of variable number of global attantion in the rows of a batch,\n                # attn_weights are padded with -10000.0 attention scores\n                attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)\n            else:\n                # without global attention, return local attention probabilities\n                # batch_size x num_heads x sequence_length x window_size\n                # which is the attention weights of every token attending to its neighbours\n                attn_weights = attn_weights.permute(0, 2, 1, 3)\n        outputs = (context_layer, attn_weights) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nLONGFORMER_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.LongformerConfig`): Model configuration class with all the parameters of the\n            model. 
Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nLONGFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.LongformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n\n        global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to decide the attention given on each token, local attention or global attention.\n            Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for\n            task-specific finetuning because it makes the model more flexible at representing the task. For example,\n            for classification, the <s> token should be given global attention. For QA, all question tokens should also have\n            global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details.\n            Mask values selected in ``[0, 1]``:\n            ``0`` for local attention (a sliding window attention),\n            ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).\n\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Longformer Model outputting raw hidden-states without any specific head on top.\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel` to provide the ability to process\n    long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer`_by\n    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention combines a local (sliding window)\n    and global attention to extend to long documents without the O(n^2) increase in memory and compute.\n\n    The selfattention module `LongformerSelfAttention` implemented here supports the combination of local and\n    global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive\n    and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream\n    tasks. Future release will add support for autoregressive attention, but the support for dilated attention\n    requires a custom CUDA kernel to be memory and compute efficient.\n\n    .. _`Longformer: the Long-Document Transformer`:\n        https://arxiv.org/abs/2004.05150\n\n    \"\"\"\n\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        if isinstance(config.attention_window, int):\n            assert config.attention_window % 2 == 0, \"`config.attention_window` has to be an even value\"\n            assert config.attention_window > 0, \"`config.attention_window` has to be positive\"\n            config.attention_window = [config.attention_window] * config.num_hidden_layers  # one value per layer\n        else:\n            assert len(config.attention_window) == config.num_hidden_layers, (\n                \"`len(config.attention_window)` should equal `config.num_hidden_layers`. \"\n                f\"Expected {config.num_hidden_layers}, given {len(config.attention_window)}\"\n            )\n\n        for i, layer in enumerate(self.encoder.layer):\n            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`\n            layer.attention.self = LongformerSelfAttention(config, layer_id=i)\n\n        self.init_weights()\n\n    def _pad_to_window_size(\n        self,\n        input_ids: torch.Tensor,\n        attention_mask: torch.Tensor,\n        token_type_ids: torch.Tensor,\n        position_ids: torch.Tensor,\n        inputs_embeds: torch.Tensor,\n        attention_window: int,\n        pad_token_id: int,\n    ):\n        \"\"\"A helper function to pad tokens and mask to work with implementation of Longformer selfattention.\"\"\"\n\n        assert attention_window % 2 == 0, f\"`attention_window` should be an even value. 
Given {attention_window}\"\n        input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape\n        batch_size, seqlen = input_shape[:2]\n\n        padding_len = (attention_window - seqlen % attention_window) % attention_window\n        if padding_len > 0:\n            logger.info(\n                \"Input ids are automatically padded from {} to {} to be a multiple of `config.attention_window`: {}\".format(\n                    seqlen, seqlen + padding_len, attention_window\n                )\n            )\n            if input_ids is not None:\n                input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)\n            if attention_mask is not None:\n                attention_mask = F.pad(\n                    attention_mask, (0, padding_len), value=False\n                )  # no attention on the padding tokens\n            if token_type_ids is not None:\n                token_type_ids = F.pad(token_type_ids, (0, padding_len), value=0)  # pad with token_type_id = 0\n            if position_ids is not None:\n                # pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings\n                position_ids = F.pad(position_ids, (0, padding_len), value=pad_token_id)\n            if inputs_embeds is not None:\n                input_ids_padding = inputs_embeds.new_full(\n                    (batch_size, padding_len), self.config.pad_token_id, dtype=torch.long,\n                )\n                inputs_embeds_padding = self.embeddings(input_ids_padding)\n                inputs_embeds = torch.cat([inputs_embeds, inputs_embeds_padding], dim=-2)\n\n        return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        
import torch\n        from transformers1 import LongformerModel, LongformerTokenizer\n\n        model = LongformerModel.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention\n        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention\n        attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,\n                                            # classification: the <s> token\n                                            # QA: question tokens\n                                            # LM: potentially on the beginning of sentences and paragraphs\n        sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)\n        \"\"\"\n\n        # padding\n        attention_window = (\n            self.config.attention_window\n            if isinstance(self.config.attention_window, int)\n            else max(self.config.attention_window)\n        )\n\n        # merge `global_attention_mask` and `attention_mask`\n        if global_attention_mask is not None:\n            # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)\n            # (global_attention_mask + 1) => 1 for local attention, 2 for global attention\n            # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention\n            if attention_mask is not None:\n                attention_mask = attention_mask * (global_attention_mask + 1)\n            else:\n                # simply use `global_attention_mask` as `attention_mask`\n                # if no `attention_mask` is given\n                attention_mask = global_attention_mask + 1\n\n        padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            attention_window=attention_window,\n            pad_token_id=self.config.pad_token_id,\n        )\n\n        # embed\n        output = super().forward(\n            input_ids=input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=None,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=None,\n            encoder_attention_mask=None,\n        )\n\n        # undo padding\n        if padding_len > 0:\n            # `output` has the following tensors: sequence_output, pooled_output, (hidden_states), (attentions)\n            # `sequence_output`: unpad because the calling function is expecting a length == input_ids.size(1)\n            # `pooled_output`: independent of the sequence length\n            # `hidden_states`: mainly used for debugging and analysis, so keep the padding\n            # `attentions`: mainly used for debugging and analysis, so keep the padding\n            output = output[0][:, :-padding_len], *output[1:]\n\n        return 
output\n\n\n@add_start_docstrings(\"\"\"Longformer Model with a `language modeling` head on top. \"\"\", LONGFORMER_START_DOCSTRING)\nclass LongformerForMaskedLM(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import torch\n        from transformers1 import LongformerForMaskedLM, LongformerTokenizer\n\n        model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n\n        SAMPLE_TEXT = ' '.join(['Hello world! 
'] * 1000)  # long input document\n        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1\n\n        attention_mask = None  # default is local attention everywhere, which is a good choice for MaskedLM\n                               # check ``LongformerModel.forward`` for more details how to set `attention_mask`\n        loss, prediction_scores = model(input_ids, attention_mask=attention_mask, masked_lm_labels=input_ids)\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForSequenceClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.classifier = LongformerClassificationHead(config)\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of 
each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForSequenceClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on CLS token...\")\n            global_attention_mask = torch.zeros_like(input_ids)\n            # global attention on cls token\n            global_attention_mask[:, 0] = 1\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\nclass LongformerClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, hidden_states, **kwargs):\n        hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])\n        hidden_states = self.dropout(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        hidden_states = torch.tanh(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n        output = self.out_proj(hidden_states)\n        return output\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForQuestionAnswering(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForQuestionAnswering\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n        model = 
LongformerForQuestionAnswering.from_pretrained(\"allenai/longformer-large-4096-finetuned-triviaqa\")\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text, return_tensors=\"pt\")\n        input_ids = encoding[\"input_ids\"]\n\n        # default is local attention everywhere\n        # the forward method will automatically set global attention on question tokens\n        attention_mask = encoding[\"attention_mask\"]\n\n        start_scores, end_scores = model(input_ids, attention_mask=attention_mask)\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())\n\n        answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]\n        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens)) # remove space prepending space token\n\n        \"\"\"\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on question tokens...\")\n            # put global attention on all tokens until `config.sep_token_id` is reached\n            global_attention_mask = _compute_global_attention_mask(input_ids, self.config.sep_token_id)\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForTokenClassification(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.LongformerConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForTokenClassification\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForTokenClassification.from_pretrained('allenai/longformer-base-4096')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.longformer(\n            input_ids,\n            attention_mask=attention_mask,\n            global_attention_mask=global_attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = 
self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Longformer Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    LONGFORMER_START_DOCSTRING,\n)\nclass LongformerForMultipleChoice(BertPreTrainedModel):\n    config_class = LongformerConfig\n    base_model_prefix = \"longformer\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.longformer = LongformerModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        global_attention_mask=None,\n        labels=None,\n        position_ids=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import LongformerTokenizer, LongformerForMultipleChoice\n        import torch\n\n        tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')\n        model = LongformerForMultipleChoice.from_pretrained('allenai/longformer-base-4096')\n        # context = \"The dog is cute\" | choice = \"the dog\" / \"the cat\"\n        choices = [(\"The dog is cute\", \"the dog\"), (\"The dog is cute\", \"the cat\")]\n        input_ids = torch.tensor([tokenizer.encode(s[0], s[1], add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        # global attention is automatically put on \"the dog\" and \"the cat\"\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        # set global attention on question tokens\n        if global_attention_mask is None:\n            logger.info(\"Initializing global attention on multiple choice...\")\n            # put global attention on all tokens after `config.sep_token_id`\n            global_attention_mask = torch.stack(\n                [\n                    _compute_global_attention_mask(input_ids[:, i], self.config.sep_token_id, before_sep_token=False)\n                    for i in range(num_choices)\n                ],\n                dim=1,\n            )\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_global_attention_mask = (\n            global_attention_mask.view(-1, global_attention_mask.size(-1))\n            if global_attention_mask is not None\n            else None\n        )\n\n        outputs = self.longformer(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            global_attention_mask=flat_global_attention_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + 
outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_marian.py",
    "content": "# coding=utf-8\n# Copyright 2020 Marian Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MarianMTModel model, ported from the Marian C++ repo.\"\"\"\n\n\nfrom .modeling_bart import BartForConditionalGeneration\n\n\nMARIAN_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Marian models at https://huggingface.co/models?search=Helsinki-NLP\n]\n\n\nclass MarianMTModel(BartForConditionalGeneration):\n    r\"\"\"\n    Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.\n    Model API is identical to BartForConditionalGeneration.\n    Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__\n\n    Examples::\n\n        from transformers1 import MarianTokenizer, MarianMTModel\n        from typing import List\n        src = 'fr'  # source language\n        trg = 'en'  # target language\n        sample_text = \"où est l'arrêt de bus ?\"\n        mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'\n\n        model = MarianMTModel.from_pretrained(mname)\n        tok = MarianTokenizer.from_pretrained(mname)\n        batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference\n        gen = model.generate(**batch)  # for forward pass: model(**batch)\n        words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns \"Where is the the bus stop ?\"\n\n    \"\"\"\n\n    def prepare_logits_for_generation(self, logits, cur_len, max_length):\n        logits[:, self.config.pad_token_id] = float(\"-inf\")\n        if cur_len == max_length - 1 and self.config.eos_token_id is not None:\n            self._force_token_ids_generation(logits, self.config.eos_token_id)\n        return logits\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_mmbt.py",
    "content": "# coding=utf-8\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Copyright (c) HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch MMBT model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .file_utils import add_start_docstrings\nfrom .modeling_utils import ModuleUtilsMixin\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass ModalEmbeddings(nn.Module):\n    \"\"\"Generic Modal Embeddings which takes in an encoder, and a transformer embedding.\n    \"\"\"\n\n    def __init__(self, config, encoder, embeddings):\n        super().__init__()\n        self.config = config\n        self.encoder = encoder\n        self.proj_embeddings = nn.Linear(config.modal_hidden_size, config.hidden_size)\n        self.position_embeddings = embeddings.position_embeddings\n        self.token_type_embeddings = embeddings.token_type_embeddings\n        self.word_embeddings = embeddings.word_embeddings\n        self.LayerNorm = embeddings.LayerNorm\n        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)\n\n    def forward(self, input_modal, start_token=None, end_token=None, position_ids=None, token_type_ids=None):\n        token_embeddings = self.proj_embeddings(self.encoder(input_modal))\n        seq_length = token_embeddings.size(1)\n\n        if start_token is not None:\n            start_token_embeds = self.word_embeddings(start_token)\n            seq_length += 1\n            token_embeddings = torch.cat([start_token_embeds.unsqueeze(1), token_embeddings], dim=1)\n\n        if end_token is not None:\n            end_token_embeds = self.word_embeddings(end_token)\n            seq_length += 1\n            token_embeddings = torch.cat([token_embeddings, end_token_embeds.unsqueeze(1)], dim=1)\n\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_modal.device)\n            position_ids = position_ids.unsqueeze(0).expand(input_modal.size(0), seq_length)\n\n        if token_type_ids is None:\n            token_type_ids = torch.zeros(\n                (input_modal.size(0), seq_length), dtype=torch.long, device=input_modal.device\n            )\n\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n        embeddings = token_embeddings + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings)\n        return embeddings\n\n\nMMBT_START_DOCSTRING = r\"\"\"    MMBT model was proposed in\n    `Supervised Multimodal Bitransformers for Classifying Images and Text`_\n    by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.\n    It's a supervised multimodal bitransformer model that fuses information from text and other image encoders,\n    and obtain state-of-the-art performance on various multimodal classification benchmark 
tasks.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Supervised Multimodal Bitransformers for Classifying Images and Text`:\n        https://github.com/facebookresearch/mmbt\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.MMBTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n        transformer (:class: `~nn.Module`): A text transformer that is used by MMBT.\n            It should have embeddings, encoder, and pooler attributes.\n        encoder (:class: `~nn.Module`): Encoder for the second modality.\n            It should take in a batch of modal inputs and return k, n dimension embeddings.\n\"\"\"\n\nMMBT_INPUTS_DOCSTRING = r\"\"\"    Inputs:\n        **input_modal**: ``torch.FloatTensor`` of shape ``(batch_size, ***)``:\n            The other modality data. It will be the shape that the encoder for that type expects.\n            e.g. With an Image Encoder, the shape would be (batch_size, channels, height, width)\n        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of input sequence tokens in the vocabulary.\n            It does not expect [CLS] token to be added as it's appended to the end of other modality embeddings.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        **modal_start_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional start token to be added to Other Modality Embedding. [CLS] Most commonly used for Classification tasks.\n        **modal_end_tokens**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n            Optional end token to be added to Other Modality Embedding. 
[SEP] Most commonly used.\n        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Segment token indices to indicate different portions of the inputs.\n        **modal_token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Segment token indices to indicate different portions of the non-text modality.\n            The embeddings from these tokens will be summed with the respective token embeddings for the non-text modality.\n        **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings.\n        **modal_position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:\n            Indices of positions of each input sequence tokens in the position embeddings for the non-text modality.\n        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        **inputs_embeds**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, embedding_dim)``:\n            Optionally, instead of passing ``input_ids`` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        **encoder_hidden_states**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``:\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model\n            is configured as a decoder.\n        **encoder_attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:\n            Mask to avoid performing attention on the padding token indices of the encoder input. 
This mask\n            is used in the cross-attention if the model is configured as a decoder.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare MMBT Model outputting raw hidden-states without any specific head on top.\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\"\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``\n                Sequence of hidden-states at the output of the last layer of the model.\n            **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Bert pretraining. This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            mmbt = MMBTModel(config, transformer, encoder)\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.config = config\n        self.transformer = transformer\n        self.modal_encoder = ModalEmbeddings(config, encoder, transformer.embeddings)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_txt_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_txt_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        modal_embeddings = self.modal_encoder(\n            input_modal,\n            start_token=modal_start_tokens,\n            end_token=modal_end_tokens,\n            position_ids=modal_position_ids,\n            token_type_ids=modal_token_type_ids,\n        )\n\n        input_modal_shape = modal_embeddings.size()[:-1]\n\n        if token_type_ids is None:\n            token_type_ids = torch.ones(input_txt_shape, dtype=torch.long, device=device)\n\n        txt_embeddings = self.transformer.embeddings(\n            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds\n        )\n\n        embedding_output = torch.cat([modal_embeddings, txt_embeddings], 1)\n\n        input_shape = embedding_output.size()[:-1]\n\n        if attention_mask is None:\n            attention_mask = torch.ones(input_shape, device=device)\n        else:\n            attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device, dtype=torch.long), attention_mask], dim=1\n            )\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(input_shape, device=device)\n        else:\n            encoder_attention_mask = torch.cat(\n                [torch.ones(input_modal_shape, device=device), encoder_attention_mask], dim=1\n            )\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, self.device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        encoder_outputs = self.transformer.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n        )\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.transformer.pooler(sequence_output)\n\n        outputs = (sequence_output, 
pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\n    \"\"\"MMBT Model with a sequence classification/regression head on top (a linear layer on top of\n                      the pooled output)\"\"\",\n    MMBT_START_DOCSTRING,\n    MMBT_INPUTS_DOCSTRING,\n)\nclass MMBTForClassification(nn.Module):\n    r\"\"\"\n            **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in ``[0, ..., config.num_labels - 1]``.\n                If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n                If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n        Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n            **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n                Classification (or regression if config.num_labels==1) loss.\n            **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``\n                Classification (or regression if config.num_labels==1) scores (before SoftMax).\n            **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n                list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)\n                of shape ``(batch_size, sequence_length, hidden_size)``:\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n                list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            transformer = BertModel.from_pretrained('bert-base-uncased')\n            encoder = ImageEncoder(args)\n            model = MMBTForClassification(config, transformer, encoder)\n            outputs = model(input_modal, input_ids, labels=labels)\n            loss, logits = outputs[:2]\n        \"\"\"\n\n    def __init__(self, config, transformer, encoder):\n        super().__init__()\n        self.num_labels = config.num_labels\n\n        self.mmbt = MMBTModel(config, transformer, encoder)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(\n        self,\n        input_modal,\n        input_ids=None,\n        modal_start_tokens=None,\n        modal_end_tokens=None,\n        attention_mask=None,\n        token_type_ids=None,\n        modal_token_type_ids=None,\n        position_ids=None,\n        modal_position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n\n        outputs = self.mmbt(\n            input_modal=input_modal,\n            input_ids=input_ids,\n            modal_start_tokens=modal_start_tokens,\n            modal_end_tokens=modal_end_tokens,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            modal_token_type_ids=modal_token_type_ids,\n            position_ids=position_ids,\n            modal_position_ids=modal_position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch OpenAI GPT model.\"\"\"\n\n\nimport json\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu_new, swish\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer\n\n\nlogger = logging.getLogger(__name__)\n\nOPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef load_tf_weights_in_openai_gpt(model, config, openai_checkpoint_folder_path):\n    \"\"\" Load tf pre-trained weights in a pytorch model (from NumPy arrays here)\n    \"\"\"\n    import re\n    import numpy as np\n\n    if \".ckpt\" in openai_checkpoint_folder_path:\n        openai_checkpoint_folder_path = os.path.dirname(openai_checkpoint_folder_path)\n\n    logger.info(\"Loading weights from {}\".format(openai_checkpoint_folder_path))\n\n    with open(openai_checkpoint_folder_path + \"/parameters_names.json\", \"r\", encoding=\"utf-8\") as names_handle:\n        names = json.load(names_handle)\n    with open(openai_checkpoint_folder_path + \"/params_shapes.json\", \"r\", encoding=\"utf-8\") as shapes_handle:\n        shapes = json.load(shapes_handle)\n    offsets = np.cumsum([np.prod(shape) for shape in shapes])\n    init_params = [np.load(openai_checkpoint_folder_path + \"/params_{}.npy\".format(n)) for n in range(10)]\n    init_params = np.split(np.concatenate(init_params, 0), offsets)[:-1]\n    init_params = [param.reshape(shape) for param, shape in zip(init_params, shapes)]\n\n    # This was used when we had a single embedding matrix for positions and tokens\n    # init_params[0] = np.concatenate([init_params[1], init_params[0]], 0)\n    # model init_params[1]\n    init_params = [arr.squeeze() for arr in init_params]\n\n    try:\n        assert model.tokens_embed.weight.shape == init_params[1].shape\n        assert model.positions_embed.weight.shape == init_params[0].shape\n    except AssertionError as e:\n        e.args += (model.tokens_embed.weight.shape, init_params[1].shape)\n        e.args += (model.positions_embed.weight.shape, init_params[0].shape)\n        raise\n\n    model.tokens_embed.weight.data = torch.from_numpy(init_params[1])\n    model.positions_embed.weight.data = torch.from_numpy(init_params[0])\n    names.pop(0)\n    # Pop position and token embedding arrays\n    init_params.pop(0)\n    init_params.pop(0)\n\n    for name, array in zip(names, init_params):  # names[1:n_transfer], init_params[1:n_transfer]):\n        name = name[6:]  # skip \"model/\"\n        assert name[-2:] == \":0\"\n        name = name[:-2]\n        
name = name.split(\"/\")\n        pointer = model\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+\\d+\", m_name):\n                scope_names = re.split(r\"(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] == \"g\":\n                pointer = getattr(pointer, \"weight\")\n            elif scope_names[0] == \"b\":\n                pointer = getattr(pointer, \"bias\")\n            elif scope_names[0] == \"w\":\n                pointer = getattr(pointer, \"weight\")\n            else:\n                pointer = getattr(pointer, scope_names[0])\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array)\n    return model\n\n\nACT_FNS = {\"relu\": nn.ReLU, \"swish\": swish, \"gelu\": gelu_new}\n\n\nclass Attention(nn.Module):\n    def __init__(self, nx, n_ctx, config, scale=False):\n        super().__init__()\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.register_buffer(\"bias\", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.output_attentions = config.output_attentions\n\n        self.c_attn = Conv1D(n_state * 3, nx)\n        self.c_proj = Conv1D(n_state, nx)\n        self.attn_dropout = nn.Dropout(config.attn_pdrop)\n        self.resid_dropout = nn.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_head, self.split_size // self.n_head)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])\n        # Prune conv1d layers\n        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)\n        self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)\n        # Update hyper params\n        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))\n        self.n_head = self.n_head - len(heads)\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def _attn(self, q, k, v, attention_mask=None, head_mask=None):\n        w = torch.matmul(q, k)\n        if self.scale:\n            w = w / math.sqrt(v.size(-1))\n        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implem method: mask_attn_weights\n        # XD: self.b may be larger than w, so we need to crop it\n        b = self.bias[:, :, : w.size(-2), : w.size(-1)]\n        w = w * b + -1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the 
attention mask\n            w = w + attention_mask\n\n        w = nn.Softmax(dim=-1)(w)\n        w = self.attn_dropout(w)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [torch.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = x.permute(0, 2, 1, 3).contiguous()\n        new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)\n        return x.view(*new_x_shape)  # in Tensorflow implem: fct merge_states\n\n    def split_heads(self, x, k=False):\n        new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)\n        x = x.view(*new_x_shape)  # in Tensorflow implem: fct split_states\n        if k:\n            return x.permute(0, 2, 3, 1)\n        else:\n            return x.permute(0, 2, 1, 3)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        x = self.c_attn(x)\n        query, key, value = x.split(self.split_size, dim=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key, k=True)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass MLP(nn.Module):\n    def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)\n        super().__init__()\n        nx = config.n_embd\n        self.c_fc = Conv1D(n_state, nx)\n        self.c_proj = Conv1D(nx, n_state)\n        self.act = ACT_FNS[config.afn]\n        self.dropout = nn.Dropout(config.resid_pdrop)\n\n    def forward(self, x):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        return self.dropout(h2)\n\n\nclass Block(nn.Module):\n    def __init__(self, n_ctx, config, scale=False):\n        super().__init__()\n        nx = config.n_embd\n        self.attn = Attention(nx, n_ctx, config, scale)\n        self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n        self.mlp = MLP(4 * nx, config)\n        self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)\n\n    def forward(self, x, attention_mask=None, head_mask=None):\n        attn_outputs = self.attn(x, attention_mask=attention_mask, head_mask=head_mask)\n        a = attn_outputs[0]\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n)\n        h = self.ln_2(n + m)\n\n        outputs = [h] + attn_outputs[1:]\n        return outputs\n\n\nclass OpenAIGPTPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    load_tf_weights = load_tf_weights_in_openai_gpt\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:\n                module.bias.data.zero_()\n        elif 
isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.OpenAIGPTTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputting raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd)\n        self.positions_embed = nn.Embedding(config.n_positions, config.n_embd)\n        self.drop = nn.Dropout(config.embd_pdrop)\n        self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def set_input_embeddings(self, new_embeddings):\n        self.tokens_embed = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.h[layer].attn.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the 
self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            # Code is different from when we had a single embedding matrice from position and token embeddings\n            device = input_ids.device if input_ids is not None else inputs_embeds.device\n            position_ids = torch.arange(input_shape[-1], dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility\n            attention_mask = (1.0 - attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layer)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids)\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))\n            token_type_embeds = self.tokens_embed(token_type_ids)\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states)\n\n        output_shape = input_shape + (hidden_states.size(-1),)\n\n        all_attentions = ()\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(hidden_states.view(*output_shape),)\n\n            outputs = block(hidden_states, attention_mask, head_mask[i])\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions = all_attentions + (outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)\n\n        outputs = (hidden_states.view(*output_shape),)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n        loss, logits = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = transformer_outputs[0]\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), lm_logits, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        config.num_labels = 1\n        self.transformer = OpenAIGPTModel(config)\n        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)\n        self.multiple_choice_head = SequenceSummary(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        lm_labels=None,\n        mc_labels=None,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`)\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``\n            Indices are selected in ``[-1, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. 
(see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``lm_labels`` is provided):\n            Language modeling loss.\n        mc_loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`multiple_choice_labels` is provided):\n            Multiple choice classification loss.\n        lm_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import OpenAIGPTTokenizer, OpenAIGPTDoubleHeadsModel\n        import torch\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})  # Add a [CLS] to the vocabulary (we should train it also!)\n        model.resize_token_embeddings(len(tokenizer))\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        mc_token_ids = torch.tensor([input_ids.size(-1)-1, input_ids.size(-1)-1]).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n    \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        hidden_states = 
transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n        if mc_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))\n            outputs = (loss,) + outputs\n        if lm_labels is not None:\n            shift_logits = lm_logits[..., :-1, :].contiguous()\n            shift_labels = lm_labels[..., 1:].contiguous()\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (lm loss), (mc loss), lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch REFORMER model. \"\"\"\n\nimport logging\nimport sys\nfrom collections import namedtuple\nfrom functools import reduce\nfrom operator import mul\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.autograd.function import Function\nfrom torch.nn import CrossEntropyLoss\n\nfrom .activations import gelu, gelu_fast, gelu_new, swish\nfrom .configuration_reformer import ReformerConfig\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, apply_chunking_to_forward\n\n\nlogger = logging.getLogger(__name__)\n\nREFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/reformer-crime-and-punishment\",\n    \"google/reformer-enwik8\",\n    # See all Reformer models at https://huggingface.co/models?filter=reformer\n]\n\n\ndef mish(x):\n    return x * torch.tanh(nn.functional.softplus(x))\n\n\nACT2FN = {\n    \"gelu\": gelu,\n    \"relu\": torch.nn.functional.relu,\n    \"swish\": swish,\n    \"gelu_new\": gelu_new,\n    \"gelu_fast\": gelu_fast,\n    \"mish\": mish,\n}\n\n\n# Define named tuples for nn.Modules here\nLSHSelfAttentionOutput = namedtuple(\"LSHSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nLocalSelfAttentionOutput = namedtuple(\"LocalSelfAttentionOutput\", [\"hidden_states\", \"attention_probs\"])\nAttentionOutput = namedtuple(\"AttentionOutput\", [\"hidden_states\", \"attention_probs\", \"buckets\"])\nReformerOutput = namedtuple(\"ReformerOutput\", [\"hidden_states\", \"attn_output\", \"attention_probs\", \"buckets\"])\nReformerBackwardOutput = namedtuple(\n    \"ReformerBackwardOutput\", [\"attn_output\", \"hidden_states\", \"grad_attn_output\", \"grad_hidden_states\"]\n)\nReformerEncoderOutput = namedtuple(\"ReformerEncoderOutput\", [\"hidden_states\", \"all_hidden_states\", \"all_attentions\"])\n\n\ndef _get_least_common_mult_chunk_len(config):\n    attn_types = config.attn_layers\n    attn_types_set = set(attn_types)\n    if len(attn_types_set) == 1 and attn_types[0] == \"lsh\":\n        return config.lsh_attn_chunk_length\n    elif len(attn_types_set) == 1 and attn_types[0] == \"local\":\n        return config.local_attn_chunk_length\n    elif len(attn_types_set) == 2 and attn_types_set == set([\"lsh\", \"local\"]):\n        return np.lcm(config.lsh_attn_chunk_length, config.local_attn_chunk_length)\n    else:\n        raise NotImplementedError(\n            \"Only attn layer types 'lsh' and 'local' exist, but `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                config.attn_layers\n            )\n        )\n\n\nclass AxialPositionEmbeddings(nn.Module):\n    \"\"\"Constructs axial position embeddings. 
Useful for very long input\n    sequences to save memory and time.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.axial_pos_shape = config.axial_pos_shape\n        self.axial_pos_embds_dim = config.axial_pos_embds_dim\n        self.dropout = config.hidden_dropout_prob\n\n        self.least_common_mult_chunk_length = _get_least_common_mult_chunk_len(config)\n        self.weights = nn.ParameterList()\n\n        assert (\n            sum(self.axial_pos_embds_dim) == config.hidden_size\n        ), \"Make sure that config.axial_pos_embds factors: {} sum to config.hidden_size: {}\".format(\n            self.axial_pos_embds_dim, config.hidden_size\n        )\n\n        # create weights\n        for axis, axial_pos_embd_dim in enumerate(self.axial_pos_embds_dim):\n            # create expanded shapes\n            ax_shape = [1] * len(self.axial_pos_shape)\n            ax_shape[axis] = self.axial_pos_shape[axis]\n            ax_shape = tuple(ax_shape) + (axial_pos_embd_dim,)\n\n            # create tensor and init\n            self.weights.append(nn.Parameter(torch.ones(ax_shape, dtype=torch.float32)))\n\n    def forward(self, position_ids):\n        # broadcast weights to correct shape\n        batch_size = position_ids.shape[0]\n        sequence_length = position_ids.shape[1]\n\n        broadcasted_weights = [\n            weight.expand((batch_size,) + self.axial_pos_shape + weight.shape[-1:]) for weight in self.weights\n        ]\n\n        if self.training is True:\n            assert (\n                reduce(mul, self.axial_pos_shape) == sequence_length\n            ), \"If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.\".format(\n                self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)\n            )\n            if self.dropout > 0:\n                weights = torch.cat(broadcasted_weights, dim=-1)\n                # permute weights so that 2D correctly drops dims 1 and 2\n                transposed_weights = weights.transpose(2, 1)\n                # drop entire matrix of last two dims (prev dims 1 and 2)\n                dropped_transposed_weights = nn.functional.dropout2d(\n                    transposed_weights, p=self.dropout, training=self.training\n                )\n                dropped_weights = dropped_transposed_weights.transpose(2, 1)\n\n                position_encodings = torch.reshape(dropped_weights, (batch_size, sequence_length, -1))\n\n            else:\n                position_encodings = torch.cat(\n                    [torch.reshape(weight, (batch_size, sequence_length, -1)) for weight in broadcasted_weights],\n                    dim=-1,\n                )\n\n        else:\n            assert (\n                reduce(mul, self.axial_pos_shape) >= sequence_length\n            ), \"Make sure that config.axial_pos_shape factors: {} multiply at least to max(sequence_length, least_common_mult_chunk_length): max({}, {})\".format(\n                self.axial_pos_shape, sequence_length, self.least_common_mult_chunk_length,\n            )\n\n            # reshape axial encodings and use only until sequence_length\n            position_encodings = torch.cat(broadcasted_weights, dim=-1)\n            position_encodings = position_encodings.view(batch_size, -1, position_encodings.shape[-1])[\n                
:, :sequence_length\n            ]\n\n        return position_encodings\n\n\nclass PositionEmbeddings(nn.Module):\n    \"\"\"Constructs conventional position embeddings of shape `[max_pos_embeddings, hidden_size]`.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n        self.embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size)\n\n    def forward(self, position_ids):\n        position_embeddings = self.embedding(position_ids)\n        position_embeddings = nn.functional.dropout(position_embeddings, p=self.dropout, training=self.training)\n        return position_embeddings\n\n\nclass ReformerEmbeddings(nn.Module):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.max_position_embeddings = config.max_position_embeddings\n        self.dropout = config.hidden_dropout_prob\n\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)\n        self.position_embeddings = (\n            AxialPositionEmbeddings(config) if config.axial_pos_embds else PositionEmbeddings(config)\n        )\n\n    def forward(self, input_ids=None, position_ids=None, inputs_embeds=None):\n        if input_ids is not None:\n            input_shape = input_ids.size()\n            device = input_ids.device\n        else:\n            input_shape = inputs_embeds.size()[:-1]\n            device = inputs_embeds.device\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand(input_shape)\n\n        if inputs_embeds is None:\n            inputs_embeds = self.word_embeddings(input_ids)\n\n        assert (\n            position_ids.shape[-1] <= self.max_position_embeddings\n        ), \"Sequence Length: {} has to be larger equal than config.max_position_embeddings: {}\".format(\n            position_ids.shape[-1], self.max_position_embeddings\n        )\n\n        # dropout\n        embeddings = nn.functional.dropout(inputs_embeds, p=self.dropout, training=self.training)\n\n        # add positional embeddings\n        position_embeddings = self.position_embeddings(position_ids)\n        embeddings = embeddings + position_embeddings\n        return embeddings\n\n\nclass EfficientAttentionMixin:\n    \"\"\"\n    A few utilities for nn.Modules in Reformer, to be used as a mixin.\n    \"\"\"\n\n    def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after):\n        \"\"\" Used to implement attention between consecutive chunks.\n\n            Args:\n                vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...]\n                num_chunks_before: chunks before current chunk to include in attention\n                num_chunks_after: chunks after current chunk to include in attention\n\n            Returns:\n                tensor of shape [num_chunks, N * chunk_length, ...], where\n                N = (1 + num_chunks_before + num_chunks_after).\n        \"\"\"\n        if num_chunks_before == 0 and num_chunks_after == 0:\n            return vectors\n\n        slices = []\n        for i in range(-num_chunks_before, num_chunks_after + 1):\n            if i == 0:\n                slices.append(vectors)\n            else:\n                slices.append(torch.cat([vectors[:, :, i:, ...], 
vectors[:, :, :i, ...]], dim=2))\n        return torch.cat(slices, dim=3)\n\n    def _split_hidden_size_dim(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            splits hidden_size dim into attn_head_size and num_attn_heads\n        \"\"\"\n        new_x_shape = x.size()[:-1] + (num_attn_heads, attn_head_size)\n        x = x.view(*new_x_shape)\n        return x.transpose(2, 1)\n\n    def _merge_hidden_size_dims(self, x, num_attn_heads, attn_head_size):\n        \"\"\"\n            merges attn_head_size dim and num_attn_heads dim into hidden_size\n        \"\"\"\n        x = x.permute(0, 2, 1, 3)\n        return torch.reshape(x, (x.size()[0], -1, num_attn_heads * attn_head_size))\n\n    def _split_seq_length_dim_to(self, vectors, dim_factor_1, dim_factor_2, num_attn_heads, attn_head_size=None):\n        \"\"\"\n            splits sequence length dim of vectors into `dim_factor_1` and `dim_factor_2` dims\n        \"\"\"\n        batch_size = vectors.shape[0]\n        split_dim_shape = (batch_size, num_attn_heads, dim_factor_1, dim_factor_2)\n\n        if len(vectors.shape) == 4:\n            return torch.reshape(vectors, split_dim_shape + (attn_head_size,))\n        elif len(vectors.shape) == 3:\n            return torch.reshape(vectors, split_dim_shape)\n        else:\n            raise ValueError(\"Input vector rank should be one of [3, 4], but is: {}\".format(len(vectors.shape)))\n\n\nclass LSHSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n\n        self.chunk_length = config.lsh_attn_chunk_length\n        self.num_hashes = config.num_hashes\n        self.num_buckets = config.num_buckets\n        self.num_chunks_before = config.lsh_num_chunks_before\n        self.num_chunks_after = config.lsh_num_chunks_after\n        self.hash_seed = config.hash_seed\n        self.is_decoder = config.is_decoder\n        self.max_position_embeddings = config.max_position_embeddings\n\n        self.dropout = config.lsh_attention_probs_dropout_prob\n\n        self.num_attention_heads = config.num_attention_heads\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        # projection matrices\n        self.query_key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        # save mask value here. 
Need fp32 and fp16 mask values\n        self.register_buffer(\"self_mask_value_float16\", torch.tensor(-1e3))\n        self.register_buffer(\"self_mask_value_float32\", torch.tensor(-1e5))\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n        **kwargs\n    ):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # num hashes can optionally be overwritten by user\n        num_hashes = num_hashes if num_hashes is not None else self.num_hashes\n\n        # project hidden_states to query_key and value\n        query_key_vectors = self.query_key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # free memory\n        del hidden_states\n\n        query_key_vectors = self._split_hidden_size_dim(\n            query_key_vectors, self.num_attention_heads, self.attention_head_size\n        )\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of value_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        # set `num_buckets` on the fly, recommended way to do it\n        if self.num_buckets is None:\n            self._set_num_buckets(sequence_length)\n\n        # use cached buckets for backprop only\n        if buckets is None:\n            # hash query key vectors into buckets\n            buckets = self._hash_vectors(query_key_vectors, num_hashes)\n\n        assert (\n            int(buckets.shape[-1]) == num_hashes * sequence_length\n        ), \"last dim of buckets is {}, but should be {}\".format(buckets.shape[-1], num_hashes * sequence_length)\n\n        sorted_bucket_idx, undo_sorted_bucket_idx = self._get_sorted_bucket_idx_and_undo_sorted_bucket_idx(\n            sequence_length, buckets, num_hashes\n        )\n\n        # make sure bucket idx is not longer then sequence length\n        sorted_bucket_idx = sorted_bucket_idx % sequence_length\n\n        # cluster query key value vectors according to hashed buckets\n        query_key_vectors = self._gather_by_expansion(query_key_vectors, sorted_bucket_idx, num_hashes)\n        value_vectors = self._gather_by_expansion(value_vectors, sorted_bucket_idx, num_hashes)\n\n        query_key_vectors = self._split_seq_length_dim_to(\n            query_key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # 
scale key vectors\n        key_vectors = self._len_and_dim_norm(query_key_vectors)\n\n        # get attention probs\n        out_vectors, logits, attention_probs = self._attend(\n            query_vectors=query_key_vectors,\n            key_vectors=key_vectors,\n            value_vectors=value_vectors,\n            sorted_bucket_idx=sorted_bucket_idx,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n        )\n        # free memory\n        del query_key_vectors, key_vectors, value_vectors\n\n        # sort clusters back to correct ordering\n        out_vectors, logits = ReverseSort.apply(\n            out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, self.num_hashes\n        )\n\n        # sum up all hash rounds\n        if num_hashes > 1:\n            out_vectors = self._split_seq_length_dim_to(\n                out_vectors, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            )\n            logits = self._split_seq_length_dim_to(\n                logits, num_hashes, sequence_length, self.num_attention_heads, self.attention_head_size,\n            ).unsqueeze(-1)\n\n            probs_vectors = torch.exp(logits - torch.logsumexp(logits, dim=2, keepdim=True))\n            out_vectors = torch.sum(out_vectors * probs_vectors, dim=2)\n            # free memory\n            del probs_vectors\n\n        # free memory\n        del logits\n\n        assert out_vectors.shape == (\n            batch_size,\n            self.num_attention_heads,\n            sequence_length,\n            self.attention_head_size,\n        ), \"out_vectors have be of shape `[batch_size, config.num_attention_heads, sequence_length, config.attention_head_size]`.\"\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LSHSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs, buckets=buckets)\n\n    def _hash_vectors(self, vectors, num_hashes):\n        batch_size = vectors.shape[0]\n\n        # See https://arxiv.org/pdf/1509.02897.pdf\n        # We sample a different random rotation for each round of hashing to\n        # decrease the probability of hash misses.\n        if isinstance(self.num_buckets, int):\n            assert (\n                self.num_buckets % 2 == 0\n            ), \"There should be an even number of bucktes, but `self.num_bucktes`: {}\".format(self.num_buckets)\n            rotation_size = self.num_buckets\n            num_buckets = self.num_buckets\n        else:\n            # Factorize the hash if self.num_buckets is a list or tuple\n            rotation_size, num_buckets = 0, 1\n            for bucket_factor in self.num_buckets:\n                assert bucket_factor % 2 == 0, \"The number of buckets should be even, but `num_bucket`: {}\".format(\n                    bucket_factor\n                )\n                rotation_size = rotation_size + bucket_factor\n                num_buckets = num_buckets * bucket_factor\n\n        # remove gradient\n        vectors = vectors.detach()\n\n        if self.hash_seed is not None:\n            # for determinism\n            torch.manual_seed(self.hash_seed)\n\n        rotations_shape = (self.num_attention_heads, vectors.shape[-1], num_hashes, rotation_size // 2)\n        # create a random self.attention_head_size x num_hashes x num_buckets/2\n        random_rotations = 
torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)\n\n        # Output dim: Batch_Size x Num_Attn_Heads x Num_Hashes x Seq_Len x Num_Buckets/2\n        rotated_vectors = torch.einsum(\"bmtd,mdhr->bmhtr\", vectors, random_rotations)\n\n        if isinstance(self.num_buckets, int) or len(self.num_buckets) == 1:\n            rotated_vectors = torch.cat([rotated_vectors, -rotated_vectors], dim=-1)\n            buckets = torch.argmax(rotated_vectors, dim=-1)\n        else:\n            # Get the buckets for them and combine.\n            buckets, cur_sum, cur_product = None, 0, 1\n            for bucket_factor in self.num_buckets:\n                rotated_vectors_factor = rotated_vectors[..., cur_sum : cur_sum + (bucket_factor // 2)]\n                cur_sum = cur_sum + bucket_factor // 2\n                rotated_vectors_factor = torch.cat([rotated_vectors_factor, -rotated_vectors_factor], dim=-1)\n\n                if buckets is None:\n                    buckets = torch.argmax(rotated_vectors_factor, dim=-1)\n                else:\n                    buckets = buckets + (cur_product * torch.argmax(rotated_vectors_factor, dim=-1))\n\n                cur_product = cur_product * bucket_factor\n\n        # buckets is now (Batch_size x Num_Attn_Heads x Num_Hashes x Seq_Len).\n        # Next we add offsets so that bucket numbers from different hashing rounds don't overlap.\n        offsets = torch.arange(num_hashes, device=vectors.device)\n        offsets = (offsets * num_buckets).view((1, 1, -1, 1))\n\n        # expand to batch size and num attention heads\n        offsets = offsets.expand((batch_size, self.num_attention_heads) + offsets.shape[-2:])\n        offset_buckets = (buckets + offsets).flatten(start_dim=2, end_dim=3)\n\n        return offset_buckets\n\n    def _get_sorted_bucket_idx_and_undo_sorted_bucket_idx(self, sequence_length, buckets, num_hashes):\n        # no gradients are needed\n        with torch.no_grad():\n            batch_size = buckets.shape[0]\n\n            # arange and expand\n            orig_indices = torch.arange(num_hashes * sequence_length, device=buckets.device).view(1, 1, -1)\n            orig_indices = orig_indices.expand(batch_size, self.num_attention_heads, orig_indices.shape[-1])\n\n            # scale buckets\n            scaled_buckets = sequence_length * buckets + (orig_indices % sequence_length)\n\n            # remove gradient\n            scaled_buckets = scaled_buckets.detach()\n\n            # Hash-based sort\n            sorted_bucket_idx = torch.argsort(scaled_buckets, dim=-1)\n\n            # create simple indices to scatter to, to have undo sort\n            indices = (\n                torch.arange(sorted_bucket_idx.shape[-1], device=buckets.device)\n                .view(1, 1, -1)\n                .expand(sorted_bucket_idx.shape)\n            )\n\n            # get undo sort\n            undo_sorted_bucket_idx = sorted_bucket_idx.new(*sorted_bucket_idx.size())\n            undo_sorted_bucket_idx.scatter_(-1, sorted_bucket_idx, indices)\n\n        return sorted_bucket_idx, undo_sorted_bucket_idx\n\n    def _set_num_buckets(self, sequence_length):\n        # `num_buckets` should be set to 2 * sequence_length // chunk_length as recommended in paper\n        num_buckets_pow_2 = (2 * (sequence_length // self.chunk_length)).bit_length() - 1\n        # make sure buckets are power of 2\n        num_buckets = 2 ** num_buckets_pow_2\n\n        # factorize `num_buckets` if `num_buckets` becomes too large\n        num_buckets_limit = 
2 * max(\n            int((self.max_position_embeddings // self.chunk_length) ** (0.5)), self.chunk_length,\n        )\n        if num_buckets > num_buckets_limit:\n            num_buckets = [2 ** (num_buckets_pow_2 // 2), 2 ** (num_buckets_pow_2 - num_buckets_pow_2 // 2)]\n\n        logger.warning(\"config.num_buckets is not set. Setting config.num_buckets to {}...\".format(num_buckets))\n\n        # set num buckets in config to be properly saved\n        self.config.num_buckets = num_buckets\n        self.num_buckets = num_buckets\n\n    def _attend(\n        self, query_vectors, key_vectors, value_vectors, sorted_bucket_idx, attention_mask, head_mask,\n    ):\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n\n        # get logits and dots\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        query_bucket_idx = self._split_seq_length_dim_to(\n            sorted_bucket_idx, -1, self.chunk_length, self.num_attention_heads\n        )\n        key_value_bucket_idx = self._look_adjacent(query_bucket_idx, self.num_chunks_before, self.num_chunks_after)\n\n        # get correct mask values depending on precision\n        if query_key_dots.dtype == torch.float16:\n            self_mask_value = self.self_mask_value_float16.half()\n            mask_value = self.mask_value_float16.half()\n        else:\n            self_mask_value = self.self_mask_value_float32\n            mask_value = self.mask_value_float32\n\n        mask = self._compute_attn_mask(query_bucket_idx, key_value_bucket_idx, attention_mask)\n\n        if mask is not None:\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # Self mask is ALWAYS applied.\n        # From the reformer paper (https://arxiv.org/pdf/2001.04451.pdf):\n        # \" While attention to the future is not allowed, typical implementations of the\n        # Transformer do allow a position to attend to itself.\n        # Such behavior is undesirable in a shared-QK formulation because the dot-product\n        # of a query vector with itself will almost always be greater than the dot product of a\n        # query vector with a vector at another position. We therefore modify the masking\n        # to forbid a token from attending to itself, except in situations\n        # where a token has no other valid attention targets (e.g. 
the first token in a sequence) \"\n\n        self_mask = torch.ne(query_bucket_idx.unsqueeze(-1), key_value_bucket_idx.unsqueeze(-2)).to(\n            query_bucket_idx.device\n        )\n\n        # apply self_mask\n        query_key_dots = torch.where(self_mask, query_key_dots, self_mask_value)\n\n        # free memory\n        del self_mask\n\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        # dots shape is `[batch_size, num_attn_heads, num_hashes * seq_len // chunk_length, chunk_length, chunk_length * (1 + num_chunks_before + num_chunks_after)]`\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del query_key_dots\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        logits = logits.flatten(start_dim=2, end_dim=3).squeeze(-1)\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        return out_vectors, logits, attention_probs\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask):\n        mask = None\n\n        # Causal mask\n        if self.is_decoder:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask: chunk, look up correct mask value from key_value_bucket_idx\n        # IMPORTANT: official trax code does not use a mask for LSH Atttention. Not sure why.\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, None, :]\n            # expand attn_mask to fit with key_value_bucket_idx shape\n            attention_mask = attention_mask.expand(query_indices.shape[:-1] + (-1,))\n            key_attn_mask = torch.gather(attention_mask, -1, key_indices)\n            query_attn_mask = torch.gather(attention_mask, -1, query_indices)\n            # expand to query_key_dots shape: duplicate along query axis since key sorting is the same for each query position in chunk\n            attn_mask = query_attn_mask.unsqueeze(-1) * key_attn_mask.unsqueeze(-2)\n            # free memory\n            del query_attn_mask, key_attn_mask, attention_mask\n\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n\n        return mask\n\n    def _len_and_dim_norm(self, vectors):\n        \"\"\"\n            length and attention head size dim normalization\n        \"\"\"\n        vectors = self._len_norm(vectors)\n        vectors = vectors * torch.rsqrt(\n            torch.tensor(self.attention_head_size, device=vectors.device, dtype=vectors.dtype)\n        )\n        return vectors\n\n    def _len_norm(self, x, epsilon=1e-6):\n        \"\"\"\n            length normalization\n        \"\"\"\n        variance = torch.mean(x ** 2, -1, keepdim=True)\n        norm_x = x * torch.rsqrt(variance + epsilon)\n        return norm_x\n\n    def _gather_by_expansion(self, vectors, idxs, num_hashes):\n        \"\"\"\n            expand dims of idxs and vectors for all hashes and gather\n        \"\"\"\n        expanded_idxs = idxs.unsqueeze(-1).expand(-1, 
-1, -1, self.attention_head_size)\n        vectors = vectors.repeat(1, 1, num_hashes, 1)\n        return torch.gather(vectors, 2, expanded_idxs)\n\n\nclass ReverseSort(Function):\n    \"\"\"\n        After chunked attention is applied which sorted clusters,\n        original ordering has to be restored.\n        Since customized backward function is used for Reformer,\n        the gradients of the output vectors have to be explicitely\n        sorted here.\n    \"\"\"\n\n    @staticmethod\n    def forward(ctx, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, num_hashes):\n        # save sorted_bucket_idx for backprop\n        with torch.no_grad():\n            ctx.sorted_bucket_idx = sorted_bucket_idx\n            ctx.num_hashes = num_hashes\n\n            # undo sort to have correct order for next layer\n            expanded_undo_sort_indices = undo_sorted_bucket_idx.unsqueeze(-1).expand(out_vectors.shape)\n            out_vectors = torch.gather(out_vectors, 2, expanded_undo_sort_indices)\n            logits = torch.gather(logits, 2, undo_sorted_bucket_idx)\n        return out_vectors, logits\n\n    @staticmethod\n    def backward(ctx, grad_out_vectors, grad_logits):\n        # get parameters saved in ctx\n        sorted_bucket_idx = ctx.sorted_bucket_idx\n        num_hashes = ctx.num_hashes\n\n        # get real gradient shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes\n        grad_logits_shape = grad_logits.shape\n        # shape is BatchSize x NumAttnHeads x ChunkLen * NumHashes x ChunkLen\n        grad_out_vectors_shape = grad_out_vectors.shape\n\n        # split gradient vectors and sorted bucket idxs by concatenated chunk dimension to gather correct indices\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen\n        grad_logits = grad_logits.view((grad_logits_shape[:2] + (num_hashes, -1)))\n        # shape is BatchSize x NumAttnHeads x NumHashes x ChunkLen x ChunkLen\n        grad_out_vectors = grad_out_vectors.view(\n            (grad_out_vectors_shape[:2] + (num_hashes, -1) + grad_out_vectors_shape[-1:])\n        )\n\n        # reshape and expand\n        sorted_bucket_idx = torch.reshape(sorted_bucket_idx, (sorted_bucket_idx.shape[:2] + (num_hashes, -1)))\n        expanded_sort_indices = sorted_bucket_idx.unsqueeze(-1).expand(grad_out_vectors.shape)\n        # reverse sort of forward\n        grad_out_vectors = torch.gather(grad_out_vectors, 3, expanded_sort_indices)\n        grad_logits = torch.gather(grad_logits, 3, sorted_bucket_idx)\n\n        # reshape into correct shape\n        grad_logits = torch.reshape(grad_logits, grad_logits_shape)\n        grad_out_vectors = torch.reshape(grad_out_vectors, grad_out_vectors_shape)\n\n        # return grad and `None` fillers for last 3 forward args\n        return grad_out_vectors, grad_logits, None, None, None\n\n\nclass LocalSelfAttention(nn.Module, EfficientAttentionMixin):\n    def __init__(self, config):\n        super().__init__()\n\n        self.num_attention_heads = config.num_attention_heads\n        self.chunk_length = config.local_attn_chunk_length\n        self.num_chunks_before = config.local_num_chunks_before\n        self.num_chunks_after = config.local_num_chunks_after\n        self.is_decoder = config.is_decoder\n        self.pad_token_id = config.pad_token_id\n\n        self.attention_head_size = config.attention_head_size\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        self.hidden_size = config.hidden_size\n\n        
# projection matrices\n        self.query = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)\n\n        self.dropout = config.local_attention_probs_dropout_prob\n\n        # save mask value here\n        self.register_buffer(\"mask_value_float16\", torch.tensor(-1e4))\n        self.register_buffer(\"mask_value_float32\", torch.tensor(-1e9))\n\n    def forward(self, hidden_states, attention_mask=None, head_mask=None, do_output_attentions=False, **kwargs):\n        sequence_length = hidden_states.shape[1]\n        batch_size = hidden_states.shape[0]\n\n        # project hidden_states to query, key and value\n        query_vectors = self.query(hidden_states)\n        key_vectors = self.key(hidden_states)\n        value_vectors = self.value(hidden_states)\n\n        # split last dim into `config.num_attention_heads` and `config.attention_head_size`\n        query_vectors = self._split_hidden_size_dim(query_vectors, self.num_attention_heads, self.attention_head_size)\n        key_vectors = self._split_hidden_size_dim(key_vectors, self.num_attention_heads, self.attention_head_size)\n        value_vectors = self._split_hidden_size_dim(value_vectors, self.num_attention_heads, self.attention_head_size)\n\n        assert (\n            query_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            query_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            key_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            key_vectors.shape[-1], self.attention_head_size\n        )\n        assert (\n            value_vectors.shape[-1] == self.attention_head_size\n        ), \"last dim of query_key_vectors is {} but should be {}.\".format(\n            value_vectors.shape[-1], self.attention_head_size\n        )\n\n        if self.chunk_length is None:\n            assert (\n                self.num_chunks_before == 0 and self.num_chunks_after == 0\n            ), \"If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and `config.num_chunks_before` are set to 0.\"\n\n        # normalize key vectors\n        key_vectors = key_vectors / torch.sqrt(\n            torch.tensor(self.attention_head_size, device=key_vectors.device, dtype=key_vectors.dtype)\n        )\n\n        # chunk vectors\n        # B x Num_Attn_Head x Seq_Len // chunk_len x chunk_len  x  attn_head_size\n        query_vectors = self._split_seq_length_dim_to(\n            query_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        key_vectors = self._split_seq_length_dim_to(\n            key_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n        value_vectors = self._split_seq_length_dim_to(\n            value_vectors, -1, self.chunk_length, self.num_attention_heads, self.attention_head_size,\n        )\n\n        # chunk indices\n        indices = torch.arange(sequence_length, device=query_vectors.device).repeat(\n            batch_size, self.num_attention_heads, 1\n        )\n        query_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, self.num_attention_heads)\n        key_indices = self._split_seq_length_dim_to(indices, -1, self.chunk_length, 
self.num_attention_heads)\n\n        # append chunks before and after\n        key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)\n        value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)\n        key_indices = self._look_adjacent(key_indices, self.num_chunks_before, self.num_chunks_after)\n\n        query_key_dots = torch.matmul(query_vectors, key_vectors.transpose(-1, -2))\n\n        # free memory\n        del query_vectors, key_vectors\n\n        mask = self._compute_attn_mask(query_indices, key_indices, attention_mask, query_key_dots.shape)\n\n        if mask is not None:\n            # get mask tensor depending on half precision or not\n            if query_key_dots.dtype == torch.float16:\n                mask_value = self.mask_value_float16.half()\n            else:\n                mask_value = self.mask_value_float32\n\n            query_key_dots = torch.where(mask, query_key_dots, mask_value)\n\n        # free memory\n        del mask\n\n        # softmax\n        logits = torch.logsumexp(query_key_dots, dim=-1, keepdim=True)\n        attention_probs = torch.exp(query_key_dots - logits)\n\n        # free memory\n        del logits\n\n        # dropout\n        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        # attend values\n        out_vectors = torch.matmul(attention_probs, value_vectors)\n\n        # free memory\n        del value_vectors\n\n        # merge chunk length\n        out_vectors = out_vectors.flatten(start_dim=2, end_dim=3)\n\n        assert out_vectors.shape == (batch_size, self.num_attention_heads, sequence_length, self.attention_head_size,)\n\n        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)\n\n        if do_output_attentions is False:\n            attention_probs = ()\n\n        return LocalSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs)\n\n    def _compute_attn_mask(self, query_indices, key_indices, attention_mask, query_key_dots_shape):\n        mask = None\n\n        # chunk attention mask and look before and after\n        if attention_mask is not None:\n            attention_mask = attention_mask.to(torch.uint8)[:, None, :]\n            attention_mask = self._split_seq_length_dim_to(attention_mask, -1, self.chunk_length, 1)\n            attention_mask_key = self._look_adjacent(attention_mask, self.num_chunks_before, self.num_chunks_after)\n\n        # Causal mask\n        if self.is_decoder is True:\n            mask = torch.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2)).to(query_indices.device)\n\n        # Attention mask\n        if attention_mask is not None:\n            # create attn_mask\n            attn_mask = (attention_mask.unsqueeze(-1) * attention_mask_key.unsqueeze(-2)).expand(query_key_dots_shape)\n            # multiply by casaul mask if necessary\n            if mask is not None:\n                mask = mask * attn_mask\n            else:\n                mask = attn_mask\n        return mask\n\n\nclass ReformerSelfOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        all_head_size = config.num_attention_heads * config.attention_head_size\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = 
nn.Linear(all_head_size, config.hidden_size, bias=False)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ReformerAttention(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.layer_id = layer_id\n        self.attn_layers = config.attn_layers\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        if len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"lsh\":\n            self.self_attention = LSHSelfAttention(config)\n        elif len(set(self.attn_layers)) == 1 and self.attn_layers[0] == \"local\":\n            self.self_attention = LocalSelfAttention(config)\n        elif len(set(self.attn_layers)) == 2 and set(self.attn_layers) == set([\"lsh\", \"local\"]):\n            # get correct attn layers\n            if self.attn_layers[self.layer_id] == \"lsh\":\n                self.self_attention = LSHSelfAttention(config)\n            else:\n                self.self_attention = LocalSelfAttention(config)\n        else:\n            raise NotImplementedError(\n                \"Only attn layer types 'lsh' and 'local' exist, but got `config.attn_layers`: {}. Select attn layer types from ['lsh', 'local'] only.\".format(\n                    self.attn_layers\n                )\n            )\n        self.output = ReformerSelfOutput(config)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n        buckets=None,\n    ):\n        hidden_states = self.layer_norm(hidden_states)\n\n        # use cached buckets for backprob if buckets not None for LSHSelfAttention\n        self_attention_outputs = self.self_attention(\n            hidden_states=hidden_states,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_attentions=do_output_attentions,\n            buckets=buckets,\n        )\n        attention_output = self.output(self_attention_outputs.hidden_states)\n\n        # add buckets if necessary\n        if hasattr(self_attention_outputs, \"buckets\"):\n            buckets = self_attention_outputs.buckets\n        else:\n            buckets = None\n\n        return AttentionOutput(\n            hidden_states=attention_output, attention_probs=self_attention_outputs.attention_probs, buckets=buckets,\n        )\n\n\nclass ReformerFeedForwardDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        if isinstance(config.hidden_act, str):\n            self.act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.act_fn = config.hidden_act\n\n        self.dense = nn.Linear(config.hidden_size, config.feed_forward_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        hidden_states = self.act_fn(hidden_states)\n        return hidden_states\n\n\nclass ReformerFeedForwardOutput(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.dense = nn.Linear(config.feed_forward_size, 
config.hidden_size)\n\n    def forward(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n        return hidden_states\n\n\nclass ChunkReformerFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.chunk_size_feed_forward = config.chunk_size_feed_forward\n        self.seq_len_dim = 1\n\n        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense = ReformerFeedForwardDense(config)\n        self.output = ReformerFeedForwardOutput(config)\n\n    def forward(self, attention_output):\n        return apply_chunking_to_forward(\n            self.chunk_size_feed_forward, self.seq_len_dim, self.forward_chunk, attention_output,\n        )\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.layer_norm(hidden_states)\n        hidden_states = self.dense(hidden_states)\n        return self.output(hidden_states)\n\n\nclass ReformerLayer(nn.Module):\n    def __init__(self, config, layer_id=0):\n        super().__init__()\n        self.attention = ReformerAttention(config, layer_id)\n        # dropout requires to have the same\n        # seed for forward and backward pass\n        self.attention_seed = None\n        self.feed_forward_seed = None\n\n        self.feed_forward = ChunkReformerFeedForward(config)\n\n    def _init_attention_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            attention layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.attention_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.attention_seed)\n        else:\n            # CPU\n            self.attention_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.attention_seed)\n\n    def _init_feed_forward_seed(self):\n        \"\"\"\n            This function sets a new seed for the\n            feed forward layer to make dropout deterministic\n            for both forward calls: 1 normal forward\n            call and 1 forward call in backward\n            to recalculate activations.\n        \"\"\"\n\n        # randomize seeds\n        if next(self.parameters()).device.type == \"cuda\":\n            # GPU\n            device_idx = torch.cuda.current_device()\n            self.feed_forward_seed = torch.cuda.default_generators[device_idx].seed()\n            torch.cuda.manual_seed(self.feed_forward_seed)\n        else:\n            # CPU\n            self.feed_forward_seed = int(torch.seed() % sys.maxsize)\n            torch.manual_seed(self.feed_forward_seed)\n\n    def forward(\n        self,\n        prev_attn_output,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_attentions=False,\n    ):\n        with torch.no_grad():\n            # every forward pass we sample a different seed\n            # for dropout and save for forward fn in backward pass\n            # to have correct dropout\n            self._init_attention_seed()\n            attn_outputs = self.attention(\n                
hidden_states=hidden_states,\n                head_mask=head_mask,\n                attention_mask=attention_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = attn_outputs.hidden_states\n\n            # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n            # Y_1 = X_1 + f(X_2)\n            attn_output = prev_attn_output + attn_output\n\n            # free memory\n            del prev_attn_output\n\n            # every forward pass we sample a different seed\n            # for dropout and save seed for forward fn in backward\n            # to have correct dropout\n            self._init_feed_forward_seed()\n            # Y_2 = X_2 + g(Y_1)\n            hidden_states = hidden_states + self.feed_forward(attn_output)\n\n        return ReformerOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            attention_probs=attn_outputs.attention_probs,\n            buckets=attn_outputs.buckets,\n        )\n\n    def backward_pass(\n        self,\n        next_attn_output,\n        hidden_states,\n        grad_attn_output,\n        grad_hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        buckets=None,\n    ):\n        # Implements the backward pass for reversible ResNets.\n        # A good blog post on how this works can be found here:\n        # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)\n        # This code is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n\n        with torch.enable_grad():\n            next_attn_output.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.feed_forward_seed)\n            # g(Y_1)\n            res_hidden_states = self.feed_forward(next_attn_output)\n            res_hidden_states.backward(grad_hidden_states, retain_graph=True)\n\n        with torch.no_grad():\n            # X_2 = Y_2 - g(Y_1)\n            hidden_states = hidden_states - res_hidden_states\n            del res_hidden_states\n\n            grad_attn_output = grad_attn_output + next_attn_output.grad\n            next_attn_output.grad = None\n\n        with torch.enable_grad():\n            hidden_states.requires_grad = True\n\n            # set seed to have correct dropout\n            torch.manual_seed(self.attention_seed)\n            # f(X_2)\n            # use cached buckets for backprob if buckets not None for LSHSelfAttention\n            output = self.attention(\n                hidden_states=hidden_states, head_mask=head_mask, attention_mask=attention_mask, buckets=buckets,\n            ).hidden_states\n            output.backward(grad_attn_output, retain_graph=True)\n\n        with torch.no_grad():\n            # X_1 = Y_1 - f(X_2)\n            attn_output = next_attn_output - output\n            del output, next_attn_output\n\n            grad_hidden_states = grad_hidden_states + hidden_states.grad\n            hidden_states.grad = None\n            hidden_states = hidden_states.detach()\n\n        return ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n\nclass _ReversibleFunction(Function):\n    \"\"\"\n    To 
prevent PyTorch from performing the usual backpropagation,\n    a customized backward function is implemented here. This way\n    it is made sure that no memory expensive activations are\n    saved during the forward pass.\n    This function is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py\n    \"\"\"\n\n    @staticmethod\n    def forward(\n        ctx,\n        hidden_states,\n        layers,\n        attention_mask,\n        head_mask,\n        num_hashes,\n        all_hidden_states,\n        all_attentions,\n        do_output_hidden_states,\n        do_output_attentions,\n    ):\n        all_buckets = ()\n\n        # split duplicated tensor\n        hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)\n\n        for layer, layer_head_mask in zip(layers, head_mask):\n            if do_output_hidden_states is True:\n                all_hidden_states.append(hidden_states)\n\n            layer_outputs = layer(\n                prev_attn_output=attn_output,\n                hidden_states=hidden_states,\n                attention_mask=attention_mask,\n                head_mask=layer_head_mask,\n                num_hashes=num_hashes,\n                do_output_attentions=do_output_attentions,\n            )\n            attn_output = layer_outputs.attn_output\n            hidden_states = layer_outputs.hidden_states\n            all_buckets = all_buckets + (layer_outputs.buckets,)\n\n            if do_output_attentions:\n                all_attentions.append(layer_outputs.attention_probs)\n\n        # Add last layer\n        if do_output_hidden_states is True:\n            all_hidden_states.append(hidden_states)\n\n        # attach params to ctx for backward\n        ctx.save_for_backward(attn_output.detach(), hidden_states.detach())\n        ctx.layers = layers\n        ctx.all_buckets = all_buckets\n        ctx.head_mask = head_mask\n        ctx.attention_mask = attention_mask\n\n        # Concatenate 2 RevNet outputs\n        return torch.cat([attn_output, hidden_states], dim=-1)\n\n    @staticmethod\n    def backward(ctx, grad_hidden_states):\n        grad_attn_output, grad_hidden_states = torch.chunk(grad_hidden_states, 2, dim=-1)\n\n        # retrieve params from ctx for backward\n        attn_output, hidden_states = ctx.saved_tensors\n\n        # create tuple\n        output = ReformerBackwardOutput(\n            attn_output=attn_output,\n            hidden_states=hidden_states,\n            grad_attn_output=grad_attn_output,\n            grad_hidden_states=grad_hidden_states,\n        )\n\n        # free memory\n        del grad_attn_output, grad_hidden_states, attn_output, hidden_states\n\n        layers = ctx.layers\n        all_buckets = ctx.all_buckets\n        head_mask = ctx.head_mask\n        attention_mask = ctx.attention_mask\n\n        for idx, layer in enumerate(layers[::-1]):\n            # pop last buckets from stack\n            buckets = all_buckets[-1]\n            all_buckets = all_buckets[:-1]\n\n            # backprop\n            output = layer.backward_pass(\n                next_attn_output=output.attn_output,\n                hidden_states=output.hidden_states,\n                grad_attn_output=output.grad_attn_output,\n                grad_hidden_states=output.grad_hidden_states,\n                head_mask=head_mask[len(layers) - idx - 1],\n                attention_mask=attention_mask,\n                buckets=buckets,\n            )\n\n        assert all_buckets == (), \"buckets have 
to be empty after backpropagation\"\n        grad_hidden_states = torch.cat([output.grad_attn_output, output.grad_hidden_states], dim=-1)\n\n        # num of return vars has to match num of forward() args\n        # return gradient for hidden_states arg and None for other args\n        return grad_hidden_states, None, None, None, None, None, None, None, None\n\n\nclass ReformerEncoder(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.dropout = config.hidden_dropout_prob\n\n        self.layers = nn.ModuleList([ReformerLayer(config, i) for i in range(config.num_hidden_layers)])\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.layer_norm = nn.LayerNorm(2 * config.hidden_size, eps=config.layer_norm_eps)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        head_mask=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        # hidden_states and attention lists to be filled if wished\n        all_hidden_states = []\n        all_attentions = []\n\n        # concat same tensor for reversible ResNet\n        hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)\n        hidden_states = _ReversibleFunction.apply(\n            hidden_states,\n            self.layers,\n            attention_mask,\n            head_mask,\n            num_hashes,\n            all_hidden_states,\n            all_attentions,\n            do_output_hidden_states,\n            do_output_attentions,\n        )\n\n        # Apply layer norm to concatenated hidden states\n        hidden_states = self.layer_norm(hidden_states)\n\n        # Apply dropout\n        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)\n\n        return ReformerEncoderOutput(\n            hidden_states=hidden_states, all_hidden_states=all_hidden_states, all_attentions=all_attentions\n        )\n\n\nclass ReformerOnlyLMHead(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        # Reformer is using Rev Nets, thus last layer outputs are concatenated and\n        # Layer Norm is done over 2 * hidden_size\n        self.seq_len_dim = 1\n        self.chunk_size_lm_head = config.chunk_size_lm_head\n        self.decoder = nn.Linear(2 * config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, hidden_states):\n        return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n\n    def forward_chunk(self, hidden_states):\n        hidden_states = self.decoder(hidden_states)\n        return hidden_states\n\n\nclass ReformerPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = ReformerConfig\n    base_model_prefix = \"reformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"input_ids\": input_ids,\n            \"attention_mask\": input_mask,\n        }\n        
return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        if isinstance(module, AxialPositionEmbeddings):\n            for weight in module.weights:\n                torch.nn.init.normal_(weight, std=self.config.axial_norm_std)\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n        elif isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        if isinstance(module, nn.Linear) and module.bias is not None:\n            module.bias.data.zero_()\n\n\nREFORMER_START_DOCSTRING = r\"\"\"\n    Reformer was proposed in\n    `Reformer: The Efficient Transformer`_\n    by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.\n\n    .. _`Reformer: The Efficient Transformer`:\n        https://arxiv.org/abs/2001.04451\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.ReformerConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nREFORMER_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            During training the input_ids sequence_length has to be a multiple of the relevant model's\n            chunk lengths (lsh's, local's or both). During evaluation, the indices are automatically\n            padded to be a multiple of the chunk length.\n\n            Indices can be obtained using :class:`transformers1.ReformerTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        num_hashes (:obj:`int`, `optional`, defaults to :obj:`None`):\n            `num_hashes` is the number of hashing rounds that should be performed during\n            bucketing. Setting `num_hashes` overwrites the default `num_hashes` defined\n            in `config.num_hashes`.\n            For more information, see `num_hashes` in :class:`transformers1.ReformerConfig`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Reformer Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    REFORMER_START_DOCSTRING,\n)\nclass ReformerModel(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.config = config\n        assert (\n            self.config.num_hidden_layers > 0\n        ), \"`config.attn_layers` is empty. Select at least one attn layer form ['lsh', 'local']\"\n\n        self.embeddings = ReformerEmbeddings(config)\n        self.encoder = ReformerEncoder(config)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding 
outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModel, ReformerTokenizer\n        import torch\n\n        tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModel.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n\n        # TODO(PVP): delete when PR to change output_attentions is made\n        do_output_attentions = self.config.output_attentions\n        do_output_hidden_states = self.config.output_hidden_states\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()  # noqa: F841\n            device = input_ids.device\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]  # noqa: F841\n            device = inputs_embeds.device\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        assert (\n            len(input_shape) == 2\n        ), \"`input_ids` have be of shape `[batch_size, sequence_length]`, but got shape: {}\".format(input_shape)\n\n        # prepare head mask\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers, is_attention_chunked=True)\n\n        # original sequence length for padding\n        orig_sequence_length = input_shape[-1]\n\n        # if needs padding\n        least_common_mult_chunk_length = _get_least_common_mult_chunk_len(self.config)\n        must_pad_to_match_chunk_length = input_shape[-1] % least_common_mult_chunk_length != 0\n\n        if must_pad_to_match_chunk_length:\n            padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length\n\n            if self.training is True:\n                raise ValueError(\n                    \"If training, sequence Length {} has to be a multiple of least common multiple chunk_length {}. 
Please consider padding the input to a length of {}.\".format(\n                        input_shape[-1], least_common_mult_chunk_length, input_shape[-1] + padding_length\n                    )\n                )\n\n            # pad input\n            input_ids, inputs_embeds, attention_mask, position_ids, input_shape = self._pad_to_mult_of_chunk_length(\n                input_ids,\n                inputs_embeds=inputs_embeds,\n                attention_mask=attention_mask,\n                position_ids=position_ids,\n                input_shape=input_shape,\n                padding_length=padding_length,\n                padded_seq_length=least_common_mult_chunk_length,\n                device=device,\n            )\n\n        embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, inputs_embeds=inputs_embeds)\n\n        encoder_outputs = self.encoder(\n            hidden_states=embedding_output,\n            head_mask=head_mask,\n            attention_mask=attention_mask,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n        sequence_output = encoder_outputs.hidden_states\n\n        # if padding was applied\n        if must_pad_to_match_chunk_length:\n            sequence_output = sequence_output[:, :orig_sequence_length]\n\n        outputs = (sequence_output,)\n        # TODO(PVP): Replace by named tuple after namedtuples are introduced in the library.\n        if do_output_hidden_states is True:\n            outputs = outputs + (encoder_outputs.all_hidden_states,)\n        if do_output_attentions is True:\n            outputs = outputs + (encoder_outputs.all_attentions,)\n        return outputs\n\n    def _pad_to_mult_of_chunk_length(\n        self,\n        input_ids,\n        inputs_embeds=None,\n        attention_mask=None,\n        position_ids=None,\n        input_shape=None,\n        padding_length=None,\n        padded_seq_length=None,\n        device=None,\n    ):\n        logger.info(\n            \"Input ids are automatically padded from {} to {} to be a multiple of `config.chunk_length`: {}\".format(\n                input_shape[-1], input_shape[-1] + padding_length, padded_seq_length\n            )\n        )\n\n        padded_input_ids = torch.full(\n            (input_shape[0], padding_length), self.config.pad_token_id, device=device, dtype=torch.long,\n        )\n\n        # Extend `attention_mask`\n        if attention_mask is not None:\n            attention_mask = torch.cat(\n                [\n                    attention_mask,\n                    torch.zeros(input_shape[0], padding_length, device=device, dtype=attention_mask.dtype,),\n                ],\n                dim=-1,\n            )\n        else:\n            attention_mask = torch.cat(\n                [\n                    torch.ones(input_shape, device=device, dtype=torch.uint8),\n                    torch.zeros((input_shape[0], padding_length), device=device, dtype=torch.uint8),\n                ],\n                dim=-1,\n            )\n\n        # Extend `input_ids` with padding to match least common multiple chunk_length\n        if input_ids is not None:\n            input_ids = torch.cat([input_ids, padded_input_ids], dim=-1)\n            input_shape = input_ids.size()\n\n            # Pad position ids if given\n            if position_ids is not None:\n                padded_position_ids = torch.arange(input_shape[-1], padded_seq_length, 
dtype=torch.long, device=device)\n                padded_position_ids = position_ids.unsqueeze(0).expand(input_shape[0], padding_length)\n                position_ids = torch.cat([position_ids, padded_position_ids], dim=-1)\n\n        # Extend `inputs_embeds` with padding to match least common multiple chunk_length\n        if inputs_embeds is not None:\n            padded_inputs_embeds = self.embeddings(padded_input_ids, position_ids)\n            inputs_embeds = torch.cat([inputs_embeds, padded_inputs_embeds], dim=-2)\n            input_shape = inputs_embeds.size()\n        return input_ids, inputs_embeds, attention_mask, position_ids, input_shape\n\n\n@add_start_docstrings(\"\"\"Reformer Model with a `language modeling` head on top. \"\"\", REFORMER_START_DOCSTRING)\nclass ReformerModelWithLMHead(ReformerPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.reformer = ReformerModel(config)\n        self.lm_head = ReformerOnlyLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def tie_weights(self):\n        # word embeddings are not tied in Reformer\n        pass\n\n    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        position_ids=None,\n        attention_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        num_hashes=None,\n        labels=None,\n        do_output_hidden_states=False,\n        do_output_attentions=False,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``do_output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import ReformerModelWithLMHead, ReformerTokenizer\n        import torch\n\n   
     tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment')\n        model =  ReformerModelWithLMHead.from_pretrained('google/reformer-crime-and-punishment')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=input_ids)\n\n        loss, prediction_scores = outputs[:2]\n        \"\"\"\n\n        reformer_outputs = self.reformer(\n            input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            num_hashes=num_hashes,\n            do_output_hidden_states=do_output_hidden_states,\n            do_output_attentions=do_output_attentions,\n        )\n\n        sequence_output = reformer_outputs[0]\n        logits = self.lm_head(sequence_output)\n        outputs = (logits,) + reformer_outputs[1:]\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            shift_logits = logits[..., :-1, :].contiguous()\n            shift_labels = labels[..., 1:].contiguous()\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))\n            outputs = (loss,) + outputs\n        return outputs  # (lm_loss), lm_logits, (hidden_states), (attentions)\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # TODO(PVP): Add smart caching\n        inputs_dict = {\"input_ids\": input_ids}\n\n        if \"num_hashes\" in kwargs:\n            inputs_dict[\"num_hashes\"] = kwargs[\"num_hashes\"]\n\n        return inputs_dict\n"
  },
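  {
    "path": "docs/revnet_reversible_sketch.py",
    "content": "# NOTE: hypothetical illustration file, not part of the original repository or of\n# HuggingFace transformers. A minimal, self-contained sketch of the reversible\n# residual (RevNet) coupling that ReformerLayer / _ReversibleFunction in\n# modeling_reformer.py rely on to avoid storing per-layer activations:\n#   forward:  Y1 = X1 + f(X2),  Y2 = X2 + g(Y1)\n#   inverse:  X2 = Y2 - g(Y1),  X1 = Y1 - f(X2)\n# Because the layer inputs can be recomputed from its outputs, the backward pass\n# only needs the final (Y1, Y2) instead of every intermediate hidden state; the\n# seed re-use in ReformerLayer exists so that dropout inside f and g produces the\n# same values during this recomputation. Here f and g are deterministic, so the\n# inversion is exact up to floating-point error.\nimport torch\n\n\ndef reversible_forward(x1, x2, f, g):\n    # Couple two activation streams; f and g stand in for attention / feed-forward.\n    y1 = x1 + f(x2)\n    y2 = x2 + g(y1)\n    return y1, y2\n\n\ndef reversible_inverse(y1, y2, f, g):\n    # Recover the layer inputs from its outputs.\n    x2 = y2 - g(y1)\n    x1 = y1 - f(x2)\n    return x1, x2\n\n\nif __name__ == \"__main__\":\n    torch.manual_seed(0)\n    f = torch.nn.Linear(8, 8)\n    g = torch.nn.Linear(8, 8)\n    x1, x2 = torch.randn(2, 8), torch.randn(2, 8)\n    y1, y2 = reversible_forward(x1, x2, f, g)\n    r1, r2 = reversible_inverse(y1, y2, f, g)\n    print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))\n"
  },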
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_bert import BertEmbeddings, BertLayerNorm, BertModel, BertPreTrainedModel, gelu\nfrom .modeling_utils import create_position_ids_from_input_ids\n\n\nlogger = logging.getLogger(__name__)\n\nROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    \"roberta-base-openai-detector\",\n    \"roberta-large-openai-detector\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass RobertaEmbeddings(BertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.padding_idx = config.pad_token_id\n        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=self.padding_idx)\n        self.position_embeddings = nn.Embedding(\n            config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx\n        )\n\n    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. Any padded tokens remain padded.\n                position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx).to(input_ids.device)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super().forward(\n            input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds\n        )\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. 
We cannot infer which are padded so just generate\n        sequential position ids.\n\n        :param torch.Tensor inputs_embeds:\n        :return torch.Tensor:\n        \"\"\"\n        input_shape = inputs_embeds.size()[:-1]\n        sequence_length = input_shape[1]\n\n        position_ids = torch.arange(\n            self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device\n        )\n        return position_ids.unsqueeze(0).expand(input_shape)\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaModel(BertModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.BertModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.embeddings = RobertaEmbeddings(config)\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. \"\"\", ROBERTA_START_DOCSTRING)\nclass RobertaForMaskedLM(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.lm_head = RobertaLMHead(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        masked_lm_labels=None,\n    ):\n        r\"\"\"\n        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the masked language modeling loss.\n            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)\n            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels\n            in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Masked language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary 
token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMaskedLM\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, masked_lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        if masked_lm_labels is not None:\n            loss_fct = CrossEntropyLoss()\n            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))\n            outputs = (masked_lm_loss,) + outputs\n\n        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\nclass RobertaLMHead(nn.Module):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n\n        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)\n        self.bias = nn.Parameter(torch.zeros(config.vocab_size))\n\n        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`\n        self.decoder.bias = self.bias\n\n    def forward(self, features, **kwargs):\n        x = self.dense(features)\n        x = gelu(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x)\n\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForSequenceClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.classifier = RobertaClassificationHead(config)\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForSequenceClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output)\n\n        outputs = 
(logits,) + outputs[2:]\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForMultipleChoice(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        attention_mask=None,\n        labels=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. 
(see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForMultipleChoice\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForMultipleChoice.from_pretrained('roberta-base')\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        outputs = self.roberta(\n            flat_input_ids,\n            position_ids=flat_position_ids,\n            token_type_ids=flat_token_type_ids,\n            attention_mask=flat_attention_mask,\n            head_mask=head_mask,\n        )\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = logits.view(-1, num_choices)\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels)\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForTokenClassification(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import RobertaTokenizer, RobertaForTokenClassification\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are 
here\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n\n\nclass RobertaClassificationHead(nn.Module):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, config.hidden_size)\n        self.dropout = nn.Dropout(config.hidden_dropout_prob)\n        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)\n\n    def forward(self, features, **kwargs):\n        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])\n        x = self.dropout(x)\n        x = self.dense(x)\n        x = torch.tanh(x)\n        x = self.dropout(x)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"Roberta Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass RobertaForQuestionAnswering(BertPreTrainedModel):\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.roberta = RobertaModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape 
:obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        # The checkpoint roberta-large is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        from transformers1 import RobertaTokenizer, RobertaForQuestionAnswering\n        import torch\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = RobertaForQuestionAnswering.from_pretrained('roberta-base')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        input_ids = tokenizer.encode(question, text)\n        start_scores, end_scores = model(torch.tensor([input_ids]))\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])\n\n        \"\"\"\n\n        outputs = self.roberta(\n            input_ids,\n            attention_mask=attention_mask,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, 
end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 Mesh TensorFlow authors, T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch T5 model. \"\"\"\n\n\nimport copy\nimport logging\nimport math\nimport os\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\n####################################################\n# This dict contrains shortcut names and associated url\n# for the pretrained weights provided with the models\n####################################################\nT5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n\n####################################################\n# This is a conversion method from TF 1.0 to PyTorch\n# More details: https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28\n####################################################\ndef load_tf_weights_in_t5(model, config, tf_checkpoint_path):\n    \"\"\" Load tf checkpoints in a pytorch model.\n    \"\"\"\n    try:\n        import re\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    tf_path = os.path.abspath(tf_checkpoint_path)\n    logger.info(\"Converting TensorFlow checkpoint from {}\".format(tf_path))\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    names = []\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        names.append(name)\n        tf_weights[name] = array\n\n    for txt_name in names:\n        name = txt_name.split(\"/\")\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if any(\n            n in [\"adam_v\", \"adam_m\", \"AdamWeightDecayOptimizer\", \"AdamWeightDecayOptimizer_1\", \"global_step\"]\n            for n in name\n        ):\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        if \"_slot_\" in name[-1]:\n            logger.info(\"Skipping {}\".format(\"/\".join(name)))\n            tf_weights.pop(txt_name, None)\n            continue\n        pointer = model\n        array = tf_weights[txt_name]\n        for m_name in name:\n            if re.fullmatch(r\"[A-Za-z]+_\\d+\", m_name):\n                scope_names = re.split(r\"_(\\d+)\", m_name)\n            else:\n                scope_names = [m_name]\n            if scope_names[0] in [\"kernel\", \"scale\", \"embedding\"]:\n                pointer = getattr(pointer, \"weight\")\n            # elif scope_names[0] == 'scale':\n            #     pointer = getattr(pointer, 'weight')\n            # elif scope_names[0] == 'output_bias' or scope_names[0] == 'beta':\n            #     pointer = getattr(pointer, 'bias')\n            # elif scope_names[0] == 'squad':\n            #     pointer = getattr(pointer, 'classifier')\n            else:\n                try:\n                    pointer = getattr(pointer, scope_names[0])\n                except AttributeError:\n                    logger.info(\"Skipping {}\".format(\"/\".join(name)))\n                    continue\n            if len(scope_names) >= 2:\n                num = int(scope_names[1])\n                pointer = pointer[num]\n        if scope_names[0] not in [\"kernel\", \"scale\", \"embedding\"]:\n            pointer = getattr(pointer, \"weight\")\n        if scope_names[0] != \"embedding\":\n            logger.info(\"Transposing numpy weight of shape {} for {}\".format(array.shape, name))\n            array = np.transpose(array)\n        try:\n            assert pointer.shape == array.shape\n        except AssertionError as e:\n            e.args += (pointer.shape, array.shape)\n            raise\n        logger.info(\"Initialize PyTorch weight {}\".format(name))\n        pointer.data = torch.from_numpy(array.astype(np.float32))\n        tf_weights.pop(txt_name, None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    # logger.info(\"Weights not copied to PyTorch model: {}\".format(', '.join(tf_weights.keys())))\n    return model\n\n\n####################################################\n# PyTorch Models are constructed by sub-classing\n# - torch.nn.Module for the layers and\n# - PreTrainedModel for the models (it-self a sub-class of 
torch.nn.Module)\n####################################################\n\n\nclass T5LayerNorm(nn.Module):\n    def __init__(self, hidden_size, eps=1e-6):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no subtraction of mean.\n        \"\"\"\n        super().__init__()\n        self.weight = nn.Parameter(torch.ones(hidden_size))\n        self.variance_epsilon = eps\n\n    def forward(self, x):\n        # layer norm should always be calculated in float32\n        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)\n        x = x / torch.sqrt(variance + self.variance_epsilon)\n\n        if self.weight.dtype == torch.float16:\n            x = x.to(torch.float16)\n        return self.weight * x\n\n\nclass T5DenseReluDense(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)\n        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        h = self.wi(hidden_states)\n        h = F.relu(h)\n        h = self.dropout(h)\n        h = self.wo(h)\n        return h\n\n\nclass T5LayerFF(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.DenseReluDense = T5DenseReluDense(config)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(self, hidden_states):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x)\n        layer_output = hidden_states + self.dropout(y)\n        return layer_output\n\n\nclass T5Attention(nn.Module):\n    def __init__(self, config: T5Config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.dropout = config.dropout_rate\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)\n        self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, self.d_kv)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q = prune_linear_layer(self.q, index)\n        self.k = prune_linear_layer(self.k, index)\n        self.v = prune_linear_layer(self.v, index)\n        self.o = prune_linear_layer(self.o, index, dim=1)\n      
  # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.inner_dim = self.d_kv * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += (n < 0).to(torch.long) * num_buckets  # mtf.to_int32(mtf.less(n, 0)) * num_buckets\n            n = torch.abs(n)\n        else:\n            n = torch.max(n, torch.zeros_like(n))\n        # now n is in the range [0, inf)\n\n        # half of the buckets are for exact increments in positions\n        max_exact = num_buckets // 2\n        is_small = n < max_exact\n\n        # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance\n        val_if_large = max_exact + (\n            torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)\n        ).to(torch.long)\n        val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))\n\n        ret += torch.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = torch.arange(qlen, dtype=torch.long)[:, None]\n        memory_position = torch.arange(klen, dtype=torch.long)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position,  # shape (qlen, klen)\n            bidirectional=not self.is_decoder,\n            num_buckets=self.relative_attention_num_buckets,\n        )\n        rp_bucket = rp_bucket.to(self.relative_attention_bias.weight.device)\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def forward(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        
position_bias=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = input.size()\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + past_key_value_state[0].shape[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = kv.size(1)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, self.d_kv).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.inner_dim)\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = torch.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  
# (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass T5LayerSelfAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.SelfAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5LayerCrossAttention(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.EncDecAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)\n        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n    def forward(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        query_length=None,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            query_length=query_length,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass T5Block(nn.Module):\n    def __init__(self, config, has_relative_attention_bias=False):\n        super().__init__()\n        self.is_decoder = config.is_decoder\n        self.layer = nn.ModuleList()\n        self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n        if self.is_decoder:\n            self.layer.append(T5LayerCrossAttention(config, has_relative_attention_bias=has_relative_attention_bias))\n\n        self.layer.append(T5LayerFF(config))\n\n    def forward(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n    ):\n\n  
      if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. Need to inject it here\n            if present_key_value_state is not None:\n                query_length = present_key_value_state[0].shape[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass T5PreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    load_tf_weights = load_tf_weights_in_t5\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        input_ids = 
torch.tensor(DUMMY_INPUTS)\n        input_mask = torch.tensor(DUMMY_MASK)\n        dummy_inputs = {\n            \"decoder_input_ids\": input_ids,\n            \"input_ids\": input_ids,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights \"\"\"\n        factor = self.config.initializer_factor  # Used for testing weights initialization\n        if isinstance(module, T5LayerNorm):\n            module.weight.data.fill_(factor * 1.0)\n        elif isinstance(module, (T5Model, T5ForConditionalGeneration)):\n            # Mesh TensorFlow embeddings initialization\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624\n            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)\n        elif isinstance(module, T5DenseReluDense):\n            # Mesh TensorFlow FF initialization\n            # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56\n            # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89\n            module.wi.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))\n            if hasattr(module.wi, \"bias\") and module.wi.bias is not None:\n                module.wi.bias.data.zero_()\n            module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))\n            if hasattr(module.wo, \"bias\") and module.wo.bias is not None:\n                module.wo.bias.data.zero_()\n        elif isinstance(module, T5Attention):\n            # Mesh TensorFlow attention initialization to avoid scaling before softmax\n            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136\n            d_model = self.config.d_model\n            d_kv = self.config.d_kv\n            n_heads = self.config.num_heads\n            module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * d_kv) ** -0.5))\n            module.k.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.v.weight.data.normal_(mean=0.0, std=factor * (d_model ** -0.5))\n            module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * d_kv) ** -0.5))\n            if module.has_relative_attention_bias:\n                module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))\n\n    def _shift_right(self, input_ids):\n        decoder_start_token_id = self.config.decoder_start_token_id\n        pad_token_id = self.config.pad_token_id\n\n        assert (\n            decoder_start_token_id is not None\n        ), \"self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. 
See T5 docs for more information\"\n\n        # shift inputs to the right\n        shifted_input_ids = input_ids.new_zeros(input_ids.shape)\n        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()\n        shifted_input_ids[..., 0] = decoder_start_token_id\n\n        assert pad_token_id is not None, \"self.model.config.pad_token_id has to be defined.\"\n        # replace possible -100 values in lm_labels by `pad_token_id`\n        shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)\n\n        assert torch.all(shifted_input_ids >= 0).item(), \"Verify that `lm_labels` has only positive values and -100\"\n\n        return shifted_input_ids\n\n\nclass T5Stack(T5PreTrainedModel):\n    def __init__(self, config, embed_tokens=None):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.block = nn.ModuleList(\n            [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]\n        )\n        self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)\n        self.dropout = nn.Dropout(config.dropout_rate)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embed_tokens = new_embeddings\n\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n    ):\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = input_ids.size()\n            input_ids = input_ids.view(-1, input_shape[-1])\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            if self.is_decoder:\n                raise ValueError(\"You have to specify either decoder_input_ids or decoder_inputs_embeds\")\n            else:\n                raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to initialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(input_ids)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_states\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = past_key_value_states[0][0].shape[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is 
not None:\n            encoder_seq_length = encoder_hidden_states.shape[1]\n            encoder_attention_mask = torch.ones(\n                batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long\n            )\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.num_layers)\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)  # We keep only self-attention weights for now\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as 
a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (presents,) (all hidden states), (all attentions)\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and\n    refer to the PyTorch documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. _`torch.nn.Module`:\n        https://pytorch.org/docs/stable/nn.html#module\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n            To know more on how to prepare :obj:`input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n            instead of all `decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `decoder_past_key_value_states`).\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass T5Model(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = 
copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n            from transformers1 import T5Tokenizer, T5Model\n\n            tokenizer = T5Tokenizer.from_pretrained('t5-small')\n            model = T5Model.from_pretrained('t5-small')\n        
    input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass T5ForConditionalGeneration(T5PreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.model_dim = config.d_model\n\n        self.shared = nn.Embedding(config.vocab_size, config.d_model)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = T5Stack(encoder_config, self.shared)\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = T5Stack(decoder_config, self.shared)\n\n        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def set_input_embeddings(self, new_embeddings):\n        self.shared = new_embeddings\n        self.encoder.set_input_embeddings(new_embeddings)\n        self.decoder.set_input_embeddings(new_embeddings)\n\n    def get_output_embeddings(self):\n        return self.lm_head\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        encoder_outputs=None,\n        decoder_input_ids=None,\n        decoder_attention_mask=None,\n        decoder_past_key_value_states=None,\n        use_cache=True,\n        lm_labels=None,\n        inputs_embeds=None,\n        decoder_inputs_embeds=None,\n        head_mask=None,\n    ):\n        r\"\"\"\n        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n                Labels for computing the sequence classification/regression loss.\n                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.\n                All labels set to ``-100`` are ignored (masked), the loss is only\n                computed for labels in ``[0, ..., config.vocab_size]``\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            If `past_key_value_states` is used only the last prediction_scores of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output 
of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, T5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)\n        loss, prediction_scores = outputs[:2]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = T5ForConditionalGeneration.from_pretrained('t5-small')\n        input_ids = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"pt\")  # Batch size 1\n        outputs = model.generate(input_ids)\n        \"\"\"\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                input_ids=input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:\n            # get decoder inputs from shifting lm labels to the right\n            decoder_input_ids = self._shift_right(lm_labels)\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            assert lm_labels is None, \"Decoder should not use cached key value states when training.\"\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            input_ids=decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0]\n        # Rescale output before projecting on vocab\n        # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586\n        sequence_output = sequence_output * (self.model_dim ** 
-0.5)\n        lm_logits = self.lm_head(sequence_output)\n\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]  # Add hidden states and attention if they are here\n        if lm_labels is not None:\n            loss_fct = CrossEntropyLoss(ignore_index=-100)\n            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))\n            # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666\n            decoder_outputs = (loss,) + decoder_outputs\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"decoder_input_ids\": input_ids,\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (\n                    layer_past_state.index_select(0, beam_idx),\n                )\n\n            assert reordered_layer_past_states[0].shape == layer_past_states[0].shape\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 ALBERT model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_albert import AlbertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertSelfAttention\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"albert-base-v1\",\n    \"albert-large-v1\",\n    \"albert-xlarge-v1\",\n    \"albert-xxlarge-v1\",\n    \"albert-base-v2\",\n    \"albert-large-v2\",\n    \"albert-xlarge-v2\",\n    \"albert-xxlarge-v2\",\n    # See all ALBERT models at https://huggingface.co/models?filter=albert\n]\n\n\nclass TFAlbertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.config.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.config.vocab_size, self.config.embedding_size],\n                initializer=get_initializer(self.config.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, embedding_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n        x = tf.reshape(inputs, [-1, self.config.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n        return tf.reshape(logits, [batch_size, length, self.config.vocab_size])\n\n\nclass TFAlbertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n           
     \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFAlbertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFAlbertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, 
kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFAlbertAttention(TFBertSelfAttention):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.hidden_size = config.hidden_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(input_tensor)[0]\n        mixed_query_layer = self.query(input_tensor)\n        mixed_key_layer = self.key(input_tensor)\n        mixed_value_layer = self.value(input_tensor)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        # (batch size, num_heads, seq_len_q, seq_len_k)\n        attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)\n        # scale attention_scores\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        self_outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n\n        hidden_states = self_outputs[0]\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        attention_output = self.LayerNorm(hidden_states + input_tensor)\n\n        # add attentions if we output them\n        outputs = (attention_output,) 
+ self_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFAlbertAttention(config, name=\"attention\")\n\n        self.ffn = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn\"\n        )\n\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.ffn_output = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"ffn_output\"\n        )\n        self.full_layer_layer_norm = tf.keras.layers.LayerNormalization(\n            epsilon=config.layer_norm_eps, name=\"full_layer_layer_norm\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        ffn_output = self.ffn(attention_outputs[0])\n        ffn_output = self.activation(ffn_output)\n        ffn_output = self.ffn_output(ffn_output)\n\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.full_layer_layer_norm(ffn_output + attention_outputs[0])\n\n        # add attentions if we output them\n        outputs = (hidden_states,) + attention_outputs[1:]\n        return outputs\n\n\nclass TFAlbertLayerGroup(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.albert_layers = [\n            TFAlbertLayer(config, name=\"albert_layers_._{}\".format(i)) for i in range(config.inner_group_num)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        layer_hidden_states = ()\n        layer_attentions = ()\n\n        for layer_index, albert_layer in enumerate(self.albert_layers):\n            layer_output = albert_layer([hidden_states, attention_mask, head_mask[layer_index]], training=training)\n            hidden_states = layer_output[0]\n\n            if self.output_attentions:\n                layer_attentions = layer_attentions + (layer_output[1],)\n\n            if self.output_hidden_states:\n                layer_hidden_states = layer_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (layer_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (layer_attentions,)\n        # last-layer hidden state, (layer hidden states), (layer attentions)\n        return outputs\n\n\nclass TFAlbertTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.config = config\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.embedding_hidden_mapping_in = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            
name=\"embedding_hidden_mapping_in\",\n        )\n        self.albert_layer_groups = [\n            TFAlbertLayerGroup(config, name=\"albert_layer_groups_._{}\".format(i))\n            for i in range(config.num_hidden_groups)\n        ]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        hidden_states = self.embedding_hidden_mapping_in(hidden_states)\n        all_attentions = ()\n\n        if self.output_hidden_states:\n            all_hidden_states = (hidden_states,)\n\n        for i in range(self.config.num_hidden_layers):\n            # Number of layers in a hidden group\n            layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)\n\n            # Index of the hidden group\n            group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))\n\n            layer_group_output = self.albert_layer_groups[group_idx](\n                [\n                    hidden_states,\n                    attention_mask,\n                    head_mask[group_idx * layers_per_group : (group_idx + 1) * layers_per_group],\n                ],\n                training=training,\n            )\n            hidden_states = layer_group_output[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + layer_group_output[-1]\n\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n\n        # last-layer hidden state, (all hidden states), (all attentions)\n        return outputs\n\n\nclass TFAlbertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = AlbertConfig\n    base_model_prefix = \"albert\"\n\n\nclass TFAlbertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        self.dense = tf.keras.layers.Dense(\n            config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        self.decoder_bias = self.add_weight(\n            shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"decoder/bias\"\n        )\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.activation(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        hidden_states = self.decoder(hidden_states, mode=\"linear\") + 
self.decoder_bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFAlbertMainLayer(tf.keras.layers.Layer):\n    config_class = AlbertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFAlbertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFAlbertTransformer(config, name=\"encoder\")\n        self.pooler = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"pooler\",\n        )\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # 
Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output[:, 0])\n\n        # add hidden_states and attentions if they are here\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]\n        # sequence_output, pooled_output, (hidden_states), (attentions)\n        return outputs\n\n\nALBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:\n        https://arxiv.org/abs/1909.11942\n\n    .. _`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Args:\n        config (:class:`~transformers1.AlbertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nALBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.AlbertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? 
<../glossary.html#position-ids>`_\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Albert Model transformer outputing raw hidden-states without any specific head on top.\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertModel(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n        Returns:\n            :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n            last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n                Sequence of hidden-states at the output of the last layer of the model.\n            pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n                Last layer hidden-state of the first token of the sequence (classification token)\n                further processed by a Linear layer and a Tanh activation function. The Linear\n                layer weights are trained from the next sentence prediction (classification)\n                objective during Albert pretraining. 
This output is usually *not* a good summary\n                of the semantic content of the input, you're often better with averaging or pooling\n                the sequence of hidden-states for the whole input sequence.\n            hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n                tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n                of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n                Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n            attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n                tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n                Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n        Examples::\n\n            import tensorflow as tf\n            from transformers1 import AlbertTokenizer, TFAlbertModel\n\n            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n            model = TFAlbertModel.from_pretrained('albert-base-v2')\n            input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n            outputs = model(input_ids)\n            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with two heads on top for pre-training:\n    a `masked language modeling` head and a `sentence order prediction` (classification) head. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForPreTraining(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n        self.sop_classifier = TFAlbertSOPHead(config, name=\"sop_classifier\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n    Examples::\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForPreTraining\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, sop_scores = outputs[:2]\n        \"\"\"\n\n        outputs = self.albert(inputs, **kwargs)\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.predictions(sequence_output)\n        sop_scores = self.sop_classifier(pooled_output, training=kwargs.get(\"training\", False))\n        outputs = (prediction_scores, sop_scores) + outputs[2:]\n        return outputs\n\n\nclass TFAlbertSOPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\",\n        )\n\n    def call(self, pooled_output, training: bool):\n        dropout_pooled_output = self.dropout(pooled_output, training=training)\n        logits = 
self.classifier(dropout_pooled_output)\n        return logits\n\n\n@add_start_docstrings(\"\"\"Albert Model with a `language modeling` head on top. \"\"\", ALBERT_START_DOCSTRING)\nclass TFAlbertForMaskedLM(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name=\"predictions\")\n\n    def get_output_embeddings(self):\n        return self.albert.embeddings\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMaskedLM\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMaskedLM.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.predictions(sequence_output, training=kwargs.get(\"training\", False))\n\n        # Add hidden states and attention if they are here\n        outputs = (prediction_scores,) + outputs[2:]\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForSequenceClassification(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`)\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForSequenceClassification\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.AlbertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForQuestionAnswering\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForQuestionAnswering.from_pretrained('albert-base-v2')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.albert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Albert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    ALBERT_START_DOCSTRING,\n)\nclass TFAlbertForMultipleChoice(TFAlbertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.albert = TFAlbertMainLayer(config, name=\"albert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import AlbertTokenizer, TFAlbertForMultipleChoice\n\n        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')\n        model = TFAlbertForMultipleChoice.from_pretrained('albert-base-v2')\n\n        example1 = [\"This is a context\", \"Is it a context? Yes\"]\n        example2 = [\"This is a context\", \"Is it a context? 
No\"]\n        encoding = tokenizer.batch_encode_plus([example1, example2], return_tensors='tf', truncation_strategy=\"only_first\", pad_to_max_length=True, max_length=128)\n        outputs = model(encoding[\"input_ids\"][None, :])\n        logits = outputs[0]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            print(\"isdict(1)\")\n            input_ids = inputs.get(\"input_ids\")\n            print(input_ids)\n\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.albert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Model class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    GPT2Config,\n    OpenAIGPTConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLNetConfig,\n)\nfrom .configuration_utils import PretrainedConfig\nfrom .modeling_tf_albert import (\n    TFAlbertForMaskedLM,\n    TFAlbertForMultipleChoice,\n    TFAlbertForPreTraining,\n    TFAlbertForQuestionAnswering,\n    TFAlbertForSequenceClassification,\n    TFAlbertModel,\n)\nfrom .modeling_tf_bert import (\n    TFBertForMaskedLM,\n    TFBertForMultipleChoice,\n    TFBertForPreTraining,\n    TFBertForQuestionAnswering,\n    TFBertForSequenceClassification,\n    TFBertForTokenClassification,\n    TFBertModel,\n)\nfrom .modeling_tf_ctrl import TFCTRLLMHeadModel, TFCTRLModel\nfrom .modeling_tf_distilbert import (\n    TFDistilBertForMaskedLM,\n    TFDistilBertForQuestionAnswering,\n    TFDistilBertForSequenceClassification,\n    TFDistilBertForTokenClassification,\n    TFDistilBertModel,\n)\nfrom .modeling_tf_gpt2 import TFGPT2LMHeadModel, TFGPT2Model\nfrom .modeling_tf_openai import TFOpenAIGPTLMHeadModel, TFOpenAIGPTModel\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForQuestionAnswering,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\nfrom .modeling_tf_t5 import TFT5ForConditionalGeneration, TFT5Model\nfrom .modeling_tf_transfo_xl import TFTransfoXLLMHeadModel, TFTransfoXLModel\nfrom .modeling_tf_xlm import (\n    TFXLMForQuestionAnsweringSimple,\n    TFXLMForSequenceClassification,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n)\nfrom .modeling_tf_xlnet import (\n    TFXLNetForQuestionAnsweringSimple,\n    TFXLNetForSequenceClassification,\n    TFXLNetForTokenClassification,\n    TFXLNetLMHeadModel,\n    TFXLNetModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_MODEL_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5Model),\n        (DistilBertConfig, TFDistilBertModel),\n        (AlbertConfig, TFAlbertModel),\n        (RobertaConfig, TFRobertaModel),\n        (BertConfig, TFBertModel),\n        (OpenAIGPTConfig, TFOpenAIGPTModel),\n        (GPT2Config, TFGPT2Model),\n        (TransfoXLConfig, TFTransfoXLModel),\n        (XLNetConfig, TFXLNetModel),\n        (XLMConfig, TFXLMModel),\n        (CTRLConfig, TFCTRLModel),\n    ]\n)\n\nTF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForPreTraining),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForPreTraining),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n     
   (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(\n    [\n        (T5Config, TFT5ForConditionalGeneration),\n        (DistilBertConfig, TFDistilBertForMaskedLM),\n        (AlbertConfig, TFAlbertForMaskedLM),\n        (RobertaConfig, TFRobertaForMaskedLM),\n        (BertConfig, TFBertForMaskedLM),\n        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),\n        (GPT2Config, TFGPT2LMHeadModel),\n        (TransfoXLConfig, TFTransfoXLLMHeadModel),\n        (XLNetConfig, TFXLNetLMHeadModel),\n        (XLMConfig, TFXLMWithLMHeadModel),\n        (CTRLConfig, TFCTRLLMHeadModel),\n    ]\n)\n\nTF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForSequenceClassification),\n        (AlbertConfig, TFAlbertForSequenceClassification),\n        (RobertaConfig, TFRobertaForSequenceClassification),\n        (BertConfig, TFBertForSequenceClassification),\n        (XLNetConfig, TFXLNetForSequenceClassification),\n        (XLMConfig, TFXLMForSequenceClassification),\n    ]\n)\n\nTF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(\n    [(BertConfig, TFBertForMultipleChoice), (AlbertConfig, TFAlbertForMultipleChoice)]\n)\n\nTF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForQuestionAnswering),\n        (AlbertConfig, TFAlbertForQuestionAnswering),\n        (RobertaConfig, TFRobertaForQuestionAnswering),\n        (BertConfig, TFBertForQuestionAnswering),\n        (XLNetConfig, TFXLNetForQuestionAnsweringSimple),\n        (XLMConfig, TFXLMForQuestionAnsweringSimple),\n    ]\n)\n\nTF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(\n    [\n        (DistilBertConfig, TFDistilBertForTokenClassification),\n        (RobertaConfig, TFRobertaForTokenClassification),\n        (BertConfig, TFBertForTokenClassification),\n        (XLNetConfig, TFXLNetForTokenClassification),\n    ]\n)\n\n\nclass TFAutoModel(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModel` is a generic model class\n        that will be instantiated as one of the base model classes of the library\n        when created with the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModel is designed to be instantiated \"\n            \"using the `TFAutoModel.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModel.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: TFDistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: TFRobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: TFBertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: TFOpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: TFGPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: TFCTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TFTransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: TFXLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: TFXLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModel.from_config(config)  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5Model (T5 model)\n            - `distilbert`: TFDistilBertModel (DistilBERT model)\n            - `roberta`: TFRobertaModel (RoBERTa model)\n            - `bert`: TFBertModel (Bert model)\n            - `openai-gpt`: TFOpenAIGPTModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2Model (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLModel (Transformer-XL model)\n            - `xlnet`: TFXLNetModel (XLNet model)\n            - `xlm`: TFXLMModel (XLM model)\n            - `ctrl`: TFCTRLModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuration. 
Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModel.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModel.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForPreTraining(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForPreTraining` is a generic model class\n        that will be instantiated as one of the model classes of the library -with the architecture used for pretraining this model– when created with the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForPreTraining is designed to be instantiated \"\n            \"using the `TFAutoModelForPreTraining.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForPreTraining.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config (:class:`~transformers.PretrainedConfig`):\n                The model class to instantiate is selected based on the configuration class:\n\n                - isInstance of `distilbert` configuration class: :class:`~transformers1.TFDistilBertModelForMaskedLM` (DistilBERT model)\n                - isInstance of `roberta` configuration class: :class:`~transformers1.TFRobertaModelForMaskedLM` (RoBERTa model)\n                - isInstance of `bert` configuration class: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n                - isInstance of `openai-gpt` configuration class: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n                - isInstance of `gpt2` configuration class: :class:`~transformers1.TFGPT2ModelLMHeadModel` (OpenAI GPT-2 model)\n                - isInstance of `ctrl` configuration class: :class:`~transformers1.TFCTRLModelLMHeadModel` (Salesforce CTRL  model)\n                - isInstance of `transfo-xl` configuration class: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n                - isInstance of `xlnet` configuration class: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet model)\n                - isInstance of `xlm` configuration class: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the model classes of the library -with the architecture used for pretraining this model– from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: :class:`~transformers1.TFT5ModelWithLMHead` (T5 model)\n            - `distilbert`: :class:`~transformers1.TFDistilBertForMaskedLM` (DistilBERT model)\n            - `albert`: :class:`~transformers1.TFAlbertForPreTraining` (ALBERT model)\n            - `roberta`: :class:`~transformers1.TFRobertaForMaskedLM` (RoBERTa model)\n            - `bert`: :class:`~transformers1.TFBertForPreTraining` (Bert model)\n            - `openai-gpt`: :class:`~transformers1.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)\n            - `gpt2`: :class:`~transformers1.TFGPT2LMHeadModel` (OpenAI GPT-2 model)\n            - `transfo-xl`: :class:`~transformers1.TFTransfoXLLMHeadModel` (Transformer-XL model)\n            - `xlnet`: :class:`~transformers1.TFXLNetLMHeadModel` (XLNet 
model)\n            - `xlm`: :class:`~transformers1.TFXLMWithLMHeadModel` (XLM model)\n            - `ctrl`: :class:`~transformers1.TFCTRLLMHeadModel` (Salesforce CTRL model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Args:\n            pretrained_model_name_or_path:\n                Either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely received file. 
Attempt to resume the download if such a file exists.\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model.\n                (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or\n                automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the\n                  underlying model's ``__init__`` method (we assume all relevant updates to the configuration have\n                  already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class\n                  initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of\n                  ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute\n                  with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration\n                  attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForPreTraining.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_PRETRAINING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of AutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_FOR_PRETRAINING_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelWithLMHead(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelWithLMHead` is a generic model class\n        that will be instantiated as one of the language modeling model classes of the library\n        when created with the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelWithLMHead is designed to be instantiated \"\n            \"using the `TFAutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelWithLMHead.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `openai-gpt` configuration class: OpenAIGPTModel (OpenAI GPT model)\n                    - isInstance of `gpt2` configuration class: GPT2Model (OpenAI GPT-2 model)\n                    - isInstance of `ctrl` configuration class: CTRLModel (Salesforce CTRL  model)\n                    - isInstance of `transfo-xl` configuration class: TransfoXLModel (Transformer-XL model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the language modeling model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: TFT5ForConditionalGeneration (T5 model)\n            - `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)\n            - `roberta`: TFRobertaForMaskedLM (RoBERTa model)\n            - `bert`: TFBertForMaskedLM (Bert model)\n            - `openai-gpt`: TFOpenAIGPTLMHeadModel (OpenAI GPT model)\n            - `gpt2`: TFGPT2LMHeadModel (OpenAI GPT-2 model)\n            - `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)\n            - `xlnet`: TFXLNetLMHeadModel (XLNet model)\n            - `xlm`: TFXLMWithLMHeadModel (XLM model)\n            - `ctrl`: TFCTRLLMHeadModel (CTRL model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights 
saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelWithLMHead.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelWithLMHead.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, cls.__name__, \", \".join(c.__name__ for c in TF_MODEL_WITH_LM_HEAD_MAPPING.keys())\n            )\n        )\n\n\nclass TFAutoModelForMultipleChoice:\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForMultipleChoice` is a generic model class\n        that will be instantiated as one of the multiple choice model classes of the library\n        when created with the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFAlbertForMultipleChoice (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForMultipleChoice is designed to be instantiated \"\n            \"using the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or \"\n            
\"`TFAutoModelForMultipleChoice.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `albert` configuration class: AlbertModel (Albert model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForMulitpleChoice.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the multiple choice model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `albert`: TFRobertaForMultiple (Albert model)\n            - `bert`: TFBertForMultipleChoice (Bert model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). 
In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForMultipleChoice.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForMultipleChoice.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForSequenceClassification(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForSequenceClassification` is a generic model class\n        that will be instantiated as one of the sequence classification model classes of the library\n        when created with the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForSequenceClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForSequenceClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the 
base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the sequence classification model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForSequenceClassification (DistilBERT model)\n            - `roberta`: TFRobertaForSequenceClassification (RoBERTa model)\n            - `bert`: TFBertForSequenceClassification (Bert model)\n            - `xlnet`: TFXLNetForSequenceClassification (XLNet model)\n            - `xlm`: TFXLMForSequenceClassification (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a 
`PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). 
Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForSequenceClassification.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForQuestionAnswering(object):\n    r\"\"\"\n        :class:`~transformers1.TFAutoModelForQuestionAnswering` is a generic model class\n        that will be instantiated as one of the question answering model classes of the library\n        when created with the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        This class cannot be instantiated using `__init__()` (throws an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForQuestionAnswering is designed to be instantiated \"\n            \"using the `TFAutoModelForQuestionAnswering.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`TFAutoModelForQuestionAnswering.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def 
from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBERT model)\n                    - isInstance of `albert` configuration class: AlbertModel (ALBERT model)\n                    - isInstance of `roberta` configuration class: RobertaModel (RoBERTa model)\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `xlm` configuration class: XLMModel (XLM model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `distilbert`: TFDistilBertForQuestionAnswering (DistilBERT model)\n            - `albert`: TFAlbertForQuestionAnswering (ALBERT model)\n            - `roberta`: TFRobertaForQuestionAnswering (RoBERTa model)\n            - `bert`: TFBertForQuestionAnswering (Bert model)\n            - `xlnet`: TFXLNetForQuestionAnswering (XLNet model)\n            - `xlm`: TFXLMForQuestionAnswering (XLM model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path 
to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.\n\n            from_pt: (`Optional`) Boolean\n                Set to True if the Checkpoint is a PyTorch checkpoint.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). 
Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForQuestionAnswering.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForQuestionAnswering.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys()),\n            )\n        )\n\n\nclass TFAutoModelForTokenClassification:\n    def __init__(self):\n        raise EnvironmentError(\n            \"TFAutoModelForTokenClassification is designed to be instantiated \"\n            \"using the `TFAutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path)` or \"\n            \"`AutoModelForTokenClassification.from_config(config)` methods.\"\n        )\n\n    @classmethod\n    def from_config(cls, config):\n        r\"\"\" Instantiates one of the base model classes of the library\n        from a configuration.\n\n        Note:\n            Loading a model from its configuration file does **not** load the model weights.\n            It only affects the model's configuration. 
Use :func:`~transformers1.AutoModel.from_pretrained` to load\n            the model weights\n\n        Args:\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                The model class to instantiate is selected based on the configuration class:\n                    - isInstance of `bert` configuration class: BertModel (Bert model)\n                    - isInstance of `xlnet` configuration class: XLNetModel (XLNet model)\n                    - isInstance of `distilbert` configuration class: DistilBertModel (DistilBert model)\n                    - isInstance of `roberta` configuration class: RobteraModel (Roberta model)\n\n        Examples::\n\n            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n        \"\"\"\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class(config)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\" Instantiates one of the question answering model classes of the library\n        from a pre-trained model configuration.\n\n        The `from_pretrained()` method takes care of returning the correct model class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `bert`: BertForTokenClassification (Bert model)\n            - `xlnet`: XLNetForTokenClassification (XLNet model)\n            - `distilbert`: DistilBertForTokenClassification (DistilBert model)\n            - `roberta`: RobertaForTokenClassification (Roberta model)\n\n        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with `model.train()`\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) instance of a class derived from :class:`~transformers1.PretrainedConfig`:\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. 
Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = TFAutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = TFAutoModelForTokenClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')\n            model = TFAutoModelForTokenClassification.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        for config_class, model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.items():\n            if isinstance(config, config_class):\n                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)\n        raise ValueError(\n            \"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__,\n                cls.__name__,\n                \", \".join(c.__name__ for c in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys()),\n            )\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 BERT model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_bert import BertConfig\nfrom .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"bert-base-uncased\",\n    \"bert-large-uncased\",\n    \"bert-base-cased\",\n    \"bert-large-cased\",\n    \"bert-base-multilingual-uncased\",\n    \"bert-base-multilingual-cased\",\n    \"bert-base-chinese\",\n    \"bert-base-german-cased\",\n    \"bert-large-uncased-whole-word-masking\",\n    \"bert-large-cased-whole-word-masking\",\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\",\n    \"bert-large-cased-whole-word-masking-finetuned-squad\",\n    \"bert-base-cased-finetuned-mrpc\",\n    \"cl-tohoku/bert-base-japanese\",\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\",\n    \"cl-tohoku/bert-base-japanese-char\",\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\",\n    \"TurkuNLP/bert-base-finnish-cased-v1\",\n    \"TurkuNLP/bert-base-finnish-uncased-v1\",\n    \"wietsedv/bert-base-dutch-cased\",\n    # See all BERT models at https://huggingface.co/models?filter=bert\n]\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n    \"gelu_new\": tf.keras.layers.Activation(gelu_new),\n}\n\n\nclass TFBertEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.hidden_size = config.hidden_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.hidden_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.hidden_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = 
self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFBertSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        if config.hidden_size % config.num_attention_heads != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.hidden_size, config.num_attention_heads)\n            )\n        self.output_attentions = config.output_attentions\n\n        self.num_attention_heads = config.num_attention_heads\n        assert config.hidden_size % config.num_attention_heads == 0\n        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n\n        self.query = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"query\"\n        )\n        self.key = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"key\"\n        )\n        self.value = tf.keras.layers.Dense(\n            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name=\"value\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)\n\n    def transpose_for_scores(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))\n        return tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        batch_size = shape_list(hidden_states)[0]\n        mixed_query_layer = self.query(hidden_states)\n        mixed_key_layer = self.key(hidden_states)\n        mixed_value_layer = self.value(hidden_states)\n\n        query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)\n        key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)\n        value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)\n\n        # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n        attention_scores = tf.matmul(\n            query_layer, key_layer, transpose_b=True\n        )  # (batch size, num_heads, seq_len_q, seq_len_k)\n        dk = tf.cast(shape_list(key_layer)[-1], tf.float32)  # scale attention_scores\n        attention_scores = attention_scores / tf.math.sqrt(dk)\n\n        if attention_mask is not None:\n            # Apply the attention mask is (precomputed for all layers in TFBertModel call() 
function)\n            attention_scores = attention_scores + attention_mask\n\n        # Normalize the attention scores to probabilities.\n        attention_probs = tf.nn.softmax(attention_scores, axis=-1)\n\n        # This is actually dropping out entire tokens to attend to, which might\n        # seem a bit unusual, but is taken from the original Transformer paper.\n        attention_probs = self.dropout(attention_probs, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attention_probs = attention_probs * head_mask\n\n        context_layer = tf.matmul(attention_probs, value_layer)\n\n        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])\n        context_layer = tf.reshape(\n            context_layer, (batch_size, -1, self.all_head_size)\n        )  # (batch_size, seq_len_q, all_head_size)\n\n        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)\n        return outputs\n\n\nclass TFBertSelfOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.self_attention = TFBertSelfAttention(config, name=\"self\")\n        self.dense_output = TFBertSelfOutput(config, name=\"output\")\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        input_tensor, attention_mask, head_mask = inputs\n\n        self_outputs = self.self_attention([input_tensor, attention_mask, head_mask], training=training)\n        attention_output = self.dense_output([self_outputs[0], input_tensor], training=training)\n        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertIntermediate(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.intermediate_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.intermediate_act_fn = config.hidden_act\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.intermediate_act_fn(hidden_states)\n        return hidden_states\n\n\nclass TFBertOutput(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n      
  self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def call(self, inputs, training=False):\n        hidden_states, input_tensor = inputs\n\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n        hidden_states = self.LayerNorm(hidden_states + input_tensor)\n        return hidden_states\n\n\nclass TFBertLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.attention = TFBertAttention(config, name=\"attention\")\n        self.intermediate = TFBertIntermediate(config, name=\"intermediate\")\n        self.bert_output = TFBertOutput(config, name=\"output\")\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)\n        attention_output = attention_outputs[0]\n        intermediate_output = self.intermediate(attention_output)\n        layer_output = self.bert_output([intermediate_output, attention_output], training=training)\n        outputs = (layer_output,) + attention_outputs[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFBertEncoder(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.layer = [TFBertLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.num_hidden_layers)]\n\n    def call(self, inputs, training=False):\n        hidden_states, attention_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = layer_outputs[0]\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[1],)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # outputs, (hidden states), (attentions)\n\n\nclass TFBertPooler(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n\n    def call(self, hidden_states):\n        # We \"pool\" the model by simply taking the hidden state corresponding\n        # to the first token.\n        first_token_tensor = hidden_states[:, 0]\n        pooled_output = self.dense(first_token_tensor)\n        return pooled_output\n\n\nclass TFBertPredictionHeadTransform(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        
super().__init__(**kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        if isinstance(config.hidden_act, str):\n            self.transform_act_fn = ACT2FN[config.hidden_act]\n        else:\n            self.transform_act_fn = config.hidden_act\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n\n    def call(self, hidden_states):\n        hidden_states = self.dense(hidden_states)\n        hidden_states = self.transform_act_fn(hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n        return hidden_states\n\n\nclass TFBertLMPredictionHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.transform = TFBertPredictionHeadTransform(config, name=\"transform\")\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.transform(hidden_states)\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\nclass TFBertMLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.predictions = TFBertLMPredictionHead(config, input_embeddings, name=\"predictions\")\n\n    def call(self, sequence_output):\n        prediction_scores = self.predictions(sequence_output)\n        return prediction_scores\n\n\nclass TFBertNSPHead(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.seq_relationship = tf.keras.layers.Dense(\n            2, kernel_initializer=get_initializer(config.initializer_range), name=\"seq_relationship\"\n        )\n\n    def call(self, pooled_output):\n        seq_relationship_score = self.seq_relationship(pooled_output)\n        return seq_relationship_score\n\n\n@keras_serializable\nclass TFBertMainLayer(tf.keras.layers.Layer):\n    config_class = BertConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFBertEmbeddings(config, name=\"embeddings\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.pooler = TFBertPooler(config, name=\"pooler\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        
training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n        encoder_outputs = self.encoder([embedding_output, extended_attention_mask, 
head_mask], training=training)\n\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output)\n\n        outputs = (sequence_output, pooled_output,) + encoder_outputs[\n            1:\n        ]  # add hidden_states and attentions if they are here\n        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)\n\n\nclass TFBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = BertConfig\n    base_model_prefix = \"bert\"\n\n\nBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.BertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertModel(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. 
This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertModel\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertModel.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with two heads on top as done during the pre-training:\n    a `masked language modeling` head and a `next sentence prediction (classification)` head. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForPreTraining(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        seq_relationship_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`):\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForPreTraining\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForPreTraining.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, seq_relationship_scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output, pooled_output = outputs[:2]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (prediction_scores, seq_relationship_score,) + outputs[\n            2:\n        ]  # add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\"\"\"Bert Model with a `language modeling` head on top. 
\"\"\", BERT_START_DOCSTRING)\nclass TFBertForMaskedLM(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.mlm = TFBertMLMHead(config, self.bert.embeddings, name=\"mlm___cls\")\n\n    def get_output_embeddings(self):\n        return self.bert.embeddings\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMaskedLM\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMaskedLM.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.mlm(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a `next sentence prediction (classification)` head on top. 
\"\"\", BERT_START_DOCSTRING,\n)\nclass TFBertForNextSentencePrediction(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.nsp = TFBertNSPHead(config, name=\"nsp___cls\")\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        seq_relationship_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, 2)`)\n            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForNextSentencePrediction\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForNextSentencePrediction.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        next_sentence = \"The sky is blue due to the shorter wavelength of blue light.\"\n        encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='tf')\n\n        logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]\n        assert logits[0][0] < logits[0][1] # the next sentence was random\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n        seq_relationship_score = self.nsp(pooled_output)\n\n        outputs = (seq_relationship_score,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # seq_relationship_score, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForSequenceClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForSequenceClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(pooled_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForMultipleChoice(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            1, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`:\n            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForMultipleChoice\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')\n\n        prompt = \"In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced.\"\n        choice0 = \"It is eaten with a fork and a knife.\"\n        choice1 = \"It is eaten while held in the hand.\"\n        encoding = tokenizer.batch_encode_plus([[prompt, choice0], [prompt, choice1]], return_tensors='tf', pad_to_max_length=True)\n\n        # linear classifier on the output is not yet trained\n        outputs = model(encoding['input_ids'][None, :])\n        logits = outputs[0]\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = 
inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            num_choices = shape_list(input_ids)[1]\n            seq_length = shape_list(input_ids)[2]\n        else:\n            num_choices = shape_list(inputs_embeds)[1]\n            seq_length = shape_list(inputs_embeds)[2]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        outputs = self.bert(flat_inputs, training=training)\n\n        pooled_output = outputs[1]\n\n        pooled_output = self.dropout(pooled_output, training=training)\n        logits = self.classifier(pooled_output)\n        reshaped_logits = tf.reshape(logits, (-1, num_choices))\n\n        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # reshaped_logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForTokenClassification(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForTokenClassification\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForTokenClassification.from_pretrained('bert-base-uncased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    BERT_START_DOCSTRING,\n)\nclass TFBertForQuestionAnswering(TFBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.bert = TFBertMainLayer(config, name=\"bert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.BertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import BertTokenizer, TFBertForQuestionAnswering\n\n        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n        model = TFBertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')\n\n        question, text = \"Who was Jim Henson?\", \"Jim Henson was a nice puppet\"\n        encoding = tokenizer.encode_plus(question, text)\n        input_ids, token_type_ids = encoding[\"input_ids\"], encoding[\"token_type_ids\"]\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :], token_type_ids=tf.constant(token_type_ids)[None, :])\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(tf.squeeze(start_scores)) : tf.math.argmax(tf.squeeze(end_scores))+1])\n        assert answer == \"a nice puppet\"\n\n        \"\"\"\n        outputs = self.bert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CamemBERT model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_camembert import CamembertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all CamemBERT models at https://huggingface.co/models?filter=camembert\n]\n\n\nCAMEMBERT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CamembertConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a `language modeling` head on top. \"\"\", CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n\n\n@add_start_docstrings(\n    \"\"\"CamemBERT Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    CAMEMBERT_START_DOCSTRING,\n)\nclass TFCamembertForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = CamembertConfig\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 CTRL model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_ctrl import CTRLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"ctrl\"\n    # See all CTRL models at https://huggingface.co/models?filter=ctrl\n]\n\n\ndef angle_defn(pos, i, d_model_size):\n    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model_size))\n    return pos * angle_rates\n\n\ndef positional_encoding(position, d_model_size):\n    # create the sinusoidal pattern for the positional encoding\n    angle_rads = angle_defn(np.arange(position)[:, np.newaxis], np.arange(d_model_size)[np.newaxis, :], d_model_size)\n\n    sines = np.sin(angle_rads[:, 0::2])\n    cosines = np.cos(angle_rads[:, 1::2])\n\n    # pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)\n    pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)\n    return pos_encoding\n\n\ndef scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):\n    # calculate attention\n    matmul_qk = tf.matmul(q, k, transpose_b=True)\n\n    dk = tf.cast(shape_list(k)[-1], tf.float32)\n    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n\n    if mask is not None:\n        scaled_attention_logits += mask * -1e4\n\n    if attention_mask is not None:\n        # Apply the attention mask\n        scaled_attention_logits = scaled_attention_logits + attention_mask\n\n    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)\n\n    # Mask heads if we want to\n    if head_mask is not None:\n        attention_weights = attention_weights * head_mask\n\n    output = tf.matmul(attention_weights, v)\n\n    return output, attention_weights\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n    def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = output_attentions\n        self.num_heads = num_heads\n        self.d_model_size = d_model_size\n\n        self.depth = int(d_model_size / self.num_heads)\n\n        self.Wq = tf.keras.layers.Dense(d_model_size, name=\"Wq\")\n        self.Wk = tf.keras.layers.Dense(d_model_size, name=\"Wk\")\n        self.Wv = tf.keras.layers.Dense(d_model_size, name=\"Wv\")\n\n        self.dense = tf.keras.layers.Dense(d_model_size, name=\"dense\")\n\n    def split_into_heads(self, x, batch_size):\n        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n        return 
tf.transpose(x, perm=[0, 2, 1, 3])\n\n    def call(self, inputs, training=False):\n        v, k, q, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        batch_size = shape_list(q)[0]\n\n        q = self.Wq(q)\n        k = self.Wk(k)\n        v = self.Wv(v)\n\n        q = self.split_into_heads(q, batch_size)\n        k = self.split_into_heads(k, batch_size)\n        v = self.split_into_heads(v, batch_size)\n\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            k = tf.concat((past_key, k), axis=-2)\n            v = tf.concat((past_value, v), axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack((k, v), axis=0)\n        else:\n            present = (None,)\n\n        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)\n        scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])\n        attn = output[1]\n        original_size_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model_size))\n        output = self.dense(original_size_attention)\n\n        outputs = (output, present)\n        if self.output_attentions:\n            outputs = outputs + (attn,)\n        return outputs\n\n\ndef point_wise_feed_forward_network(d_model_size, dff, name=\"\"):\n    return tf.keras.Sequential(\n        [tf.keras.layers.Dense(dff, activation=\"relu\", name=\"0\"), tf.keras.layers.Dense(d_model_size, name=\"2\")],\n        name=\"ffn\",\n    )\n\n\nclass TFEncoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.multi_head_attention = TFMultiHeadAttention(\n            d_model_size, num_heads, output_attentions, name=\"multi_head_attention\"\n        )\n        self.ffn = point_wise_feed_forward_network(d_model_size, dff, name=\"ffn\")\n\n        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm1\")\n        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layernorm2\")\n\n        self.dropout1 = tf.keras.layers.Dropout(rate)\n        self.dropout2 = tf.keras.layers.Dropout(rate)\n\n    def call(self, inputs, training=False):\n        x, mask, layer_past, attention_mask, head_mask, use_cache = inputs\n        normed = self.layernorm1(x)\n        attn_outputs = self.multi_head_attention(\n            [normed, normed, normed, mask, layer_past, attention_mask, head_mask, use_cache], training=training\n        )\n        attn_output = attn_outputs[0]\n        attn_output = self.dropout1(attn_output, training=training)\n        out1 = x + attn_output\n\n        out2 = self.layernorm2(out1)\n        ffn_output = self.ffn(out2)\n        ffn_output = self.dropout2(ffn_output, training=training)\n        out2 = out1 + ffn_output\n\n        outputs = (out2,) + attn_outputs[1:]\n        return outputs\n\n\n@keras_serializable\nclass TFCTRLMainLayer(tf.keras.layers.Layer):\n    config_class = CTRLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        
self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n\n        self.d_model_size = config.n_embd\n        self.num_layers = config.n_layer\n\n        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)\n\n        self.w = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"w\"\n        )\n\n        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [\n            TFEncoderLayer(\n                config.n_embd,\n                config.n_head,\n                config.dff,\n                config.resid_pdrop,\n                config.layer_norm_epsilon,\n                config.output_attentions,\n                name=\"h_._{}\".format(i),\n            )\n            for i in range(config.n_layer)\n        ]\n        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"layernorm\")\n\n    def get_input_embeddings(self):\n        return self.w\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # If using past key value states, only the last tokens\n        # should be given as an input\n        if past is not None:\n            if input_ids is not None:\n                input_ids = input_ids[:, -1:]\n            if inputs_embeds is not None:\n                inputs_embeds = inputs_embeds[:, -1:]\n            if token_type_ids is not None:\n                token_type_ids = token_type_ids[:, -1:]\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids 
and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n            position_ids = tf.tile(position_ids, [input_shape[0], 1])\n\n        # Attention mask.\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # head_mask has shape n_layer x batch x n_heads x N x N\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_layers\n\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.w(token_type_ids, mode=\"embedding\")\n            token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n        else:\n            token_type_embeds = 0\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.w(input_ids, mode=\"embedding\")\n        seq_len = input_shape[-1]\n        mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)\n\n        inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))\n\n        pos_embeds = tf.gather(self.pos_encoding, position_ids)\n\n        hidden_states = inputs_embeds + pos_embeds + token_type_embeds\n\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n        presents = ()\n        all_hidden_states = ()\n        all_attentions = []\n        for i, (h, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + 
(tf.reshape(hidden_states, output_shape),)\n            outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n            hidden_states, present = outputs[:2]\n\n            if use_cache is True:\n                presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.layernorm(hidden_states)\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs\n\n\nclass TFCTRLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = CTRLConfig\n    base_model_prefix = \"transformer\"\n\n\nCTRL_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.CTRLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nCTRL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only input_ids that do not have their past calculated should 
be passed as input_ids (see `past`).\n\n            Indices can be obtained using :class:`transformers1.CTRLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `past` key value states are returned and\n            can be used to speed up decoding (see `past`). 
Defaults to `True`.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLModel.from_pretrained('ctrl')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFCTRLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def 
call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The CTRL Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    CTRL_START_DOCSTRING,\n)\nclass TFCTRLLMHeadModel(TFCTRLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFCTRLMainLayer(config, name=\"transformer\")\n\n        self.lm_head = TFCTRLLMHead(config, self.transformer.w, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(CTRL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.CTRLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import CTRLTokenizer, TFCTRLLMHeadModel\n\n        tokenizer = CTRLTokenizer.from_pretrained('ctrl')\n        model = TFCTRLLMHeadModel.from_pretrained('ctrl')\n\n        input_ids = tf.constant([tokenizer.encode(\"Links Hello, my dog is cute\", add_special_tokens=True)])\n        outputs = model(input_ids)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.lm_head(hidden_states)\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 DistilBERT model\n\"\"\"\n\n\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_distilbert import DistilBertConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"distilbert-base-uncased\",\n    \"distilbert-base-uncased-distilled-squad\",\n    \"distilbert-base-cased\",\n    \"distilbert-base-cased-distilled-squad\",\n    \"distilbert-base-multilingual-cased\",\n    \"distilbert-base-uncased-finetuned-sst-2-english\",\n    # See all DistilBERT models at https://huggingface.co/models?filter=distilbert\n]\n\n\n# UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef gelu_new(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFEmbeddings(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dim = config.dim\n        self.initializer_range = config.initializer_range\n        self.word_embeddings = TFSharedEmbeddings(\n            config.vocab_size, config.dim, initializer_range=config.initializer_range, name=\"word_embeddings\"\n        )  # padding_idx=0)\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.dim,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"position_embeddings\",\n        )\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create 
and initialize weights. The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\", shape=[self.vocab_size, self.dim], initializer=get_initializer(self.initializer_range)\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, inputs_embeds=None, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, inputs_embeds=inputs_embeds, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, inputs_embeds=None, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        input_ids: tf.Tensor(bs, max_seq_length)\n            The token ids to embed.\n\n        Outputs\n        -------\n        embeddings: tf.Tensor(bs, max_seq_length, dim)\n            The embedded tokens (plus position embeddings, no token_type embeddings)\n        \"\"\"\n        if not isinstance(inputs, (tuple, list)):\n            input_ids = inputs\n            position_ids = None\n        else:\n            input_ids, position_ids = inputs\n\n        if input_ids is not None:\n            seq_length = shape_list(input_ids)[1]\n        else:\n            seq_length = shape_list(inputs_embeds)[1]\n\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)\n\n        embeddings = inputs_embeds + position_embeddings  # (bs, max_seq_length, dim)\n        embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)\n        embeddings = self.dropout(embeddings, training=training)  # (bs, max_seq_length, dim)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.dim])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFMultiHeadSelfAttention(tf.keras.layers.Layer):\n    
def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.output_attentions = config.output_attentions\n\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"q_lin\"\n        )\n        self.k_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"k_lin\"\n        )\n        self.v_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"v_lin\"\n        )\n        self.out_lin = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"out_lin\"\n        )\n\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        query: tf.Tensor(bs, seq_length, dim)\n        key: tf.Tensor(bs, seq_length, dim)\n        value: tf.Tensor(bs, seq_length, dim)\n        mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            Attention weights\n        context: tf.Tensor(bs, seq_length, dim)\n            Contextualized layer. Optional: only if `output_attentions=True`\n        \"\"\"\n        query, key, value, mask, head_mask = inputs\n        bs, q_length, dim = shape_list(query)\n        k_length = shape_list(key)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        # assert key.size() == value.size()\n\n        dim_per_head = self.dim // self.n_heads\n\n        mask_reshape = [bs, 1, 1, k_length]\n\n        def shape(x):\n            \"\"\" separate heads \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\" group heads \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)\n        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)\n        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, q_length, k_length)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))            # (bs, n_heads, q_length, k_length)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, q_length, dim)\n        context = self.out_lin(context)  # (bs, q_length, dim)\n\n        if self.output_attentions:\n            return (context, 
weights)\n        else:\n            return (context,)\n\n\nclass TFFFN(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.lin1 = tf.keras.layers.Dense(\n            config.hidden_dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin1\"\n        )\n        self.lin2 = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"lin2\"\n        )\n        assert config.activation in [\"relu\", \"gelu\"], \"activation ({}) must be in ['relu', 'gelu']\".format(\n            config.activation\n        )\n        self.activation = (\n            tf.keras.layers.Activation(gelu) if config.activation == \"gelu\" else tf.keras.activations.relu\n        )\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.activation(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFTransformerBlock(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_heads = config.n_heads\n        self.dim = config.dim\n        self.hidden_dim = config.hidden_dim\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.activation = config.activation\n        self.output_attentions = config.output_attentions\n\n        assert config.dim % config.n_heads == 0\n\n        self.attention = TFMultiHeadSelfAttention(config, name=\"attention\")\n        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"sa_layer_norm\")\n\n        self.ffn = TFFFN(config, name=\"ffn\")\n        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"output_layer_norm\")\n\n    def call(self, inputs, training=False):  # removed: src_enc=None, src_len=None\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n        attn_mask: tf.Tensor(bs, seq_length)\n\n        Outputs\n        -------\n        sa_weights: tf.Tensor(bs, n_heads, seq_length, seq_length)\n            The attention weights\n        ffn_output: tf.Tensor(bs, seq_length, dim)\n            The output of the transformer block contextualization.\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        # Self-Attention\n        sa_output = self.attention([x, x, x, attn_mask, head_mask], training=training)\n        if self.output_attentions:\n            sa_output, sa_weights = sa_output  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)\n        else:  # To handle these `output_attention` or `output_hidden_states` cases returning tuples\n            # assert type(sa_output) == tuple\n            sa_output = sa_output[0]\n        sa_output = self.sa_layer_norm(sa_output + x)  # (bs, seq_length, dim)\n\n        # Feed Forward Network\n        ffn_output = self.ffn(sa_output, training=training)  # (bs, seq_length, dim)\n        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)\n\n        output = (ffn_output,)\n        if self.output_attentions:\n            output = (sa_weights,) + output\n        return output\n\n\nclass TFTransformer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.n_layers = config.n_layers\n        self.output_attentions = config.output_attentions\n        
self.output_hidden_states = config.output_hidden_states\n\n        self.layer = [TFTransformerBlock(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layers)]\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Parameters\n        ----------\n        x: tf.Tensor(bs, seq_length, dim)\n            Input sequence embedded.\n        attn_mask: tf.Tensor(bs, seq_length)\n            Attention mask on the sequence.\n\n        Outputs\n        -------\n        hidden_state: tf.Tensor(bs, seq_length, dim)\n            Sequence of hiddens states in the last (top) layer\n        all_hidden_states: Tuple[tf.Tensor(bs, seq_length, dim)]\n            Tuple of length n_layers with the hidden states from each layer.\n            Optional: only if output_hidden_states=True\n        all_attentions: Tuple[tf.Tensor(bs, n_heads, seq_length, seq_length)]\n            Tuple of length n_layers with the attention weights from each layer\n            Optional: only if output_attentions=True\n        \"\"\"\n        x, attn_mask, head_mask = inputs\n\n        all_hidden_states = ()\n        all_attentions = ()\n\n        hidden_state = x\n        for i, layer_module in enumerate(self.layer):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_state,)\n\n            layer_outputs = layer_module([hidden_state, attn_mask, head_mask[i]], training=training)\n            hidden_state = layer_outputs[-1]\n\n            if self.output_attentions:\n                assert len(layer_outputs) == 2\n                attentions = layer_outputs[0]\n                all_attentions = all_attentions + (attentions,)\n            else:\n                assert len(layer_outputs) == 1\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_state,)\n\n        outputs = (hidden_state,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\nclass TFDistilBertMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.num_hidden_layers = config.num_hidden_layers\n\n        self.embeddings = TFEmbeddings(config, name=\"embeddings\")  # Embeddings\n        self.transformer = TFTransformer(config, name=\"transformer\")  # Encoder\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def call(self, inputs, attention_mask=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = 
inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.ones(input_shape)  # (bs, seq_length)\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n\n        embedding_output = self.embeddings(input_ids, inputs_embeds=inputs_embeds)  # (bs, seq_length, dim)\n        tfmr_output = self.transformer([embedding_output, attention_mask, head_mask], training=training)\n\n        return tfmr_output  # last-layer hidden-state, (all hidden_states), (all attentions)\n\n\n# INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #\nclass TFDistilBertPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = DistilBertConfig\n    base_model_prefix = \"distilbert\"\n\n\nDISTILBERT_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.DistilBertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nDISTILBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare DistilBERT encoder/transformer outputing raw hidden-states without any specific head on top.\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertModel(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")  # Embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertModel\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertModel.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n        return outputs\n\n\nclass TFDistilBertLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, 
input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a `masked language modeling` head on top. \"\"\", DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n        self.vocab_size = config.vocab_size\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.vocab_transform = tf.keras.layers.Dense(\n            config.dim, kernel_initializer=get_initializer(config.initializer_range), name=\"vocab_transform\"\n        )\n        self.act = tf.keras.layers.Activation(gelu)\n        self.vocab_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name=\"vocab_layer_norm\")\n        self.vocab_projector = TFDistilBertLMHead(config, self.distilbert.embeddings, name=\"vocab_projector\")\n\n    def get_output_embeddings(self):\n        return self.vocab_projector.input_embeddings\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForMaskedLM\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n    
    prediction_scores = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)\n        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)\n        prediction_logits = self.vocab_projector(prediction_logits)\n\n        outputs = (prediction_logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.pre_classifier = tf.keras.layers.Dense(\n            config.dim,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"relu\",\n            name=\"pre_classifier\",\n        )\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForSequenceClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        
hidden_state = distilbert_output[0]  # (bs, seq_len, dim)\n        pooled_output = hidden_state[:, 0]  # (bs, dim)\n        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)\n        pooled_output = self.dropout(pooled_output, training=kwargs.get(\"training\", False))  # (bs, dim)\n        logits = self.classifier(pooled_output)  # (bs, dim)\n\n        outputs = (logits,) + distilbert_output[1:]\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForTokenClassification\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForTokenClassification.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n        outputs = self.distilbert(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"DistilBert Model with a span classification head on top for extractive 
question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    DISTILBERT_START_DOCSTRING,\n)\nclass TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.distilbert = TFDistilBertMainLayer(config, name=\"distilbert\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n        assert config.num_labels == 2\n        self.dropout = tf.keras.layers.Dropout(config.qa_dropout)\n\n    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1,DistilBertConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import DistilBertTokenizer, TFDistilBertForQuestionAnswering\n\n        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')\n        model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        distilbert_output = self.distilbert(inputs, **kwargs)\n\n        hidden_states = distilbert_output[0]  # (bs, max_query_len, dim)\n        hidden_states = self.dropout(hidden_states, training=kwargs.get(\"training\", False))  # (bs, max_query_len, dim)\n        logits = self.qa_outputs(hidden_states)  # (bs, max_query_len, 2)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + distilbert_output[1:]\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_electra.py",
    "content": "import logging\n\nimport tensorflow as tf\n\nfrom transformers import ElectraConfig\n\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel\nfrom .modeling_tf_utils import get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\n\nTF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"google/electra-small-generator\",\n    \"google/electra-base-generator\",\n    \"google/electra-large-generator\",\n    \"google/electra-small-discriminator\",\n    \"google/electra-base-discriminator\",\n    \"google/electra-large-discriminator\",\n    # See all ELECTRA models at https://huggingface.co/models?filter=electra\n]\n\n\nclass TFElectraEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct the embeddings from word, position and token_type embeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.embedding_size = config.embedding_size\n        self.initializer_range = config.initializer_range\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"position_embeddings\",\n        )\n        self.token_type_embeddings = tf.keras.layers.Embedding(\n            config.type_vocab_size,\n            config.embedding_size,\n            embeddings_initializer=get_initializer(self.initializer_range),\n            name=\"token_type_embeddings\",\n        )\n\n        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load\n        # any TensorFlow checkpoint file\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        with tf.name_scope(\"word_embeddings\"):\n            # Create and initialize weights. 
The random normal initializer was chosen\n            # arbitrarily, and works well.\n            self.word_embeddings = self.add_weight(\n                \"weight\",\n                shape=[self.vocab_size, self.embedding_size],\n                initializer=get_initializer(self.initializer_range),\n            )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\", training=False):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs, training=training)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if input_ids is not None:\n            input_shape = shape_list(input_ids)\n        else:\n            input_shape = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shape[1]\n        if position_ids is None:\n            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        if inputs_embeds is None:\n            inputs_embeds = tf.gather(self.word_embeddings, input_ids)\n        position_embeddings = self.position_embeddings(position_ids)\n        token_type_embeddings = self.token_type_embeddings(token_type_ids)\n\n        embeddings = inputs_embeds + position_embeddings + token_type_embeddings\n        embeddings = self.LayerNorm(embeddings)\n        embeddings = self.dropout(embeddings, training=training)\n        return embeddings\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [batch_size, length, hidden_size]\n            Returns:\n                float32 tensor with shape [batch_size, length, vocab_size].\n        \"\"\"\n        batch_size = shape_list(inputs)[0]\n        length = shape_list(inputs)[1]\n\n        x = tf.reshape(inputs, [-1, self.embedding_size])\n        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)\n\n        return tf.reshape(logits, [batch_size, length, self.vocab_size])\n\n\nclass TFElectraDiscriminatorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.dense = tf.keras.layers.Dense(config.hidden_size, name=\"dense\")\n        self.dense_prediction = tf.keras.layers.Dense(1, name=\"dense_prediction\")\n        self.config = config\n\n    def call(self, 
discriminator_hidden_states, training=False):\n        hidden_states = self.dense(discriminator_hidden_states)\n        hidden_states = ACT2FN[self.config.hidden_act](hidden_states)\n        logits = tf.squeeze(self.dense_prediction(hidden_states))\n\n        return logits\n\n\nclass TFElectraGeneratorPredictions(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n\n        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"LayerNorm\")\n        self.dense = tf.keras.layers.Dense(config.embedding_size, name=\"dense\")\n\n    def call(self, generator_hidden_states, training=False):\n        hidden_states = self.dense(generator_hidden_states)\n        hidden_states = ACT2FN[\"gelu\"](hidden_states)\n        hidden_states = self.LayerNorm(hidden_states)\n\n        return hidden_states\n\n\nclass TFElectraPreTrainedModel(TFBertPreTrainedModel):\n\n    config_class = ElectraConfig\n    base_model_prefix = \"electra\"\n\n    def get_extended_attention_mask(self, attention_mask, input_shape):\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n\n        # We create a 3D attention mask from a 2D tensor mask.\n        # Sizes are [batch_size, 1, 1, to_seq_length]\n        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n        # this attention mask is more simple than the triangular masking of causal attention\n        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n        extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask):\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.config.num_hidden_layers\n\n        return head_mask\n\n\nclass TFElectraMainLayer(TFElectraPreTrainedModel):\n\n    config_class = ElectraConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFElectraEmbeddings(config, name=\"embeddings\")\n\n        if config.embedding_size != config.hidden_size:\n            self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name=\"embeddings_project\")\n        self.encoder = TFBertEncoder(config, name=\"encoder\")\n        self.config = config\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        
position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = tf.fill(input_shape, 1)\n        if token_type_ids is None:\n            token_type_ids = tf.fill(input_shape, 0)\n\n        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)\n        head_mask = self.get_head_mask(head_mask)\n\n        hidden_states = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n        if hasattr(self, \"embeddings_project\"):\n            hidden_states = self.embeddings_project(hidden_states, training=training)\n\n        hidden_states = self.encoder([hidden_states, extended_attention_mask, head_mask], training=training)\n\n        return hidden_states\n\n\nELECTRA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.ElectraConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nELECTRA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.ElectraTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to \"\n    \"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the \"\n    \"hidden size and embedding size are different.\"\n    \"\"\n    \"Both the generator and discriminator checkpoints may be loaded into this model.\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraModel(TFElectraPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraModel\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraModel.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, 
:]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n        \"\"\"\n        outputs = self.electra(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a binary classification head on top as used during pre-training for identifying generated\ntokens.\n\nEven though both the discriminator and generator may be loaded into this model, the discriminator is\nthe only model of the two to have the correct classification head to be used for this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForPreTraining(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name=\"discriminator_predictions\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Prediction scores of the head (scores for each token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForPreTraining\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        logits = self.discriminator_predictions(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n\n\nclass 
TFElectraMaskedLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states, training=False):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a language modeling head on top.\n\nEven though both the discriminator and generator may be loaded into this model, the generator is\nthe only model of the two to have been trained for the masked language modeling task.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForMaskedLM(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.vocab_size = config.vocab_size\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.generator_predictions = TFElectraGeneratorPredictions(config, name=\"generator_predictions\")\n        if isinstance(config.hidden_act, str):\n            self.activation = ACT2FN[config.hidden_act]\n        else:\n            self.activation = config.hidden_act\n        self.generator_lm_head = TFElectraMaskedLMHead(config, self.electra.embeddings, name=\"generator_lm_head\")\n\n    def get_input_embeddings(self):\n        return self.electra.embeddings\n\n    def get_output_embeddings(self):\n        return self.generator_lm_head\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForMaskedLM\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')\n        model = 
TFElectraForMaskedLM.from_pretrained('google/electra-small-generator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n\n        generator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training\n        )\n        generator_sequence_output = generator_hidden_states[0]\n        prediction_scores = self.generator_predictions(generator_sequence_output, training=training)\n        prediction_scores = self.generator_lm_head(prediction_scores, training=training)\n        output = (prediction_scores,)\n        output += generator_hidden_states[1:]\n\n        return output  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"\nElectra model with a token classification head on top.\n\nBoth the discriminator and generator may be loaded into this model.\"\"\",\n    ELECTRA_START_DOCSTRING,\n)\nclass TFElectraForTokenClassification(TFElectraPreTrainedModel):\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n\n        self.electra = TFElectraMainLayer(config, name=\"electra\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(config.num_labels, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)\n    def call(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.ElectraConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import ElectraTokenizer, TFElectraForTokenClassification\n\n        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')\n        model = TFElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n        \"\"\"\n\n        discriminator_hidden_states = self.electra(\n            input_ids, attention_mask, token_type_ids, 
position_ids, head_mask, inputs_embeds, training=training\n        )\n        discriminator_sequence_output = discriminator_hidden_states[0]\n        discriminator_sequence_output = self.dropout(discriminator_sequence_output)\n        logits = self.classifier(discriminator_sequence_output)\n        output = (logits,)\n        output += discriminator_hidden_states[1:]\n\n        return output  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Flaubert model.\n\"\"\"\n\nimport logging\nimport random\n\nimport tensorflow as tf\n\nfrom .configuration_flaubert import FlaubertConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_xlm import (\n    TFXLMForSequenceClassification,\n    TFXLMMainLayer,\n    TFXLMModel,\n    TFXLMWithLMHeadModel,\n    get_masks,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all Flaubert models at https://huggingface.co/models?filter=flaubert\n]\n\nFLAUBERT_START_DOCSTRING = r\"\"\"\n\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.FlaubertConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nFLAUBERT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Flaubert Model transformer outputting raw hidden-states without any specific head on top.\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertModel(TFXLMModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\nclass TFFlaubertMainLayer(TFXLMMainLayer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.layerdrop = getattr(config, \"layerdrop\", 0.0)\n        self.pre_norm = getattr(config, \"pre_norm\", False)\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and 
inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n     
   attentions = ()\n        for i in range(self.n_layers):\n            # LayerDrop\n            dropout_probability = random.uniform(0, 1)\n            if training and (dropout_probability < self.layerdrop):\n                continue\n\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            if not self.pre_norm:\n                attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n                tensor = self.layer_norm1[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm1[i](tensor)\n                attn_outputs = self.attentions[i](\n                    [tensor_normalized, attn_mask, None, cache, head_mask[i]], training=training\n                )\n                attn = attn_outputs[0]\n                if self.output_attentions:\n                    attentions = attentions + (attn_outputs[1],)\n                attn = self.dropout(attn, training=training)\n                tensor = tensor + attn\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            if not self.pre_norm:\n                tensor = tensor + self.ffns[i](tensor)\n                tensor = self.layer_norm2[i](tensor)\n            else:\n                tensor_normalized = self.layer_norm2[i](tensor)\n                tensor = tensor + self.ffns[i](tensor_normalized)\n\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Flaubert Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertWithLMHeadModel(TFXLMWithLMHeadModel):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n\n\n@add_start_docstrings(\n    \"\"\"Flaubert Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    FLAUBERT_START_DOCSTRING,\n)\nclass TFFlaubertForSequenceClassification(TFXLMForSequenceClassification):\n    config_class = FlaubertConfig\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFFlaubertMainLayer(config, name=\"transformer\")\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT-2 model. \"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_gpt2 import GPT2Config\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"gpt2\",\n    \"gpt2-medium\",\n    \"gpt2-large\",\n    \"gpt2-xl\",\n    \"distilgpt2\",\n    # See all GPT-2 models at https://huggingface.co/models?filter=gpt2\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n 
           w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n        if layer_past is not None:\n            past_key, past_value = tf.unstack(layer_past, axis=0)\n            key = tf.concat([past_key, key], axis=-2)\n            value = tf.concat([past_value, value], axis=-2)\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if use_cache is True:\n            present = tf.stack([key, value], axis=0)\n        else:\n            present = (None,)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a, present] + attn_outputs[1:]\n        return outputs  # a, present, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.attn = 
TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n\n    def call(self, inputs, training=False):\n        x, layer_past, attention_mask, head_mask, use_cache = inputs\n\n        a = self.ln_1(x)\n        output_attn = self.attn([a, layer_past, attention_mask, head_mask, use_cache], training=training)\n        a = output_attn[0]  # output_attn: a, present, (attentions)\n        x = x + a\n\n        m = self.ln_2(x)\n        m = self.mlp(m, training=training)\n        x = x + m\n\n        outputs = [x] + output_attn[1:]\n        return outputs  # x, present, (attentions)\n\n\n@keras_serializable\nclass TFGPT2MainLayer(tf.keras.layers.Layer):\n    config_class = GPT2Config\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.wte = TFSharedEmbeddings(\n            config.vocab_size, config.hidden_size, initializer_range=config.initializer_range, name=\"wte\"\n        )\n        self.wpe = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"wpe\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n        self.ln_f = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_f\")\n\n    def get_input_embeddings(self):\n        return self.wte\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            use_cache = inputs[7] if len(inputs) > 7 else use_cache\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", 
position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 8, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if past is None:\n            past_length = 0\n            past = [None] * len(self.h)\n        else:\n            past_length = shape_list(past[0][0])[-2]\n        if position_ids is None:\n            position_ids = tf.range(past_length, input_shape[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.wte(input_ids, mode=\"embedding\")\n        position_embeds = self.wpe(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.wte(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + 
[shape_list(hidden_states)[-1]]\n\n        presents = ()\n        all_attentions = []\n        all_hidden_states = ()\n        for i, (block, layer_past) in enumerate(zip(self.h, past)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, layer_past, attention_mask, head_mask[i], use_cache], training=training)\n\n            hidden_states, present = outputs[:2]\n            presents = presents + (present,)\n\n            if self.output_attentions:\n                all_attentions.append(outputs[2])\n\n        hidden_states = self.ln_f(hidden_states)\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n\n        if use_cache is True:\n            outputs = outputs + (presents,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # last hidden state, presents, (all hidden_states), (attentions)\n\n\nclass TFGPT2PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = GPT2Config\n    base_model_prefix = \"transformer\"\n\n\nGPT2_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.GPT2Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nGPT2_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):\n            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).\n            Indices of input sequence tokens in the vocabulary.\n\n            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `past` output below). Can be used to speed up sequential decoding.\n            The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare GPT2 Model transformer outputing raw hidden-states without any specific head on top.\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2Model(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2Model\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2Model.from_pretrained('gpt2')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n    \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2LMHeadModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # only last token for inputs_ids if past is defined in kwargs\n        if past:\n            inputs = tf.expand_dims(inputs[:, -1], -1)\n\n        return {\"inputs\": inputs, \"past\": past, \"use_cache\": kwargs[\"use_cache\"]}\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2LMHeadModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2LMHeadModel.from_pretrained('gpt2')\n\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, presents, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The GPT2 Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    GPT2_START_DOCSTRING,\n)\nclass TFGPT2DoubleHeadsModel(TFGPT2PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFGPT2MainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.wte\n\n    @add_start_docstrings_to_callable(GPT2_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        past=None,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        use_cache=True,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.GPT2Config`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as `input_ids` as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import GPT2Tokenizer, TFGPT2DoubleHeadsModel\n\n        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n        model = TFGPT2DoubleHeadsModel.from_pretrained('gpt2')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        encoded_choices = [tokenizer.encode(s) for s in choices]\n        cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]\n\n        input_ids = tf.constant(encoded_choices)[None, :]  # Batch size: 1, number of choices: 2\n        mc_token_ids = tf.constant([cls_token_location])  # Batch size: 1\n\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            past = inputs[1] if len(inputs) > 1 else past\n            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            head_mask = inputs[5] if len(inputs) > 5 else head_mask\n            inputs_embeds = inputs[6] if len(inputs) > 6 else inputs_embeds\n            mc_token_ids = inputs[7] if len(inputs) > 7 else mc_token_ids\n            use_cache = inputs[8] if len(inputs) > 8 else use_cache\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            past = inputs.get(\"past\", past)\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            past,\n            flat_attention_mask,\n            flat_token_type_ids,\n            
flat_position_ids,\n            head_mask,\n            inputs_embeds,\n            use_cache,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = self.transformer.wte(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, presents, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 OpenAI GPT model.\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_openai import OpenAIGPTConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFConv1D,\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"openai-gpt\",\n    # See all OpenAI GPT models at https://huggingface.co/models?filter=openai-gpt\n]\n\n\ndef gelu(x):\n    \"\"\"Gaussian Error Linear Unit.\n    This is a smoother version of the RELU.\n    Original paper: https://arxiv.org/abs/1606.08415\n    Args:\n        x: float Tensor to perform activation.\n    Returns:\n        `x` with the GELU activation applied.\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.math.sigmoid(x)\n\n\nACT_FNS = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFAttention(tf.keras.layers.Layer):\n    def __init__(self, nx, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        n_state = nx  # in Attention: n_state=768 (nx=n_embd)\n        # [switch nx => n_state from Block to Attention to keep identical to TF implem]\n        assert n_state % config.n_head == 0\n        self.n_ctx = n_ctx\n        self.n_head = config.n_head\n        self.split_size = n_state\n        self.scale = scale\n\n        self.c_attn = TFConv1D(n_state * 3, nx, initializer_range=config.initializer_range, name=\"c_attn\")\n        self.c_proj = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.attn_dropout = tf.keras.layers.Dropout(config.attn_pdrop)\n        self.resid_dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        pass\n\n    @staticmethod\n    def causal_attention_mask(nd, ns, dtype):\n        \"\"\"1's in the lower triangle, counting from the lower right corner.\n        Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.\n        \"\"\"\n        i = tf.range(nd)[:, None]\n        j = tf.range(ns)\n        m = i >= j - ns + nd\n        return tf.cast(m, dtype)\n\n    def _attn(self, inputs, training=False):\n        q, k, v, attention_mask, head_mask = inputs\n        # q, k, v have shape [batch, heads, sequence, features]\n        w = 
tf.matmul(q, k, transpose_b=True)\n        if self.scale:\n            dk = tf.cast(shape_list(k)[-1], tf.float32)  # scale attention_scores\n            w = w / tf.math.sqrt(dk)\n\n        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.\n        _, _, nd, ns = shape_list(w)\n        b = self.causal_attention_mask(nd, ns, dtype=w.dtype)\n        b = tf.reshape(b, [1, 1, nd, ns])\n        w = w * b - 1e4 * (1 - b)\n\n        if attention_mask is not None:\n            # Apply the attention mask\n            w = w + attention_mask\n\n        w = tf.nn.softmax(w, axis=-1)\n        w = self.attn_dropout(w, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            w = w * head_mask\n\n        outputs = [tf.matmul(w, v)]\n        if self.output_attentions:\n            outputs.append(w)\n        return outputs\n\n    def merge_heads(self, x):\n        x = tf.transpose(x, [0, 2, 1, 3])\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-2] + [x_shape[-2] * x_shape[-1]]\n        return tf.reshape(x, new_x_shape)\n\n    def split_heads(self, x):\n        x_shape = shape_list(x)\n        new_x_shape = x_shape[:-1] + [self.n_head, x_shape[-1] // self.n_head]\n        x = tf.reshape(x, new_x_shape)\n        return tf.transpose(x, (0, 2, 1, 3))  # (batch, head, seq_length, head_features)\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        x = self.c_attn(x)\n        query, key, value = tf.split(x, 3, axis=2)\n        query = self.split_heads(query)\n        key = self.split_heads(key)\n        value = self.split_heads(value)\n\n        attn_outputs = self._attn([query, key, value, attention_mask, head_mask], training=training)\n        a = attn_outputs[0]\n\n        a = self.merge_heads(a)\n        a = self.c_proj(a)\n        a = self.resid_dropout(a, training=training)\n\n        outputs = [a] + attn_outputs[1:]\n        return outputs  # a, (attentions)\n\n\nclass TFMLP(tf.keras.layers.Layer):\n    def __init__(self, n_state, config, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.c_fc = TFConv1D(n_state, nx, initializer_range=config.initializer_range, name=\"c_fc\")\n        self.c_proj = TFConv1D(nx, n_state, initializer_range=config.initializer_range, name=\"c_proj\")\n        self.act = gelu\n        self.dropout = tf.keras.layers.Dropout(config.resid_pdrop)\n\n    def call(self, x, training=False):\n        h = self.act(self.c_fc(x))\n        h2 = self.c_proj(h)\n        h2 = self.dropout(h2, training=training)\n        return h2\n\n\nclass TFBlock(tf.keras.layers.Layer):\n    def __init__(self, n_ctx, config, scale=False, **kwargs):\n        super().__init__(**kwargs)\n        nx = config.n_embd\n        self.attn = TFAttention(nx, n_ctx, config, scale, name=\"attn\")\n        self.ln_1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_1\")\n        self.mlp = TFMLP(4 * nx, config, name=\"mlp\")\n        self.ln_2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name=\"ln_2\")\n\n    def call(self, inputs, training=False):\n        x, attention_mask, head_mask = inputs\n\n        output_attn = self.attn([x, attention_mask, head_mask], training=training)\n        a = output_attn[0]  # output_attn: a, (attentions)\n\n        n = self.ln_1(x + a)\n        m = self.mlp(n, training=training)\n        h = self.ln_2(n + m)\n\n        outputs = 
[h] + output_attn[1:]\n        return outputs  # x, (attentions)\n\n\nclass TFOpenAIGPTMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        self.output_hidden_states = config.output_hidden_states\n        self.output_attentions = config.output_attentions\n        self.num_hidden_layers = config.n_layer\n        self.vocab_size = config.vocab_size\n        self.n_embd = config.n_embd\n\n        self.tokens_embed = TFSharedEmbeddings(\n            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name=\"tokens_embed\"\n        )\n        self.positions_embed = tf.keras.layers.Embedding(\n            config.n_positions,\n            config.n_embd,\n            embeddings_initializer=get_initializer(config.initializer_range),\n            name=\"positions_embed\",\n        )\n        self.drop = tf.keras.layers.Dropout(config.embd_pdrop)\n        self.h = [TFBlock(config.n_ctx, config, scale=True, name=\"h_._{}\".format(i)) for i in range(config.n_layer)]\n\n    def get_input_embeddings(self):\n        return self.tokens_embed\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 6, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_shape = shape_list(input_ids)\n            input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if position_ids is None:\n            position_ids = tf.range(input_shape[-1], dtype=tf.int32)[tf.newaxis, :]\n\n        if attention_mask is not None:\n            # We create a 3D attention mask from a 2D tensor 
mask.\n            # Sizes are [batch_size, 1, 1, to_seq_length]\n            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]\n            # this attention mask is more simple than the triangular masking of causal attention\n            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.\n            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]\n\n            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n            # masked positions, this operation will create a tensor which is 0.0 for\n            # positions we want to attend and -10000.0 for masked positions.\n            # Since we are adding it to the raw scores before the softmax, this is\n            # effectively the same as removing these entirely.\n\n            attention_mask = tf.cast(attention_mask, tf.float32)\n            attention_mask = (1.0 - attention_mask) * -10000.0\n        else:\n            attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])\n\n        if inputs_embeds is None:\n            inputs_embeds = self.tokens_embed(input_ids, mode=\"embedding\")\n        position_embeds = self.positions_embed(position_ids)\n        if token_type_ids is not None:\n            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])\n            token_type_embeds = self.tokens_embed(token_type_ids, mode=\"embedding\")\n        else:\n            token_type_embeds = 0\n        hidden_states = inputs_embeds + position_embeds + token_type_embeds\n        hidden_states = self.drop(hidden_states, training=training)\n\n        output_shape = input_shape + [shape_list(hidden_states)[-1]]\n\n        all_attentions = []\n        all_hidden_states = ()\n        for i, block in enumerate(self.h):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)\n\n            outputs = block([hidden_states, attention_mask, head_mask[i]], training=training)\n            hidden_states = outputs[0]\n            if self.output_attentions:\n                all_attentions.append(outputs[1])\n\n        hidden_states = tf.reshape(hidden_states, output_shape)\n        # Add last hidden state\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            # let the number of heads free (-1) so we can extract attention even after head pruning\n            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]\n            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)\n            outputs = outputs + (all_attentions,)\n        return outputs  # 
last hidden state, (all hidden_states), (attentions)\n\n\nclass TFOpenAIGPTPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = OpenAIGPTConfig\n    base_model_prefix = \"transformer\"\n\n\nOPENAI_GPT_START_DOCSTRING = r\"\"\"\n\n    .. note::\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n\n    Parameters:\n        config (:class:`~transformers1.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nOPENAI_GPT_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.GPT2Tokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? 
<../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare OpenAI GPT transformer model outputing raw hidden-states without any specific head on top.\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)` `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", 
add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTLMHeadModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTLMHeadModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTLMHeadModel.from_pretrained('openai-gpt')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_states = transformer_outputs[0]\n\n        lm_logits = self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n\n        outputs = (lm_logits,) + transformer_outputs[1:]\n\n        return outputs  # lm_logits, (all hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"OpenAI GPT Model transformer with a language modeling and a multiple-choice classification\n    head on top e.g. for RocStories/SWAG tasks. 
The two heads are two linear layers.\n    The language modeling head has its weights tied to the input embeddings,\n    the classification head takes as input the input of a specified classification token index in the input sequence).\n\"\"\",\n    OPENAI_GPT_START_DOCSTRING,\n)\nclass TFOpenAIGPTDoubleHeadsModel(TFOpenAIGPTPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        config.num_labels = 1\n        self.transformer = TFOpenAIGPTMainLayer(config, name=\"transformer\")\n        self.multiple_choice_head = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"multiple_choice_head\"\n        )\n\n    def get_output_embeddings(self):\n        return self.transformer.tokens_embed\n\n    @add_start_docstrings_to_callable(OPENAI_GPT_INPUTS_DOCSTRING)\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        inputs_embeds=None,\n        mc_token_ids=None,\n        training=False,\n    ):\n        r\"\"\"\n        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)\n            Index of the classification token in each input sequence.\n            Selected in the range ``[0, input_ids.size(-1) - 1[``.\n\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.OpenAIGPTConfig`) and inputs:\n        lm_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mc_prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):\n            Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).\n        past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers` with each tensor of shape :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n\n    Examples::\n\n        # For example purposes. 
Not runnable.\n        import tensorflow as tf\n        from transformers1 import OpenAIGPTTokenizer, TFOpenAIGPTDoubleHeadsModel\n\n        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')\n        model = TFOpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')\n\n        # Add a [CLS] to the vocabulary (we should train it also!)\n        # This option is currently not implemented in TF 2.0\n        raise NotImplementedError\n        tokenizer.add_special_tokens({'cls_token': '[CLS]'})\n        model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size\n        print(tokenizer.cls_token_id, len(tokenizer))  # The newly token the last token of the vocabulary\n\n        choices = [\"Hello, my dog is cute [CLS]\", \"Hello, my cat is cute [CLS]\"]\n        input_ids = tf.constant([tokenizer.encode(s) for s in choices])[None, :]  # Batch size 1, 2 choices\n        mc_token_ids = tf.constant([input_ids.size(-1), input_ids.size(-1)])[None, :]  # Batch size 1\n        outputs = model(input_ids, mc_token_ids=mc_token_ids)\n        lm_prediction_scores, mc_prediction_scores = outputs[:2]\n\n        \"\"\"\n\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids\n            position_ids = inputs[3] if len(inputs) > 3 else position_ids\n            head_mask = inputs[4] if len(inputs) > 4 else head_mask\n            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds\n            mc_token_ids = inputs[6] if len(inputs) > 6 else mc_token_ids\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            mc_token_ids = inputs.get(\"mc_token_ids\", mc_token_ids)\n            assert len(inputs) <= 7, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            input_shapes = shape_list(input_ids)\n        else:\n            input_shapes = shape_list(inputs_embeds)[:-1]\n\n        seq_length = input_shapes[-1]\n\n        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None\n        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None\n        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None\n        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None\n\n        flat_inputs = [\n            flat_input_ids,\n            flat_attention_mask,\n            flat_token_type_ids,\n            flat_position_ids,\n            head_mask,\n            inputs_embeds,\n        ]\n\n        transformer_outputs = self.transformer(flat_inputs, training=training)\n        hidden_states = transformer_outputs[0]\n\n        hidden_states = tf.reshape(hidden_states, input_shapes + shape_list(hidden_states)[-1:])\n\n        lm_logits = 
self.transformer.tokens_embed(hidden_states, mode=\"linear\")\n        mc_logits = self.multiple_choice_head([hidden_states, mc_token_ids], training=training)\n\n        mc_logits = tf.squeeze(mc_logits, axis=-1)\n\n        outputs = (lm_logits, mc_logits) + transformer_outputs[1:]\n\n        return outputs  # lm logits, mc logits, (all hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_pytorch_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch - TF 2.0 general utilities.\"\"\"\n\n\nimport logging\nimport os\nimport re\n\nimport numpy\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=\"\"):\n    \"\"\" Convert a TF 2.0 model variable name in a pytorch model weight name.\n\n        Conventions for TF2.0 scopes -> PyTorch attribute names conversions:\n            - '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n            - '_._' is replaced by a new level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n\n        return tuple with:\n            - pytorch model weight name\n            - transpose: boolean indicating weither TF2.0 and PyTorch weights matrices are transposed with regards to each other\n    \"\"\"\n    tf_name = tf_name.replace(\":0\", \"\")  # device ids\n    tf_name = re.sub(\n        r\"/[^/]*___([^/]*)/\", r\"/\\1/\", tf_name\n    )  # '$1___$2' is replaced by $2 (can be used to duplicate or remove layers in TF2.0 vs PyTorch)\n    tf_name = tf_name.replace(\n        \"_._\", \"/\"\n    )  # '_._' is replaced by a level separation (can be used to convert TF2.0 lists in PyTorch nn.ModulesList)\n    tf_name = re.sub(r\"//+\", \"/\", tf_name)  # Remove empty levels at the end\n    tf_name = tf_name.split(\"/\")  # Convert from TF2.0 '/' separators to PyTorch '.' separators\n    tf_name = tf_name[1:]  # Remove level zero\n\n    # When should we transpose the weights\n    transpose = bool(tf_name[-1] == \"kernel\" or \"emb_projs\" in tf_name or \"out_projs\" in tf_name)\n\n    # Convert standard TF2.0 names in PyTorch names\n    if tf_name[-1] == \"kernel\" or tf_name[-1] == \"embeddings\" or tf_name[-1] == \"gamma\":\n        tf_name[-1] = \"weight\"\n    if tf_name[-1] == \"beta\":\n        tf_name[-1] = \"bias\"\n\n    # Remove prefix if needed\n    tf_name = \".\".join(tf_name)\n    if start_prefix_to_remove:\n        tf_name = tf_name.replace(start_prefix_to_remove, \"\", 1)\n\n    return tf_name, transpose\n\n\n#####################\n# PyTorch => TF 2.0 #\n#####################\n\n\ndef load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    pt_path = os.path.abspath(pytorch_checkpoint_path)\n    logger.info(\"Loading PyTorch weights from {}\".format(pt_path))\n\n    pt_state_dict = torch.load(pt_path, map_location=\"cpu\")\n    logger.info(\"PyTorch checkpoint contains {:,} parameters\".format(sum(t.numel() for t in pt_state_dict.values())))\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch checkpoints in a TF 2.0 model\n    \"\"\"\n    pt_state_dict = pt_model.state_dict()\n\n    return load_pytorch_weights_in_tf2_model(\n        tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys\n    )\n\n\ndef load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load pytorch state_dict in a TF 2.0 model.\n    \"\"\"\n    try:\n        import torch  # noqa: F401\n        import tensorflow as tf  # noqa: F401\n        from tensorflow.python.keras import backend as K\n    except ImportError:\n        logger.error(\n            \"Loading a PyTorch model in TensorFlow, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    # Adapt state dict - TODO remove this and update the AWS weights files instead\n    # Convert old format to new format if needed from a PyTorch state_dict\n    old_keys = []\n    new_keys = []\n    for key in pt_state_dict.keys():\n        new_key = None\n        if \"gamma\" in key:\n            new_key = key.replace(\"gamma\", \"weight\")\n        if \"beta\" in key:\n            new_key = key.replace(\"beta\", \"bias\")\n        if new_key:\n            old_keys.append(key)\n            new_keys.append(new_key)\n    for old_key, new_key in zip(old_keys, new_keys):\n        pt_state_dict[new_key] = pt_state_dict.pop(old_key)\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(tf_model.base_model_prefix) for s in pt_state_dict.keys()):\n        start_prefix_to_remove = tf_model.base_model_prefix + \".\"\n\n    symbolic_weights = tf_model.trainable_weights + tf_model.non_trainable_weights\n    tf_loaded_numel = 0\n    weight_value_tuples = []\n    all_pytorch_weights = set(list(pt_state_dict.keys()))\n    for symbolic_weight in symbolic_weights:\n        sw_name = symbolic_weight.name\n        name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            sw_name, start_prefix_to_remove=start_prefix_to_remove\n        )\n\n        # Find associated numpy array in pytorch model state dict\n        if name not in pt_state_dict:\n            if allow_missing_keys:\n                continue\n\n            raise AttributeError(\"{} not found in PyTorch model\".format(name))\n\n        array = pt_state_dict[name].numpy()\n\n        if 
transpose:\n            array = numpy.transpose(array)\n\n        if len(symbolic_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(symbolic_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(symbolic_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (symbolic_weight.shape, array.shape)\n            raise e\n\n        tf_loaded_numel += array.size\n        # logger.warning(\"Initialize TF weight {}\".format(symbolic_weight.name))\n\n        weight_value_tuples.append((symbolic_weight, array))\n        all_pytorch_weights.discard(name)\n\n    K.batch_set_value(weight_value_tuples)\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure restore ops are run\n\n    logger.info(\"Loaded {:,} parameters in the TF 2.0 model.\".format(tf_loaded_numel))\n\n    logger.info(\"Weights or buffers not loaded from PyTorch model: {}\".format(all_pytorch_weights))\n\n    return tf_model\n\n\n#####################\n# TF 2.0 => PyTorch #\n#####################\n\n\ndef load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path, tf_inputs=None, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 HDF5 checkpoint in a PyTorch model\n        We use HDF5 to easily do transfer learning\n        (see https://github.com/tensorflow/tensorflow/blob/ee16fcac960ae660e0e4496658a366e2f745e1f0/tensorflow/python/keras/engine/network.py#L1352-L1357).\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    import transformers\n\n    logger.info(\"Loading TensorFlow weights from {}\".format(tf_checkpoint_path))\n\n    # Instantiate and load the associated TF 2.0 model\n    tf_model_class_name = \"TF\" + pt_model.__class__.__name__  # Add \"TF\" at the beggining\n    tf_model_class = getattr(transformers, tf_model_class_name)\n    tf_model = tf_model_class(pt_model.config)\n\n    if tf_inputs is None:\n        tf_inputs = tf_model.dummy_inputs\n\n    if tf_inputs is not None:\n        tf_model(tf_inputs, training=False)  # Make sure model is built\n\n    tf_model.load_weights(tf_checkpoint_path, by_name=True)\n\n    return load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_model_in_pytorch_model(pt_model, tf_model, allow_missing_keys=False):\n    \"\"\" Load TF 2.0 model in a pytorch model\n    \"\"\"\n    weights = tf_model.weights\n\n    return load_tf2_weights_in_pytorch_model(pt_model, weights, allow_missing_keys=allow_missing_keys)\n\n\ndef load_tf2_weights_in_pytorch_model(pt_model, tf_weights, allow_missing_keys=False):\n    \"\"\" Load TF2.0 symbolic weights in a PyTorch model\n    \"\"\"\n    try:\n        import tensorflow as tf  # noqa: F401\n        import torch  # noqa: F401\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n            \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n\n    new_pt_params_dict = {}\n    current_pt_params_dict = dict(pt_model.named_parameters())\n\n    # Make sure we are able to load PyTorch base models as well as derived models (with heads)\n    # TF models always have a prefix, some of PyTorch models (base ones) don't\n    start_prefix_to_remove = \"\"\n    if not any(s.startswith(pt_model.base_model_prefix) for s in current_pt_params_dict.keys()):\n        start_prefix_to_remove = pt_model.base_model_prefix + \".\"\n\n    # Build a map from potential PyTorch weight names to TF 2.0 Variables\n    tf_weights_map = {}\n    for tf_weight in tf_weights:\n        pt_name, transpose = convert_tf_weight_name_to_pt_weight_name(\n            tf_weight.name, start_prefix_to_remove=start_prefix_to_remove\n        )\n        tf_weights_map[pt_name] = (tf_weight.numpy(), transpose)\n\n    all_tf_weights = set(list(tf_weights_map.keys()))\n    loaded_pt_weights_data_ptr = {}\n    missing_keys_pt = []\n    for pt_weight_name, pt_weight in current_pt_params_dict.items():\n        # Handle PyTorch shared weight ()not duplicated in TF 2.0\n        if pt_weight.data_ptr() in loaded_pt_weights_data_ptr:\n            new_pt_params_dict[pt_weight_name] = loaded_pt_weights_data_ptr[pt_weight.data_ptr()]\n            continue\n\n        # Find associated numpy array in pytorch model state dict\n        if pt_weight_name not in tf_weights_map:\n            if allow_missing_keys:\n                missing_keys_pt.append(pt_weight_name)\n                continue\n\n            raise AttributeError(\"{} not found in TF 2.0 model\".format(pt_weight_name))\n\n        array, transpose = tf_weights_map[pt_weight_name]\n\n        if transpose:\n            array = numpy.transpose(array)\n\n        if len(pt_weight.shape) < len(array.shape):\n            array = numpy.squeeze(array)\n        elif len(pt_weight.shape) > len(array.shape):\n            array = numpy.expand_dims(array, axis=0)\n\n        try:\n            assert list(pt_weight.shape) == list(array.shape)\n        except AssertionError as e:\n            e.args += (pt_weight.shape, array.shape)\n            raise e\n\n        # logger.warning(\"Initialize PyTorch weight {}\".format(pt_weight_name))\n\n        new_pt_params_dict[pt_weight_name] = torch.from_numpy(array)\n        loaded_pt_weights_data_ptr[pt_weight.data_ptr()] = torch.from_numpy(array)\n        all_tf_weights.discard(pt_weight_name)\n\n    missing_keys, unexpected_keys = pt_model.load_state_dict(new_pt_params_dict, strict=False)\n    missing_keys += missing_keys_pt\n\n    if len(missing_keys) > 0:\n        logger.info(\n            \"Weights of {} not initialized from TF 2.0 model: {}\".format(pt_model.__class__.__name__, missing_keys)\n        )\n    if len(unexpected_keys) > 0:\n        logger.info(\n            \"Weights from TF 2.0 model not used in {}: {}\".format(pt_model.__class__.__name__, unexpected_keys)\n        )\n\n    logger.info(\"Weights or buffers not loaded from TF 2.0 model: {}\".format(all_tf_weights))\n\n    return pt_model\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 RoBERTa model. \"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_roberta import RobertaConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"roberta-base\",\n    \"roberta-large\",\n    \"roberta-large-mnli\",\n    \"distilroberta-base\",\n    # See all RoBERTa models at https://huggingface.co/models?filter=roberta\n]\n\n\nclass TFRobertaEmbeddings(TFBertEmbeddings):\n    \"\"\"\n    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.padding_idx = 1\n\n    def create_position_ids_from_input_ids(self, x):\n        \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n        padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n        `utils.make_positions`.\n        :param tf.Tensor x:\n        :return tf.Tensor:\n        \"\"\"\n        mask = tf.cast(tf.math.not_equal(x, self.padding_idx), dtype=tf.int32)\n        incremental_indicies = tf.math.cumsum(mask, axis=1) * mask\n        return incremental_indicies + self.padding_idx\n\n    def create_position_ids_from_inputs_embeds(self, inputs_embeds):\n        \"\"\" We are provided embeddings directly. We cannot infer which are padded so just generate\n        sequential position ids.\n        :param tf.Tensor inputs_embeds:\n        :return tf.Tensor:\n        \"\"\"\n        seq_length = shape_list(inputs_embeds)[1]\n\n        position_ids = tf.range(self.padding_idx + 1, seq_length + self.padding_idx + 1, dtype=tf.int32)[tf.newaxis, :]\n        return position_ids\n\n    def _embedding(self, inputs, training=False):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        input_ids, position_ids, token_type_ids, inputs_embeds = inputs\n\n        if position_ids is None:\n            if input_ids is not None:\n                # Create the position ids from the input token ids. 
Any padded tokens remain padded.\n                position_ids = self.create_position_ids_from_input_ids(input_ids)\n            else:\n                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)\n\n        return super()._embedding([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)\n\n\nclass TFRobertaMainLayer(TFBertMainLayer):\n    \"\"\"\n    Same as TFBertMainLayer but uses TFRobertaEmbeddings.\n    \"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.embeddings = TFRobertaEmbeddings(config, name=\"embeddings\")\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n\nclass TFRobertaPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = RobertaConfig\n    base_model_prefix = \"roberta\"\n\n\nROBERTA_START_DOCSTRING = r\"\"\"\n    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.\n    Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.RobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nROBERTA_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.RobertaTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`__\n        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`__\n        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):\n            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them\n            (if set to :obj:`False`) for evaluation.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare RoBERTa Model transformer outputing raw hidden-states without any specific head on top.\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaModel(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):\n            Last layer hidden-state of the first token of the sequence (classification token)\n            further processed by a Linear layer and a Tanh activation function. 
The Linear\n            layer weights are trained from the next sentence prediction (classification)\n            objective during Bert pretraining. This output is usually *not* a good summary\n            of the semantic content of the input, you're often better with averaging or pooling\n            the sequence of hidden-states for the whole input sequence.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaModel\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaModel.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n        return outputs\n\n\nclass TFRobertaLMHead(tf.keras.layers.Layer):\n    \"\"\"Roberta Head for masked language modeling.\"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name=\"dense\"\n        )\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.act = tf.keras.layers.Activation(gelu)\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.decoder = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, features):\n        x = self.dense(features)\n        x = self.act(x)\n        x = self.layer_norm(x)\n\n        # project back to size of vocabulary with bias\n        x = self.decoder(x, mode=\"linear\") + self.bias\n\n        return x\n\n\n@add_start_docstrings(\"\"\"RoBERTa Model with a `language modeling` head on top. 
\"\"\", ROBERTA_START_DOCSTRING)\nclass TFRobertaForMaskedLM(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.lm_head = TFRobertaLMHead(config, self.roberta.embeddings, name=\"lm_head\")\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForMaskedLM\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForMaskedLM.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        prediction_scores = self.lm_head(sequence_output)\n\n        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here\n\n        return outputs  # prediction_scores, (hidden_states), (attentions)\n\n\nclass TFRobertaClassificationHead(tf.keras.layers.Layer):\n    \"\"\"Head for sentence-level classification tasks.\"\"\"\n\n    def __init__(self, config, **kwargs):\n        super().__init__(config, **kwargs)\n        self.dense = tf.keras.layers.Dense(\n            config.hidden_size,\n            kernel_initializer=get_initializer(config.initializer_range),\n            activation=\"tanh\",\n            name=\"dense\",\n        )\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.out_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"out_proj\"\n        )\n\n    def call(self, features, training=False):\n        x = features[:, 0, :]  # take <s> token (equiv. 
to [CLS])\n        x = self.dropout(x, training=training)\n        x = self.dense(x)\n        x = self.dropout(x, training=training)\n        x = self.out_proj(x)\n        return x\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForSequenceClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.classifier = TFRobertaClassificationHead(config, name=\"classifier\")\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        logits (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForSequenceClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n        logits = self.classifier(sequence_output, training=kwargs.get(\"training\", False))\n\n        outputs = (logits,) + outputs[2:]\n\n        return outputs  # logits, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForTokenClassification(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForTokenClassification\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForTokenClassification.from_pretrained('roberta-base')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output, training=kwargs.get(\"training\", False))\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n\n        return outputs  # scores, (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"RoBERTa Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    ROBERTA_START_DOCSTRING,\n)\nclass TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.roberta = TFRobertaMainLayer(config, name=\"roberta\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.RobertaConfig`) and inputs:\n        start_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):\n            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n    Examples::\n\n        # The checkpoint roberta-base is not fine-tuned for question answering. Please see the\n        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.\n\n        import tensorflow as tf\n        from transformers1 import RobertaTokenizer, TFRobertaForQuestionAnswering\n\n        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')\n        model = TFRobertaForQuestionAnswering.from_pretrained('roberta-base')\n        input_ids = tokenizer.encode(\"Who was Jim Henson?\", \"Jim Henson was a nice puppet\")\n        start_scores, end_scores = model(tf.constant(input_ids)[None, :]) # Batch size 1\n\n        all_tokens = tokenizer.convert_ids_to_tokens(input_ids)\n        answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])\n\n        \"\"\"\n        outputs = self.roberta(inputs, **kwargs)\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 T5 model. \"\"\"\n\n\nimport copy\nimport itertools\nimport logging\nimport math\n\nimport tensorflow as tf\n\nfrom .configuration_t5 import T5Config\nfrom .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list\n\n\nlogger = logging.getLogger(__name__)\n\nTF_T5_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"t5-small\",\n    \"t5-base\",\n    \"t5-large\",\n    \"t5-3b\",\n    \"t5-11b\",\n    # See all T5 models at https://huggingface.co/models?filter=t5\n]\n\n####################################################\n# TF 2.0 Models are constructed using Keras imperative API by sub-classing\n# - tf.keras.layers.Layer for the layers and\n# - TFPreTrainedModel for the models (it-self a sub-class of tf.keras.Model)\n####################################################\n\n\nclass TFT5LayerNorm(tf.keras.layers.Layer):\n    def __init__(self, epsilon=1e-6, **kwargs):\n        \"\"\" Construct a layernorm module in the T5 style\n            No bias and no substraction of mean.\n        \"\"\"\n        super().__init__(**kwargs)\n        self.variance_epsilon = epsilon\n\n    def build(self, input_shape):\n        \"\"\"Build shared word embedding layer \"\"\"\n        self.weight = self.add_weight(\"weight\", shape=(input_shape[-1],), initializer=\"ones\")\n        super().build(input_shape)\n\n    def call(self, x):\n        variance = tf.math.reduce_mean(tf.math.square(x), axis=-1, keepdims=True)\n        x = x * tf.math.rsqrt(variance + self.variance_epsilon)\n        return self.weight * x\n\n\nclass TFT5DenseReluDense(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.wi = tf.keras.layers.Dense(config.d_ff, use_bias=False, name=\"wi\")\n        self.wo = tf.keras.layers.Dense(config.d_model, use_bias=False, name=\"wo\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n        self.act = tf.keras.activations.relu\n\n    def call(self, hidden_states, training=False):\n        h = self.wi(hidden_states)\n        h = self.act(h)\n        h = self.dropout(h, training=training)\n        h = self.wo(h)\n        return h\n\n\nclass TFT5LayerFF(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.DenseReluDense = TFT5DenseReluDense(config, name=\"DenseReluDense\")\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(self, hidden_states, training=False):\n        norm_x = self.layer_norm(hidden_states)\n        y = self.DenseReluDense(norm_x, training=training)\n        layer_output = 
hidden_states + self.dropout(y, training=training)\n        return layer_output\n\n\nclass TFT5Attention(tf.keras.layers.Layer):\n    NEW_ID = itertools.count()\n\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_id = next(TFT5Attention.NEW_ID)\n        self.is_decoder = config.is_decoder\n        self.has_relative_attention_bias = has_relative_attention_bias\n\n        self.output_attentions = config.output_attentions\n        self.relative_attention_num_buckets = config.relative_attention_num_buckets\n        self.d_model = config.d_model\n        self.d_kv = config.d_kv\n        self.n_heads = config.num_heads\n        self.inner_dim = self.n_heads * self.d_kv\n\n        # Mesh TensorFlow initialization to avoid scaling before softmax\n        self.q = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"q\")\n        self.k = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"k\")\n        self.v = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name=\"v\")\n        self.o = tf.keras.layers.Dense(self.d_model, use_bias=False, name=\"o\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n        if self.has_relative_attention_bias:\n            self.relative_attention_bias = tf.keras.layers.Embedding(\n                self.relative_attention_num_buckets, self.n_heads, name=\"relative_attention_bias\",\n            )\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):\n        \"\"\"\n        Adapted from Mesh Tensorflow:\n        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593\n\n        Translate relative position to a bucket number for relative attention.\n        The relative position is defined as memory_position - query_position, i.e.\n        the distance in tokens from the attending position to the attended-to\n        position.  If bidirectional=False, then positive relative positions are\n        invalid.\n        We use smaller buckets for small absolute relative_position and larger buckets\n        for larger absolute relative_positions.  All relative positions >=max_distance\n        map to the same bucket.  All relative positions <=-max_distance map to the\n        same bucket.  
This should allow for more graceful generalization to longer\n        sequences than the model has been trained on.\n        Args:\n            relative_position: an int32 Tensor\n            bidirectional: a boolean - whether the attention is bidirectional\n            num_buckets: an integer\n            max_distance: an integer\n        Returns:\n            a Tensor with the same shape as relative_position, containing int32\n            values in the range [0, num_buckets)\n        \"\"\"\n        ret = 0\n        n = -relative_position\n        if bidirectional:\n            num_buckets //= 2\n            ret += tf.dtypes.cast(tf.math.less(n, 0), tf.int32) * num_buckets\n            n = tf.math.abs(n)\n        else:\n            n = tf.math.maximum(n, 0)\n        # now n is in the range [0, inf)\n        max_exact = num_buckets // 2\n        is_small = tf.math.less(n, max_exact)\n        val_if_large = max_exact + tf.dtypes.cast(\n            tf.math.log(tf.dtypes.cast(n, tf.float32) / max_exact)\n            / math.log(max_distance / max_exact)\n            * (num_buckets - max_exact),\n            tf.int32,\n        )\n        val_if_large = tf.math.minimum(val_if_large, num_buckets - 1)\n        ret += tf.where(is_small, n, val_if_large)\n        return ret\n\n    def compute_bias(self, qlen, klen):\n        \"\"\" Compute binned relative position bias \"\"\"\n        context_position = tf.range(qlen)[:, None]\n        memory_position = tf.range(klen)[None, :]\n        relative_position = memory_position - context_position  # shape (qlen, klen)\n        rp_bucket = self._relative_position_bucket(\n            relative_position, bidirectional=not self.is_decoder, num_buckets=self.relative_attention_num_buckets,\n        )\n        values = self.relative_attention_bias(rp_bucket)  # shape (qlen, klen, num_heads)\n        values = tf.expand_dims(tf.transpose(values, [2, 0, 1]), axis=0)  # shape (1, num_heads, qlen, klen)\n        return values\n\n    def call(\n        self,\n        input,\n        mask=None,\n        kv=None,\n        position_bias=None,\n        cache=None,\n        past_key_value_state=None,\n        head_mask=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        # past_key_value_state[0] is (bs, n_heads, q_len - 1, dim_per_head)\n        bs, qlen, dim = shape_list(input)\n\n        if past_key_value_state is not None:\n            assert self.is_decoder is True, \"Encoder cannot cache past key value states\"\n            assert (\n                len(past_key_value_state) == 2\n            ), \"past_key_value_state should have 2 past states: keys and values. 
Got {} past states\".format(\n                len(past_key_value_state)\n            )\n            real_qlen = qlen + shape_list(past_key_value_state[0])[2] if query_length is None else query_length\n        else:\n            real_qlen = qlen\n\n        if kv is None:\n            klen = real_qlen\n        else:\n            klen = shape_list(kv)[1]\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, self.d_kv)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.inner_dim))\n\n        q = shape(self.q(input))  # (bs, n_heads, qlen, dim_per_head)\n\n        if kv is None:\n            k = shape(self.k(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif past_key_value_state is None:\n            k = v = kv\n            k = shape(self.k(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if past_key_value_state is not None:\n            if kv is None:\n                k_, v_ = past_key_value_state\n                k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n            else:\n                k, v = past_key_value_state\n\n        # to cope with keras serialization\n        # we need to cast `use_cache` to correct bool\n        # if it is a tensor\n        if tf.is_tensor(use_cache):\n            if hasattr(use_cache, \"numpy\"):\n                use_cache = bool(use_cache.numpy())\n            else:\n                use_cache = True\n\n        if self.is_decoder and use_cache is True:\n            present_key_value_state = ((k, v),)\n        else:\n            present_key_value_state = (None,)\n\n        scores = tf.einsum(\"bnqd,bnkd->bnqk\", q, k)  # (bs, n_heads, qlen, klen)\n\n        if position_bias is None:\n            if not self.has_relative_attention_bias:\n                raise ValueError(\"No position_bias provided and no weights to compute position_bias\")\n            position_bias = self.compute_bias(real_qlen, klen)\n\n            # if key and values are already calculated\n            # we want only the last query position bias\n            if past_key_value_state is not None:\n                position_bias = position_bias[:, :, -1:, :]\n\n            if mask is not None:\n                position_bias = position_bias + mask  # (bs, n_heads, qlen, klen)\n\n        scores += position_bias\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        context = self.o(context)\n\n        outputs = (context,) + present_key_value_state\n\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        if self.has_relative_attention_bias:\n            outputs = outputs + (position_bias,)\n        return outputs\n\n\nclass TFT5LayerSelfAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, 
**kwargs):\n        super().__init__(**kwargs)\n        self.SelfAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"SelfAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.SelfAttention(\n            norm_x,\n            mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5LayerCrossAttention(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.EncDecAttention = TFT5Attention(\n            config, has_relative_attention_bias=has_relative_attention_bias, name=\"EncDecAttention\",\n        )\n        self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def call(\n        self,\n        hidden_states,\n        kv,\n        attention_mask=None,\n        position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        query_length=None,\n        use_cache=False,\n        training=False,\n    ):\n        norm_x = self.layer_norm(hidden_states)\n        attention_output = self.EncDecAttention(\n            norm_x,\n            mask=attention_mask,\n            kv=kv,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=past_key_value_state,\n            query_length=query_length,\n            use_cache=use_cache,\n            training=training,\n        )\n        y = attention_output[0]\n        layer_output = hidden_states + self.dropout(y, training=training)\n        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them\n        return outputs\n\n\nclass TFT5Block(tf.keras.layers.Layer):\n    def __init__(self, config, has_relative_attention_bias=False, **kwargs):\n        super().__init__(**kwargs)\n        self.is_decoder = config.is_decoder\n        self.layer = []\n        self.layer.append(\n            TFT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._0\",)\n        )\n        if self.is_decoder:\n            self.layer.append(\n                TFT5LayerCrossAttention(\n                    config, has_relative_attention_bias=has_relative_attention_bias, name=\"layer_._1\",\n                )\n            )\n\n        self.layer.append(TFT5LayerFF(config, name=\"layer_._{}\".format(len(self.layer))))\n\n    def call(\n        self,\n        hidden_states,\n        attention_mask=None,\n        position_bias=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        
encoder_decoder_position_bias=None,\n        head_mask=None,\n        past_key_value_state=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if past_key_value_state is not None:\n            assert self.is_decoder, \"Only decoder can use `past_key_value_states`\"\n            expected_num_past_key_value_states = 2 if encoder_hidden_states is None else 4\n\n            error_message = \"There should be {} past states. 2 (past / key) for self attention.{} Got {} past key / value states\".format(\n                expected_num_past_key_value_states,\n                \"2 (past / key) for cross attention\" if expected_num_past_key_value_states == 4 else \"\",\n                len(past_key_value_state),\n            )\n            assert len(past_key_value_state) == expected_num_past_key_value_states, error_message\n\n            self_attn_past_key_value_state = past_key_value_state[:2]\n            cross_attn_past_key_value_state = past_key_value_state[2:]\n        else:\n            self_attn_past_key_value_state, cross_attn_past_key_value_state = None, None\n\n        self_attention_outputs = self.layer[0](\n            hidden_states,\n            attention_mask=attention_mask,\n            position_bias=position_bias,\n            head_mask=head_mask,\n            past_key_value_state=self_attn_past_key_value_state,\n            use_cache=use_cache,\n            training=training,\n        )\n        hidden_states, present_key_value_state = self_attention_outputs[:2]\n        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights\n\n        if self.is_decoder and encoder_hidden_states is not None:\n            # the actual query length is unknown for cross attention\n            # if using past key value states. 
Need to inject it here\n            if present_key_value_state is not None:\n                query_length = shape_list(present_key_value_state[0])[2]\n            else:\n                query_length = None\n\n            cross_attention_outputs = self.layer[1](\n                hidden_states,\n                kv=encoder_hidden_states,\n                attention_mask=encoder_attention_mask,\n                position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask,\n                past_key_value_state=cross_attn_past_key_value_state,\n                query_length=query_length,\n                use_cache=use_cache,\n                training=training,\n            )\n            hidden_states = cross_attention_outputs[0]\n            # Combine self attn and cross attn key value states\n            if present_key_value_state is not None:\n                present_key_value_state = present_key_value_state + cross_attention_outputs[1]\n\n            # Keep cross-attention outputs and relative position weights\n            attention_outputs = attention_outputs + cross_attention_outputs[2:]\n\n        # Apply Feed Forward layer\n        hidden_states = self.layer[-1](hidden_states, training=training)\n        outputs = (hidden_states,)\n\n        # Add attentions if we output them\n        outputs = outputs + (present_key_value_state,) + attention_outputs\n        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n\n\nclass _NoLayerEmbedTokens(object):\n    \"\"\"\n     this class wraps a the TFSharedEmbeddingTokens layer into a python 'no-keras-layer'\n     class to avoid problem with weight restoring. Also it makes sure that the layer is\n     called from the correct scope to avoid problem with saving/storing the correct weights\n    \"\"\"\n\n    def __init__(self, layer, abs_scope_name=None):\n        self._layer = layer\n        self._abs_scope_name = abs_scope_name\n\n    def call(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer.call(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer.call(inputs, mode)\n\n    def __call__(self, inputs, mode=\"embedding\"):\n        if self._abs_scope_name is None:\n            return self._layer(inputs, mode)\n\n        # if an abs scope name is given to the embedding variable, call variable from absolute scope\n        with tf.compat.v1.variable_scope(self._abs_scope_name, auxiliary_name_scope=False) as abs_scope_name:\n            with tf.name_scope(abs_scope_name.original_name_scope):\n                return self._layer(inputs, mode)\n\n\n####################################################\n# The full model without a specific pretrained or finetuning head is\n# provided as a tf.keras.layers.Layer usually called \"TFT5MainLayer\"\n####################################################\nclass TFT5MainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, embed_tokens=None, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.embed_tokens = embed_tokens\n        self.is_decoder = config.is_decoder\n\n        self.config = config\n        self.num_hidden_layers = config.num_layers\n\n        self.block = [\n            TFT5Block(config, has_relative_attention_bias=bool(i == 0), name=\"block_._{}\".format(i),)\n            for i in range(config.num_layers)\n        ]\n        self.final_layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name=\"final_layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout_rate)\n\n    def get_input_embeddings(self):\n        return self.embed_tokens\n\n    def get_output_embeddings(self):\n        return self.embed_tokens\n\n    def set_embed_tokens(self, embed_tokens):\n        self.embed_tokens = embed_tokens\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError  # Not implemented yet in the library fr TF 2.0 models\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError  # Not implemented yet in the library fr TF 2.0 models\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        encoder_hidden_states=None,\n        encoder_attention_mask=None,\n        inputs_embeds=None,\n        head_mask=None,\n        past_key_value_states=None,\n        use_cache=False,\n        training=False,\n    ):\n\n        if inputs is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both inputs and inputs_embeds at the same time\")\n        elif inputs is not None:\n            input_shape = shape_list(inputs)\n            inputs = tf.reshape(inputs, (-1, input_shape[-1]))\n        elif inputs_embeds is not None:\n            input_shape = shape_list(inputs_embeds)[:-1]\n        else:\n            raise ValueError(\"You have to specify either inputs or inputs_embeds\")\n\n        if inputs_embeds is None:\n            assert self.embed_tokens is not None, \"You have to intialize the model with valid token embeddings\"\n            inputs_embeds = self.embed_tokens(inputs)\n\n        batch_size, seq_length = input_shape\n\n        if past_key_value_states is not None:\n            assert seq_length == 1, \"Input shape is {}, but should be {} when using past_key_value_sates\".format(\n                input_shape, (batch_size, 1)\n            )\n            # required mask seq length can be calculated via length of past\n            # key value states and seq_length = 1 for the last token\n            mask_seq_length = shape_list(past_key_value_states[0][0])[2] + seq_length\n        else:\n            mask_seq_length = seq_length\n\n        if attention_mask is None:\n            attention_mask = tf.fill((batch_size, mask_seq_length), 1)\n        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:\n            encoder_seq_length = shape_list(encoder_hidden_states)[1]\n            encoder_attention_mask = tf.fill((batch_size, encoder_seq_length), 1)\n\n        # initialize past_key_value_states with `None` if past does not exist\n        if past_key_value_states is None:\n            past_key_value_states = [None] * len(self.block)\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        attention_mask = tf.cast(attention_mask, dtype=tf.float32)\n        num_dims_attention_mask = len(shape_list(attention_mask))\n        if 
num_dims_attention_mask == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif num_dims_attention_mask == 2:\n            # Provided a padding mask of dimensions [batch_size, mask_seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            if self.is_decoder:\n                seq_ids = tf.range(mask_seq_length)\n                causal_mask = tf.less_equal(\n                    tf.tile(seq_ids[None, None, :], (batch_size, mask_seq_length, 1)), seq_ids[None, :, None],\n                )\n                causal_mask = tf.cast(causal_mask, dtype=tf.float32)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n                if past_key_value_states[0] is not None:\n                    extended_attention_mask = extended_attention_mask[:, :, -1:, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n        # extended_attention_mask = tf.math.equal(extended_attention_mask,\n        #                                         tf.transpose(extended_attention_mask, perm=(-1, -2)))\n\n        extended_attention_mask = (1.0 - extended_attention_mask) * -1e9\n\n        if self.is_decoder and encoder_attention_mask is not None:\n            # If a 2D ou 3D attention mask is provided for the cross-attention\n            # we need to make broadcastabe to [batch_size, num_heads, mask_seq_length, mask_seq_length]\n            # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]\n            encoder_attention_mask = tf.cast(encoder_attention_mask, dtype=tf.float32)\n            num_dims_encoder_attention_mask = len(shape_list(encoder_attention_mask))\n            if num_dims_encoder_attention_mask == 3:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n            if num_dims_encoder_attention_mask == 2:\n                encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n\n            # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion\n            # Cf. 
https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270\n            # encoder_extended_attention_mask = tf.math.equal(encoder_extended_attention_mask,\n            #                                         tf.transpose(encoder_extended_attention_mask, perm=(-1, -2)))\n\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.num_hidden_layers\n            # head_mask = tf.constant([0] * self.num_hidden_layers)\n\n        present_key_value_states = ()\n        all_hidden_states = ()\n        all_attentions = ()\n        position_bias = None\n        encoder_decoder_position_bias = None\n\n        hidden_states = self.dropout(inputs_embeds, training=training)\n\n        for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):\n            if self.output_hidden_states:\n                all_hidden_states = all_hidden_states + (hidden_states,)\n\n            layer_outputs = layer_module(\n                hidden_states,\n                attention_mask=extended_attention_mask,\n                position_bias=position_bias,\n                encoder_hidden_states=encoder_hidden_states,\n                encoder_attention_mask=encoder_extended_attention_mask,\n                encoder_decoder_position_bias=encoder_decoder_position_bias,\n                head_mask=head_mask[i],\n                past_key_value_state=past_key_value_state,\n                use_cache=use_cache,\n                training=training,\n            )\n            # layer_outputs is a tuple with:\n            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n            hidden_states, present_key_value_state = layer_outputs[:2]\n            if i == 0:\n                # We share the position biases between the layers - the first layer store them\n                # layer_outputs = hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)\n                position_bias = layer_outputs[3 if self.output_attentions else 2]\n                if self.is_decoder and encoder_hidden_states is not None:\n                    encoder_decoder_position_bias = layer_outputs[5 if self.output_attentions else 3]\n            # append next layer key value states\n            present_key_value_states = present_key_value_states + (present_key_value_state,)\n\n            if self.output_attentions:\n                all_attentions = all_attentions + (layer_outputs[2],)\n\n        hidden_states = self.final_layer_norm(hidden_states)\n        hidden_states = self.dropout(hidden_states, training=training)\n\n        # Add last layer\n        if self.output_hidden_states:\n            all_hidden_states = all_hidden_states + (hidden_states,)\n\n        outputs = (hidden_states,)\n        if 
use_cache is True:\n            assert self.is_decoder, \"`use_cache` can only be set to `True` if {} is used as a decoder\".format(self)\n            outputs = outputs + (present_key_value_states,)\n        if self.output_hidden_states:\n            outputs = outputs + (all_hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (all_attentions,)\n        return outputs  # last-layer hidden state, (all hidden states), (all attentions)\n\n\n####################################################\n# TFT5PreTrainedModel is a sub-class of tf.keras.Model\n# which take care of loading and saving pretrained weights\n# and various common utilities.\n# Here you just need to specify a few (self-explanatory)\n# pointers for your model.\n####################################################\nclass TFT5PreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = T5Config\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        inputs = tf.constant(DUMMY_INPUTS)\n        input_mask = tf.constant(DUMMY_MASK)\n        dummy_inputs = {\n            \"inputs\": inputs,\n            \"decoder_input_ids\": inputs,\n            \"decoder_attention_mask\": input_mask,\n        }\n        return dummy_inputs\n\n\nT5_START_DOCSTRING = r\"\"\"    The T5 model was proposed in\n    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_\n    by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\n    It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.\n\n    This model is a tf.keras.Model `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and\n    refer to the TF 2.0 documentation for all matter related to general usage and behavior.\n\n    .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:\n        https://arxiv.org/abs/1910.10683\n\n    .. 
_`tf.keras.Model`:\n        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model\n\n    Note on the model inputs:\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is usefull when using `tf.keras.Model.fit()` method which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :\n\n        - a single Tensor with inputs only and nothing else: `model(inputs_ids)\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n            `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associaed to the input names given in the docstring:\n            `model({'inputs': inputs, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.T5Config`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nT5_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        inputs are usually used as a `dict` (see T5 description above for more information) containing all the following.\n\n        inputs (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n            T5 is a model with relative position embeddings so you should be able to pad the inputs on\n            the right or the left.\n            Indices can be obtained using :class:`transformers1.T5Tokenizer`.\n            To know more on how to prepare :obj:`inputs` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.convert_tokens_to_ids` for details.\n        decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):\n            Provide for sequence to sequence training. 
T5 uses the pad_token_id as the starting token for decoder_input_ids generation.\n            If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`).\n        attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n        encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`, defaults to :obj:`None`):\n            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)\n            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.\n            Used in the cross-attention of the decoder.\n        decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):\n            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up decoding.\n            If `decoder_past_key_value_states` are used, the user can optionally input only the last `decoder_input_ids`\n            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`\n        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).\n        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `inputs` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at\n            `T5 Training <./t5.html#training>`_ .\n        head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            ``1`` indicates the head is **not masked**, ``0`` indicates the head is 
**masked**.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare T5 Model transformer outputting raw hidden-states\" \"without any specific head on top.\",\n    T5_START_DOCSTRING,\n)\nclass TFT5Model(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n            If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `hidden-state` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n                :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5Model\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5Model.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my 
dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        return decoder_outputs + encoder_outputs\n\n\n@add_start_docstrings(\"\"\"T5 Model with a `language modeling` head on top. 
\"\"\", T5_START_DOCSTRING)\nclass TFT5ForConditionalGeneration(TFT5PreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.model_dim = config.d_model\n\n        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name=\"shared\")\n\n        # retrieve correct absolute scope for embed token wrapper\n        with tf.compat.v1.variable_scope(\"shared\") as shared_abs_scope_name:\n            pass\n\n        embed_tokens = _NoLayerEmbedTokens(self.shared, abs_scope_name=shared_abs_scope_name)\n\n        encoder_config = copy.deepcopy(config)\n        self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name=\"encoder\")\n\n        decoder_config = copy.deepcopy(config)\n        decoder_config.is_decoder = True\n        self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name=\"decoder\")\n\n    def get_input_embeddings(self):\n        return self.shared\n\n    def get_output_embeddings(self):\n        return self.shared\n\n    def get_encoder(self):\n        return self.encoder\n\n    def get_decoder(self):\n        return self.decoder\n\n    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.T5Config`) and inputs.\n        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):\n            Classification loss (cross entropy).\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):\n            Contains pre-computed key and value hidden-states of the attention blocks.\n            Can be used to speed up sequential decoding (see `decoder_past_key_value_states` input).\n            Note that when using `decoder_past_key_value_states`, the model only outputs the last `prediction_score` of the sequence of shape :obj:`(batch_size, 1, config.vocab_size)`.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.\n\n    Examples::\n\n        from transformers1 import T5Tokenizer, TFT5ForConditionalGeneration\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"Hello, my dog is cute\", return_tensors=\"tf\") 
 # Batch size 1\n        outputs = model(inputs, decoder_input_ids=inputs)\n        prediction_scores = outputs[0]\n\n        tokenizer = T5Tokenizer.from_pretrained('t5-small')\n        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')\n        inputs = tokenizer.encode(\"summarize: Hello, my dog is cute\", return_tensors=\"tf\")  # Batch size 1\n        model.generate(inputs)\n\n        \"\"\"\n\n        if isinstance(inputs, dict):\n            kwargs.update(inputs)\n        else:\n            kwargs[\"inputs\"] = inputs\n\n        # retrieve arguments\n        inputs = kwargs.get(\"inputs\", None)\n        decoder_input_ids = kwargs.get(\"decoder_input_ids\", None)\n        attention_mask = kwargs.get(\"attention_mask\", None)\n        encoder_outputs = kwargs.get(\"encoder_outputs\", None)\n        decoder_attention_mask = kwargs.get(\"decoder_attention_mask\", None)\n        decoder_past_key_value_states = kwargs.get(\"decoder_past_key_value_states\", None)\n        use_cache = kwargs.get(\"use_cache\", True)\n        inputs_embeds = kwargs.get(\"inputs_embeds\", None)\n        decoder_inputs_embeds = kwargs.get(\"decoder_inputs_embeds\", None)\n        head_mask = kwargs.get(\"head_mask\", None)\n\n        # Encode if needed (training, first prediction pass)\n        if encoder_outputs is None:\n            # Convert encoder inputs in embeddings if needed\n            encoder_outputs = self.encoder(\n                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,\n            )\n\n        hidden_states = encoder_outputs[0]\n\n        # If decoding with past key value states, only the last tokens\n        # should be given as an input\n        if decoder_past_key_value_states is not None:\n            if decoder_input_ids is not None:\n                decoder_input_ids = decoder_input_ids[:, -1:]\n            if decoder_inputs_embeds is not None:\n                decoder_inputs_embeds = decoder_inputs_embeds[:, -1:]\n\n        # Decode\n        decoder_outputs = self.decoder(\n            decoder_input_ids,\n            attention_mask=decoder_attention_mask,\n            inputs_embeds=decoder_inputs_embeds,\n            past_key_value_states=decoder_past_key_value_states,\n            encoder_hidden_states=hidden_states,\n            encoder_attention_mask=attention_mask,\n            head_mask=head_mask,\n            use_cache=use_cache,\n        )\n\n        # insert decoder past at right place\n        # to speed up decoding\n        if use_cache is True:\n            past = ((encoder_outputs, decoder_outputs[1]),)\n            decoder_outputs = decoder_outputs[:1] + past + decoder_outputs[2:]\n\n        sequence_output = decoder_outputs[0] * (self.model_dim ** -0.5)\n        embed_tokens = self.get_output_embeddings()\n        lm_logits = embed_tokens(sequence_output, mode=\"linear\")\n        decoder_outputs = (lm_logits,) + decoder_outputs[1:]\n\n        return decoder_outputs + encoder_outputs\n\n    def prepare_inputs_for_generation(self, inputs, past, attention_mask, use_cache, **kwargs):\n        assert past is not None, \"past has to be defined for encoder_outputs\"\n\n        # first step\n        if len(past) < 2:\n            encoder_outputs, decoder_past_key_value_states = past, None\n        else:\n            encoder_outputs, decoder_past_key_value_states = past[0], past[1]\n\n        return {\n            \"inputs\": None,  # inputs don't have to be defined, but still need to be passed to make 
Keras.layer.__call__ happy\n            \"decoder_input_ids\": inputs,  # inputs are the decoder_input_ids\n            \"decoder_past_key_value_states\": decoder_past_key_value_states,\n            \"encoder_outputs\": encoder_outputs,\n            \"attention_mask\": attention_mask,\n            \"use_cache\": use_cache,\n        }\n\n    def _reorder_cache(self, past, beam_idx):\n        # if decoder past is not included in output\n        # speedy decoding is disabled and no need to reorder\n\n        if len(past) < 2:\n            logger.warning(\"You might want to consider setting `use_cache=True` to speed up decoding\")\n            return past\n\n        decoder_past = past[1]\n        past = (past[0],)\n        reordered_decoder_past = ()\n\n        for layer_past_states in decoder_past:\n            # get the correct batch idx from layer past batch dim\n            # batch dim of `past` is at 2nd position\n            reordered_layer_past_states = ()\n            for layer_past_state in layer_past_states:\n                # need to set correct `past` for each of the four key / value states\n                reordered_layer_past_states = reordered_layer_past_states + (tf.gather(layer_past_state, beam_idx),)\n\n            assert shape_list(reordered_layer_past_states[0]) == shape_list(layer_past_states[0])\n            assert len(reordered_layer_past_states) == len(layer_past_states)\n\n            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)\n        return past + (reordered_decoder_past,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 Transformer XL model.\n\"\"\"\n\n\nimport logging\n\nimport tensorflow as tf\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask\nfrom .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\nclass TFPositionalEmbedding(tf.keras.layers.Layer):\n    def __init__(self, demb, **kwargs):\n        super().__init__(**kwargs)\n\n        self.inv_freq = 1 / (10000 ** (tf.range(0, demb, 2.0) / demb))\n\n    def call(self, pos_seq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,j->ij\", pos_seq, self.inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], -1)\n\n        if bsz is not None:\n            return tf.tile(pos_emb[:, None, :], [1, bsz, 1])\n        else:\n            return pos_emb[:, None, :]\n\n\nclass TFPositionwiseFF(tf.keras.layers.Layer):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5, init_std=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.layer_1 = tf.keras.layers.Dense(\n            d_inner, kernel_initializer=get_initializer(init_std), activation=tf.nn.relu, name=\"CoreNet_._0\"\n        )\n        self.drop_1 = tf.keras.layers.Dropout(dropout)\n        self.layer_2 = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(init_std), name=\"CoreNet_._3\")\n        self.drop_2 = tf.keras.layers.Dropout(dropout)\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.pre_lnorm = pre_lnorm\n\n    def call(self, inp, training=False):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.layer_norm(inp)\n            core_out = self.layer_1(core_out)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = self.drop_2(core_out, training=training)\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.layer_1(inp)\n            core_out = self.drop_1(core_out, training=training)\n            core_out = self.layer_2(core_out)\n            core_out = 
self.drop_2(core_out, training=training)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass TFRelPartialLearnableMultiHeadAttn(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = tf.keras.layers.Dense(\n            3 * n_head * d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"qkv_net\"\n        )\n\n        self.drop = tf.keras.layers.Dropout(dropout)\n        self.dropatt = tf.keras.layers.Dropout(dropatt)\n        self.o_net = tf.keras.layers.Dense(\n            d_model, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"o_net\"\n        )\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name=\"layer_norm\")\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is not None and r_w_bias is not None:  # Biases are shared\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n        else:\n            self.r_r_bias = None\n            self.r_w_bias = None\n\n        self.r_net = tf.keras.layers.Dense(\n            self.n_head * self.d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name=\"r_net\"\n        )\n\n    def build(self, input_shape):\n        if self.r_r_bias is None or self.r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n        super().build(input_shape)\n\n    def _rel_shift(self, x):\n        x_size = shape_list(x)\n\n        x = tf.pad(x, [[0, 0], [1, 0], [0, 0], [0, 0]])\n        x = tf.reshape(x, [x_size[1] + 1, x_size[0], x_size[2], x_size[3]])\n        x = tf.slice(x, [1, 0, 0, 0], [-1, -1, -1, -1])\n        x = tf.reshape(x, x_size)\n\n        return x\n\n    def call(self, inputs, training=False):\n        w, r, attn_mask, mems, head_mask = inputs\n        qlen, rlen, bsz = shape_list(w)[0], shape_list(r)[0], shape_list(w)[1]\n\n        if mems is not None:\n            cat = tf.concat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, axis=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, 
axis=-1)\n\n        klen = shape_list(w_head_k)[0]\n\n        w_head_q = tf.reshape(w_head_q, (qlen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_k = tf.reshape(w_head_k, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n        w_head_v = tf.reshape(w_head_v, (klen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head\n\n        r_head_k = tf.reshape(r_head_k, (rlen, self.n_head, self.d_head))  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = tf.einsum(\"ibnd,jbnd->ijbn\", rw_head_q, w_head_k)  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = tf.einsum(\"ibnd,jnd->ijbn\", rr_head_q, r_head_k)  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score = attn_score * self.scale\n\n        # compute attention probability\n        if attn_mask is not None:\n            attn_mask_t = attn_mask[:, :, None, None]\n            attn_score = attn_score * (1 - attn_mask_t) - 1e30 * attn_mask_t\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n        attn_prob = self.dropatt(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, w_head_v)\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec_sizes = shape_list(attn_vec)\n        attn_vec = tf.reshape(attn_vec, (attn_vec_sizes[0], attn_vec_sizes[1], self.n_head * self.d_head))\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out, training=training)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass TFRelPartialLearnableDecoderLayer(tf.keras.layers.Layer):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        d_inner,\n        dropout,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        dropatt=0.0,\n        pre_lnorm=False,\n        r_w_bias=None,\n        r_r_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n        init_std=0.02,\n        **kwargs\n    ):\n        super().__init__(**kwargs)\n\n        self.dec_attn = TFRelPartialLearnableMultiHeadAttn(\n            n_head,\n            d_model,\n            d_head,\n            dropout,\n            tgt_len=tgt_len,\n            ext_len=ext_len,\n            mem_len=mem_len,\n            dropatt=dropatt,\n            pre_lnorm=pre_lnorm,\n            r_w_bias=r_w_bias,\n            r_r_bias=r_r_bias,\n            init_std=init_std,\n            output_attentions=output_attentions,\n            layer_norm_epsilon=layer_norm_epsilon,\n            name=\"dec_attn\",\n        )\n        self.pos_ff = TFPositionwiseFF(\n            d_model,\n            d_inner,\n            dropout,\n            pre_lnorm=pre_lnorm,\n            init_std=init_std,\n            layer_norm_epsilon=layer_norm_epsilon,\n            
name=\"pos_ff\",\n        )\n\n    def call(self, inputs, training=False):\n        dec_inp, r, dec_attn_mask, mems, head_mask = inputs\n        attn_outputs = self.dec_attn([dec_inp, r, dec_attn_mask, mems, head_mask], training=training)\n        ff_output = self.pos_ff(attn_outputs[0], training=training)\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass TFAdaptiveEmbedding(tf.keras.layers.Layer):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, init_std=0.02, sample_softmax=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.init_std = init_std\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = []\n        self.emb_projs = []\n        if div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(\n                    tf.keras.layers.Embedding(\n                        r_idx - l_idx,\n                        d_emb_i,\n                        embeddings_initializer=get_initializer(init_std),\n                        name=\"emb_layers_._{}\".format(i),\n                    )\n                )\n\n    def build(self, input_shape):\n        for i in range(len(self.cutoffs)):\n            d_emb_i = self.d_embed // (self.div_val ** i)\n            self.emb_projs.append(\n                self.add_weight(\n                    shape=(d_emb_i, self.d_proj),\n                    initializer=get_initializer(self.init_std),\n                    trainable=True,\n                    name=\"emb_projs_._{}\".format(i),\n                )\n            )\n        super().build(input_shape)\n\n    def call(self, inp):\n        if self.div_val == 1:\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n        else:\n            inp_flat = tf.reshape(inp, (-1,))\n            emb_flat = tf.zeros([shape_list(inp_flat)[0], self.d_proj])\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n\n                inp_i = tf.boolean_mask(inp_flat, mask_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = tf.einsum(\"id,de->ie\", emb_i, self.emb_projs[i])\n\n                mask_idx = tf.cast(tf.where(mask_i), dtype=tf.int64)\n                emb_flat += tf.scatter_nd(mask_idx, emb_i, tf.cast(shape_list(emb_flat), dtype=tf.int64))\n\n            embed_shape = shape_list(inp) + [self.d_proj]\n            embed = tf.reshape(emb_flat, embed_shape)\n\n        embed *= self.emb_scale\n\n        return embed\n\n\n@keras_serializable\nclass TFTransfoXLMainLayer(tf.keras.layers.Layer):\n    config_class = TransfoXLConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        
self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.untie_r = config.untie_r\n\n        self.word_emb = TFAdaptiveEmbedding(\n            config.vocab_size,\n            config.d_embed,\n            config.d_model,\n            config.cutoffs,\n            div_val=config.div_val,\n            init_std=config.init_std,\n            name=\"word_emb\",\n        )\n\n        self.drop = tf.keras.layers.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        self.layers = []\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    TFRelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if self.untie_r else self.r_w_bias,\n                        r_r_bias=None if self.untie_r else self.r_r_bias,\n                        output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                        init_std=config.init_std,\n                        name=\"layers_._{}\".format(i),\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = TFPositionalEmbedding(self.d_model, name=\"pos_emb\")\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n    def build(self, input_shape):\n        if not self.untie_r:\n            self.r_w_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n            )\n            self.r_r_bias = self.add_weight(\n                shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n            )\n        super().build(input_shape)\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        return self.word_emb\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        raise NotImplementedError\n\n    def init_mems(self, bsz):\n        if self.mem_len 
> 0:\n            mems = []\n            for i in range(self.n_layer):\n                empty = tf.zeros([self.mem_len, bsz, self.d_model])\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        new_mems = []\n        end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n        beg_idx = max(0, end_idx - self.mem_len)\n        for i in range(len(hids)):\n\n            cat = tf.concat([mems[i], hids[i]], axis=0)\n            tf.stop_gradient(cat)\n            new_mems.append(cat[beg_idx:end_idx])\n\n        return new_mems\n\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, training=False):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 4, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = shape_list(mems[0])[0] if mems is not None else 0\n        
klen = mlen + qlen\n\n        attn_mask = tf.ones([qlen, qlen])\n        mask_u = tf.linalg.band_part(attn_mask, 0, -1)\n        mask_dia = tf.linalg.band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen])\n        dec_attn_mask = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.linalg.band_part(attn_mask, -1, 0)\n            dec_attn_mask = tf.concat([dec_attn_mask[:, :qlen] + mask_l - mask_dia, dec_attn_mask[:, qlen:]], 1)\n        # ::: PyTorch masking code for reference :::\n        # if self.same_length:\n        #     all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n        #     mask_len = klen - self.mem_len\n        #     if mask_len > 0:\n        #         mask_shift_len = qlen - mask_len\n        #     else:\n        #         mask_shift_len = qlen\n        #     dec_attn_mask = (torch.triu(all_ones, 1+mlen)\n        #             + torch.tril(all_ones, -mask_shift_len))[:, :, None] # -1\n        # else:\n        #     dec_attn_mask = torch.triu(\n        #         word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1+mlen)[:,:,None]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = tf.range(klen - 1, -1, -1.0)\n            if self.clamp_len > 0:\n                pos_seq = tf.minimum(pos_seq, self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb, training=training)\n            pos_emb = self.drop(pos_emb, training=training)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer([core_out, pos_emb, dec_attn_mask, mems_i, head_mask[i]], training=training)\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out, training=training)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [tf.transpose(core_out, perm=(1, 0, 2)), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(tf.transpose(t, perm=(1, 0, 2)) for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs.append(attentions)\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\nclass TFTransfoXLPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = TransfoXLConfig\n    base_model_prefix = \"transformer\"\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputing raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFTransfoXLLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TFTransfoXLLMHeadModel(TFTransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TFTransfoXLMainLayer(config, name=\"transformer\")\n        self.sample_softmax = config.sample_softmax\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = TFAdaptiveSoftmaxMask(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val, name=\"crit\"\n        )\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if len(self.crit.out_layers) > 0:\n            return self.crit.out_layers[-1]\n        return None\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def call(self, inputs, mems=None, head_mask=None, inputs_embeds=None, labels=None, training=False):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import TransfoXLTokenizer, TFTransfoXLLMHeadModel\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TFTransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            mems = inputs[1] if len(inputs) > 1 else mems\n            head_mask = inputs[2] if len(inputs) > 2 else head_mask\n            inputs_embeds = inputs[3] if len(inputs) > 3 else inputs_embeds\n            labels = inputs[4] if len(inputs) > 4 else labels\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        elif isinstance(inputs, dict):\n            input_ids = inputs.get(\"input_ids\")\n            mems = inputs.get(\"mems\", mems)\n            head_mask = 
inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            labels = inputs.get(\"labels\", labels)\n            assert len(inputs) <= 5, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None:\n            bsz, tgt_len = shape_list(input_ids)[:2]\n        else:\n            bsz, tgt_len = shape_list(inputs_embeds)[:2]\n\n        transformer_outputs = self.transformer([input_ids, mems, head_mask, inputs_embeds], training=training)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit([pred_hid, labels], training=training)\n        outputs = [softmax_output] + outputs\n\n        return outputs  # logits, new_mems, (all hidden states), (all attentions)\n\n    def prepare_inputs_for_generation(self, inputs, past, **model_kwargs):\n        inputs = {\"inputs\": inputs}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" A TF 2.0 Adaptive Softmax for Transformer XL model.\n\"\"\"\n\n\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import shape_list\n\n\nclass TFAdaptiveSoftmaxMask(tf.keras.layers.Layer):\n    def __init__(self, vocab_size, d_embed, d_proj, cutoffs, div_val=1, keep_order=False, **kwargs):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [vocab_size]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n        self.keep_order = keep_order\n\n        self.out_layers = []\n        self.out_projs = []\n\n    def build(self, input_shape):\n        if self.n_clusters > 0:\n            self.cluster_weight = self.add_weight(\n                shape=(self.n_clusters, self.d_embed), initializer=\"zeros\", trainable=True, name=\"cluster_weight\"\n            )\n            self.cluster_bias = self.add_weight(\n                shape=(self.n_clusters,), initializer=\"zeros\", trainable=True, name=\"cluster_bias\"\n            )\n\n        if self.div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if self.d_proj != self.d_embed:\n                    weight = self.add_weight(\n                        shape=(self.d_embed, self.d_proj),\n                        initializer=\"zeros\",\n                        trainable=True,\n                        name=\"out_projs_._{}\".format(i),\n                    )\n                    self.out_projs.append(weight)\n                else:\n                    self.out_projs.append(None)\n                weight = self.add_weight(\n                    shape=(self.vocab_size, self.d_embed,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(self.vocab_size,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = self.d_embed // (self.div_val ** i)\n\n                weight = self.add_weight(\n                    shape=(d_emb_i, self.d_proj), initializer=\"zeros\", trainable=True, name=\"out_projs_._{}\".format(i)\n                )\n                
self.out_projs.append(weight)\n                weight = self.add_weight(\n                    shape=(r_idx - l_idx, d_emb_i,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._weight\".format(i),\n                )\n                bias = self.add_weight(\n                    shape=(r_idx - l_idx,),\n                    initializer=\"zeros\",\n                    trainable=True,\n                    name=\"out_layers_._{}_._bias\".format(i),\n                )\n                self.out_layers.append((weight, bias))\n        super().build(input_shape)\n\n    @staticmethod\n    def _logit(x, W, b, proj=None):\n        y = x\n        if proj is not None:\n            y = tf.einsum(\"ibd,ed->ibe\", y, proj)\n        return tf.einsum(\"ibd,nd->ibn\", y, W) + b\n\n    @staticmethod\n    def _gather_logprob(logprob, target):\n        lp_size = shape_list(logprob)\n        r = tf.range(lp_size[0])\n        idx = tf.stack([r, target], 1)\n        return tf.gather_nd(logprob, idx)\n\n    def call(self, inputs, return_mean=True, training=False):\n        hidden, target = inputs\n        head_logprob = 0\n        if self.n_clusters == 0:\n            output = self._logit(hidden, self.out_layers[0][0], self.out_layers[0][1], self.out_projs[0])\n            if target is not None:\n                loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=output)\n            out = tf.nn.log_softmax(output, axis=-1)\n        else:\n            hidden_sizes = shape_list(hidden)\n            out = []\n            loss = tf.zeros(hidden_sizes[:2], dtype=tf.float32)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                if target is not None:\n                    mask = (target >= l_idx) & (target < r_idx)\n                    mask_idx = tf.where(mask)\n                    cur_target = tf.boolean_mask(target, mask) - l_idx\n\n                if self.div_val == 1:\n                    cur_W = self.out_layers[0][0][l_idx:r_idx]\n                    cur_b = self.out_layers[0][1][l_idx:r_idx]\n                else:\n                    cur_W = self.out_layers[i][0]\n                    cur_b = self.out_layers[i][1]\n\n                if i == 0:\n                    cur_W = tf.concat([cur_W, self.cluster_weight], 0)\n                    cur_b = tf.concat([cur_b, self.cluster_bias], 0)\n\n                    head_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[0])\n                    head_logprob = tf.nn.log_softmax(head_logit)\n                    out.append(head_logprob[..., : self.cutoffs[0]])\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_logprob = self._gather_logprob(cur_head_logprob, cur_target)\n                else:\n                    tail_logit = self._logit(hidden, cur_W, cur_b, self.out_projs[i])\n                    tail_logprob = tf.nn.log_softmax(tail_logit)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    logprob_i = head_logprob[..., cluster_prob_idx, None] + tail_logprob\n                    out.append(logprob_i)\n                    if target is not None:\n                        cur_head_logprob = tf.boolean_mask(head_logprob, mask)\n                        cur_tail_logprob = tf.boolean_mask(tail_logprob, mask)\n               
         cur_logprob = self._gather_logprob(cur_tail_logprob, cur_target)\n                        cur_logprob += cur_head_logprob[:, self.cutoff_ends[1] + i - 1]\n                if target is not None:\n                    loss += tf.scatter_nd(mask_idx, -cur_logprob, tf.cast(shape_list(loss), dtype=tf.int64))\n            out = tf.concat(out, axis=-1)\n\n        if target is not None:\n            if return_mean:\n                loss = tf.reduce_mean(loss)\n            # Add the training-time loss value to the layer using `self.add_loss()`.\n            self.add_loss(loss)\n\n            # Log the loss as a metric (we could log arbitrary metrics,\n            # including different metrics for training and inference.\n            self.add_metric(loss, name=self.name, aggregation=\"mean\" if return_mean else \"\")\n\n        return out\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"TF general model utils.\"\"\"\nimport functools\nimport logging\nimport os\n\nimport h5py\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.python.keras.saving import hdf5_format\n\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import DUMMY_INPUTS, TF2_WEIGHTS_NAME, WEIGHTS_NAME, cached_path, hf_bucket_url, is_remote_url\nfrom .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFModelUtilsMixin:\n    \"\"\"\n    A few utilities for `tf.keras.Model`s, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the model.\n        \"\"\"\n        if only_trainable:\n            return int(sum(np.prod(w.shape.as_list()) for w in self.trainable_variables))\n        else:\n            return self.count_params()\n\n\ndef keras_serializable(cls):\n    \"\"\"\n    Decorate a Keras Layer class to support Keras serialization.\n\n    This is done by:\n    1. adding a `transformers_config` dict to the Keras config dictionary in `get_config` (called by Keras at\n       serialization time\n    2. wrapping `__init__` to accept that `transformers_config` dict (passed by Keras at deserialization time) and\n       convert it to a config object for the actual layer initializer\n    3. 
registering the class as a custom object in Keras (if the Tensorflow version supports this), so that it does\n       not need to be supplied in `custom_objects` in the call to `tf.keras.models.load_model`\n\n    :param cls: a tf.keras.layers.Layers subclass that accepts a `config` argument to its initializer (typically a\n                `TF*MainLayer` class in this project)\n    :return: the same class object, with modifications for Keras deserialization.\n    \"\"\"\n    initializer = cls.__init__\n\n    config_class = getattr(cls, \"config_class\", None)\n    if config_class is None:\n        raise AttributeError(\"Must set `config_class` to use @keras_serializable\")\n\n    @functools.wraps(initializer)\n    def wrapped_init(self, *args, **kwargs):\n        transformers_config = kwargs.pop(\"transformers_config\", None)\n        config = args[0] if args and isinstance(args[0], PretrainedConfig) else kwargs.get(\"config\", None)\n        if config is not None and transformers_config is not None:\n            raise ValueError(\"Must pass either `config` or `transformers_config`, not both\")\n        elif config is not None:\n            # normal layer construction, call with unchanged args (config is already in there)\n            initializer(self, *args, **kwargs)\n        elif transformers_config is not None:\n            # Keras deserialization, convert dict to config\n            config = config_class.from_dict(transformers_config)\n            initializer(self, config, *args, **kwargs)\n        else:\n            raise ValueError(\"Must pass either `config` (PretrainedConfig) or `transformers_config` (dict)\")\n        self._transformers_config = config\n\n    cls.__init__ = wrapped_init\n\n    if not hasattr(cls, \"get_config\"):\n        raise TypeError(\"Only use @keras_serializable on tf.keras.layers.Layer subclasses\")\n    if hasattr(cls.get_config, \"_is_default\"):\n\n        def get_config(self):\n            cfg = super(cls, self).get_config()\n            cfg[\"transformers_config\"] = self._transformers_config.to_dict()\n            return cfg\n\n        cls.get_config = get_config\n\n    cls._keras_serializable = True\n    if hasattr(tf.keras.utils, \"register_keras_serializable\"):\n        cls = tf.keras.utils.register_keras_serializable()(cls)\n    return cls\n\n\nclass TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):\n    r\"\"\" Base class for all TF models.\n\n        :class:`~transformers1.TFPreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top 
of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to build the network.\n\n        Returns:\n            tf.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": tf.constant(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. \"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`tf.keras.layers.Layer`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def _get_resized_embeddings(self, old_embeddings, new_num_tokens=None):\n        \"\"\" Build a resized Embedding Variable from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``tf.Variable``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        # if new_num_tokens is None:\n        #     return old_embeddings\n\n        # old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        # if old_num_tokens == new_num_tokens:\n        #     return old_embeddings\n\n        # # Build new embeddings\n        # new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        # new_embeddings.to(old_embeddings.weight.device)\n\n        # # initialize all new embeddings (in particular added tokens)\n        # self._init_weights(new_embeddings)\n\n        # # Copy token embeddings from the previous weights\n        # num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        # new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        # return new_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens=None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights 
embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``tf.Variable`` Module of the model.\n\n        Return: ``tf.Variable``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        raise NotImplementedError\n\n    def prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n        \"\"\"\n        raise NotImplementedError\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the :func:`~transformers1.PreTrainedModel.from_pretrained` class method.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Save configuration file\n        self.config.save_pretrained(save_directory)\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, TF2_WEIGHTS_NAME)\n        self.save_weights(output_model_file)\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained TF 2.0 model from a pre-trained model configuration.\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n                - a path or url to a `PyTorch state_dict save file` (e.g. `./pt_model/pytorch_model.bin`). In this case, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument. 
This loading path is slower than converting the PyTorch checkpoint into a TensorFlow model using the provided conversion scripts and loading the TensorFlow model afterwards.\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaining positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                    - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                    - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:\n\n                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.\n                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            from_pt: (`optional`) boolean, default False:\n                Load the model weights from a PyTorch state_dict save file (see docstring of pretrained_model_name_or_path argument).\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete an incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it has been loaded) and initialize the model. (e.g. ``output_attention=True``). Behaves differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. 
Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_pt=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_pt = kwargs.pop(\"from_pt\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_pt` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME], pretrained_model_name_or_path\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(WEIGHTS_NAME if from_pt else TF2_WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n 
                   cache_dir=cache_dir,\n                    force_download=force_download,\n                    resume_download=resume_download,\n                    proxies=proxies,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {TF2_WEIGHTS_NAME}, {WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if from_pt:\n            # Load from a PyTorch checkpoint\n            return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file, allow_missing_keys=True)\n\n        model(model.dummy_inputs, training=False)  # build the network with dummy inputs\n\n        assert os.path.isfile(resolved_archive_file), \"Error retrieving file {}\".format(resolved_archive_file)\n        # 'by_name' allow us to do transfer learning by skipping/adding layers\n        # see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357\n        try:\n            model.load_weights(resolved_archive_file, by_name=True)\n        except OSError:\n            raise OSError(\n                \"Unable to load weights from h5 file. \"\n                \"If you tried to load a TF 2.0 model from a PyTorch checkpoint, please set from_pt=True. 
\"\n            )\n\n        model(model.dummy_inputs, training=False)  # Make sure restore ops are run\n\n        # Check if the models are the same to output loading informations\n        with h5py.File(resolved_archive_file, \"r\") as f:\n            if \"layer_names\" not in f.attrs and \"model_weights\" in f:\n                f = f[\"model_weights\"]\n            hdf5_layer_names = set(hdf5_format.load_attributes_from_hdf5_group(f, \"layer_names\"))\n        model_layer_names = set(layer.name for layer in model.layers)\n        missing_keys = list(model_layer_names - hdf5_layer_names)\n        unexpected_keys = list(hdf5_layer_names - model_layer_names)\n        error_msgs = []\n\n        if len(missing_keys) > 0:\n            logger.info(\n                \"Layers of {} not initialized from pretrained model: {}\".format(model.__class__.__name__, missing_keys)\n            )\n        if len(unexpected_keys) > 0:\n            logger.info(\n                \"Layers from pretrained model not used in {}: {}\".format(model.__class__.__name__, unexpected_keys)\n            )\n        if len(error_msgs) > 0:\n            raise RuntimeError(\n                \"Error(s) in loading weights for {}:\\n\\t{}\".format(model.__class__.__name__, \"\\n\\t\".join(error_msgs))\n            )\n        if output_loading_info:\n            loading_info = {\"missing_keys\": missing_keys, \"unexpected_keys\": unexpected_keys, \"error_msgs\": error_msgs}\n            return model, loading_info\n\n        return model\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        return {\"inputs\": inputs}\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def generate(\n        self,\n        input_ids=None,\n        max_length=None,\n        min_length=None,\n        do_sample=None,\n        early_stopping=None,\n        num_beams=None,\n        temperature=None,\n        top_k=None,\n        top_p=None,\n        repetition_penalty=None,\n        bad_words_ids=None,\n        bos_token_id=None,\n        pad_token_id=None,\n        eos_token_id=None,\n        length_penalty=None,\n        no_repeat_ngram_size=None,\n        num_return_sequences=None,\n        attention_mask=None,\n        decoder_start_token_id=None,\n        use_cache=None,\n    ):\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy or penalized greedy decoding, sampling with top-k or nucleus sampling\n        and beam-search.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. _`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `tf.Tensor` of `dtype=tf.int32` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `tf.Tensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between 1 and infinity. 
Default to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.\n\n            temperature: (`optional`) float\n                The value used to modulate the next token probabilities. Must be strictly positive. Default to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of the highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.\n\n            bos_token_id: (`optional`) int\n                Beginning of sentence token if no prompt is provided. Default to the specific model's bos_token_id or None if it does not exist.\n\n            pad_token_id: (`optional`) int\n                Pad token. Defaults to pad_token_id as defined in the model's config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to eos_token_id as defined in the model's config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Default to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. Default to 1.\n\n            attention_mask (`optional`) obj: `tf.Tensor` with `dtype=tf.int32` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to the model. 
Defaults to `True`.\n\n        Return:\n\n            output: `tf.Tensor` of `dtype=tf.int32` shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = TFAutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='tf')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be 
generated\"\"\"\n\n        # We cannot generate if the model does not have an LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have an LM Head. \"\n                \"Please use another model class (e.g. `TFOpenAIGPTLMHeadModel`, `TFXLNetLMHeadModel`, `TFGPT2LMHeadModel`, `TFCTRLLMHeadModel`, `TFT5ForConditionalGeneration`, `TFTransfoXLLMHeadModel`)\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = shape_list(input_ids)[0]  # overridden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, \"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), 
\"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictely positive.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictely positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = tf.fill((batch_size, 1), bos_token_id)\n        else:\n            assert len(shape_list(input_ids)) == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids.numpy()):\n            attention_mask = tf.cast(tf.math.not_equal(input_ids, pad_token_id), dtype=tf.int32)\n        elif attention_mask is None:\n            attention_mask = tf.ones_like(input_ids)\n\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        cur_len = shape_list(input_ids)[1]\n        vocab_size = self.config.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = shape_list(input_ids)[-1]\n            input_ids = tf.broadcast_to(\n                tf.expand_dims(input_ids, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            attention_mask = tf.broadcast_to(\n                tf.expand_dims(attention_mask, 1), (batch_size, effective_batch_mult * num_beams, input_ids_len)\n            )\n            input_ids = tf.reshape(\n                input_ids, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = tf.reshape(\n                attention_mask, (effective_batch_size * num_beams, input_ids_len)\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n\n            # create empty decoder_input_ids\n            input_ids = tf.ones((effective_batch_size * num_beams, 1), dtype=tf.int32,) * decoder_start_token_id\n            cur_len = 1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = tf.reshape(\n                
tf.repeat(tf.expand_dims(tf.range(batch_size), -1), repeats=num_beams * effective_batch_mult, axis=1),\n                shape=(-1,),\n            )\n            # expand encoder_outputs\n            encoder_outputs = (tf.gather(encoder_outputs[0], expanded_batch_idxs, axis=0), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = shape_list(input_ids)[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                eos_token_id=eos_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                batch_size=effective_batch_size,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequences are generated independently.\n        \"\"\"\n\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = tf.ones_like(input_ids[:, 0])\n        sent_lengths = tf.ones_like(input_ids[:, 0]) * max_length\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n         
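   # One decoding step: run the model on the current ids, penalize repeats and banned tokens, then sample or take the argmax for the next token\n         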
   model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [batch_size, vocab_size])\n\n                next_token_logits = set_tensor_by_indices_to_value(\n                    next_token_logits, eos_token_indices_mask, -float(\"inf\")\n                )\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = tf_top_k_top_p_filtering(next_token_logits, 
top_k=top_k, top_p=top_p)\n                # Sample\n                next_token = tf.squeeze(\n                    tf.random.categorical(next_token_logits, dtype=tf.int32, num_samples=1), axis=1\n                )\n            else:\n                # Greedy decoding\n                next_token = tf.math.argmax(next_token_logits, axis=-1, output_type=tf.int32)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exist\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = tf.concat([input_ids, tf.expand_dims(tokens_to_add, -1)], 1)\n            cur_len = cur_len + 1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = tf.math.multiply(\n                    unfinished_sents, tf.cast(eos_in_sents, tf.int32)\n                )\n                sent_lengths = (\n                    sent_lengths * (1 - is_sents_unfinished_and_token_to_add_is_eos)\n                    + cur_len * is_sents_unfinished_and_token_to_add_is_eos\n                )\n\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents -= is_sents_unfinished_and_token_to_add_is_eos\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if tf.math.reduce_max(unfinished_sents) == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        min_sent_length = tf.math.reduce_min(sent_lengths)\n        max_sent_length = tf.math.reduce_max(sent_lengths)\n        if min_sent_length != max_sent_length:\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            padding = tf.ones([batch_size, max_sent_length.numpy()], dtype=tf.int32) * pad_token_id\n\n            # create length masks for tf.where operation\n            broad_casted_sent_lengths = tf.broadcast_to(\n                tf.expand_dims(sent_lengths, -1), [batch_size, max_sent_length]\n            )\n            broad_casted_range = tf.transpose(\n                tf.broadcast_to(tf.expand_dims(tf.range(max_sent_length), -1), [max_sent_length, batch_size])\n            )\n\n            decoded = tf.where(broad_casted_range < broad_casted_sent_lengths, input_ids, padding)\n        else:\n            decoded = input_ids\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        
bos_token_id,\n        pad_token_id,\n        decoder_start_token_id,\n        eos_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores_begin = tf.zeros((batch_size, 1), dtype=tf.float32)\n            beam_scores_end = tf.ones((batch_size, num_beams - 1), dtype=tf.float32) * (-1e9)\n            beam_scores = tf.concat([beam_scores_begin, beam_scores_end], -1)\n        else:\n            beam_scores = tf.zeros((batch_size, num_beams), dtype=tf.float32)\n\n        beam_scores = tf.reshape(beam_scores, (batch_size * num_beams,))\n\n        # cache compute states\n        past = encoder_outputs\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache\n            )\n            outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                next_token_logits_penalties = _create_next_token_logits_penalties(\n                    input_ids, next_token_logits, repetition_penalty\n                )\n                next_token_logits = tf.math.multiply(next_token_logits, next_token_logits_penalties)\n\n            # Temperature (higher temperature => more likely to sample low probability tokens)\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            #             calculate log softmax score\n            scores = tf.nn.log_softmax(next_token_logits, axis=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                # create eos_token_id boolean mask\n                num_batch_hypotheses = batch_size * num_beams\n\n                is_token_logit_eos_token = tf.convert_to_tensor(\n                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool\n                )\n                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [num_batch_hypotheses, vocab_size])\n\n                scores = set_tensor_by_indices_to_value(scores, eos_token_indices_mask, -float(\"inf\"))\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: 
https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                num_batch_hypotheses = batch_size * num_beams\n                banned_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                # create banned_tokens boolean mask\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                banned_tokens_indices_mask = []\n                for banned_tokens_slice in banned_tokens:\n                    banned_tokens_indices_mask.append(\n                        [True if token in banned_tokens_slice else False for token in range(vocab_size)]\n                    )\n\n                scores = set_tensor_by_indices_to_value(\n                    scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float(\"inf\")\n                )\n\n            assert shape_list(scores) == [batch_size * num_beams, vocab_size]\n\n            if do_sample:\n                _scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # Top-p/top-k filtering\n                _scores = tf_top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                _scores = tf.reshape(_scores, (batch_size, num_beams * vocab_size))\n\n                next_tokens = tf.random.categorical(\n                    _scores, dtype=tf.int32, num_samples=2 * num_beams\n                )  # (batch_size, 2 * num_beams)\n                # Compute next scores\n                next_scores = tf.gather(_scores, next_tokens, batch_dims=1)  # (batch_size, 2 * num_beams)\n\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores_indices = tf.argsort(next_scores, direction=\"DESCENDING\", axis=1)\n                next_scores = tf.gather(next_scores, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n                next_tokens = tf.gather(next_tokens, next_scores_indices, batch_dims=1)  # (batch_size, num_beams * 2)\n            else:\n                # Add the log prob of the new beams to the log prob of the beginning of the sequence (sum of logs == log of the product)\n                next_scores = scores + tf.broadcast_to(\n                    beam_scores[:, None], (batch_size * num_beams, vocab_size)\n                )  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                
next_scores = tf.reshape(\n                    next_scores, (batch_size, num_beams * vocab_size)\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = tf.math.top_k(next_scores, k=2 * num_beams, sorted=True)\n\n            assert shape_list(next_scores) == shape_list(next_tokens) == [batch_size, 2 * num_beams]\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.numpy() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            tf.identity(input_ids[effective_beam_id]), beam_token_score.numpy()\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n                    tf.reduce_max(next_scores[batch_idx]).numpy(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = tf.convert_to_tensor([x[0] for 
x in next_batch_beam], dtype=tf.float32)\n            beam_tokens = tf.convert_to_tensor([x[1] for x in next_batch_beam], dtype=tf.int32)\n            beam_idx = tf.convert_to_tensor([x[2] for x in next_batch_beam], dtype=tf.int32)\n\n            # re-order batch and update current length\n            input_ids = tf.stack([tf.identity(input_ids[x, :]) for x in beam_idx])\n            input_ids = tf.concat([input_ids, tf.expand_dims(beam_tokens, 1)], axis=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = tf.concat(\n                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            # Add all open beam hypothesis to generated_hyps\n            if done[batch_idx]:\n                continue\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).numpy().item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert tf.reduce_all(\n                    next_scores[batch_idx, :num_beams] == tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], tf.reshape(beam_scores, (batch_size, num_beams))[batch_idx]\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].numpy().item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths_list = []\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                best_hyp = sorted_hyps.pop()[1]\n                sent_lengths_list.append(len(best_hyp))\n                best.append(best_hyp)\n        assert output_batch_size == len(best), \"Output batch size {} must match output beam hypotheses {}\".format(\n            output_batch_size, len(best)\n        )\n\n        sent_lengths = tf.convert_to_tensor(sent_lengths_list, dtype=tf.int32)\n\n        # shorter batches are filled with pad_token\n        if tf.reduce_min(sent_lengths).numpy() != tf.reduce_max(sent_lengths).numpy():\n            assert pad_token_id is not None, \"`Pad_token_id` 
has to be defined\"\n            sent_max_len = min(tf.reduce_max(sent_lengths).numpy() + 1, max_length)\n            decoded_list = []\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                assert sent_lengths[i] == shape_list(hypo)[0]\n                # if sent_length is max_len do not pad\n                if sent_lengths[i] == sent_max_len:\n                    decoded_slice = hypo\n                else:\n                    # else pad to sent_max_len\n                    num_pad_tokens = sent_max_len - sent_lengths[i]\n                    padding = pad_token_id * tf.ones((num_pad_tokens,), dtype=tf.int32)\n                    decoded_slice = tf.concat([hypo, padding], axis=-1)\n\n                    # finish sentence with EOS token\n                    if sent_lengths[i] < max_length:\n                        decoded_slice = tf.where(\n                            tf.range(sent_max_len, dtype=tf.int32) == sent_lengths[i],\n                            eos_token_id * tf.ones((sent_max_len,), dtype=tf.int32),\n                            decoded_slice,\n                        )\n                # add to list\n                decoded_list.append(decoded_slice)\n\n            decoded = tf.stack(decoded_list)\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = tf.stack(best)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past, beam_idx):\n        return tuple(tf.gather(layer_past, beam_idx, axis=1) for layer_past in past)\n\n\ndef _create_next_token_logits_penalties(input_ids, logits, repetition_penalty):\n    # create logit penalties for already seen input_ids\n    token_penalties = np.ones(shape_list(logits))\n    prev_input_ids = [np.unique(input_id) for input_id in input_ids.numpy()]\n    for i, prev_input_id in enumerate(prev_input_ids):\n        logit_penalized = logits[i].numpy()[prev_input_id]\n        logit_penalties = np.zeros(logit_penalized.shape)\n        # if previous logit score is < 0 then multiply repetition penalty else divide\n        logit_penalties[logit_penalized < 0] = repetition_penalty\n        logit_penalties[logit_penalized > 0] = 1 / repetition_penalty\n        np.put(token_penalties[i], prev_input_id, logit_penalties)\n    return tf.convert_to_tensor(token_penalties, dtype=tf.float32)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):\n    # Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].numpy().tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].numpy().tolist())\n        return 
generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids, bad_words_ids):\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.numpy().tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef tf_top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float(\"Inf\"), min_tokens_to_keep=1):\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. 
(http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    logits_shape = shape_list(logits)\n\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits_shape[-1])  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < tf.math.top_k(logits, k=top_k)[0][..., -1, None]\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n\n    if top_p < 1.0:\n        sorted_indices = tf.argsort(logits, direction=\"DESCENDING\")\n        sorted_logits = tf.gather(\n            logits, sorted_indices, axis=-1, batch_dims=1\n        )  # expects logits to be of dim (batch_size, vocab_size)\n\n        cumulative_probs = tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove = tf.concat(\n                [\n                    tf.zeros_like(sorted_indices_to_remove[:, :min_tokens_to_keep]),\n                    sorted_indices_to_remove[:, min_tokens_to_keep:],\n                ],\n                -1,\n            )\n\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove = tf.roll(sorted_indices_to_remove, 1, axis=-1)\n        sorted_indices_to_remove = tf.concat(\n            [tf.zeros_like(sorted_indices_to_remove[:, :1]), sorted_indices_to_remove[:, 1:]], -1,\n        )\n        # scatter sorted tensors to original indexing\n        indices_to_remove = scatter_values_on_batch_indices(sorted_indices_to_remove, sorted_indices)\n        logits = set_tensor_by_indices_to_value(logits, indices_to_remove, filter_value)\n    return logits\n\n\ndef scatter_values_on_batch_indices(values, batch_indices):\n    shape = shape_list(batch_indices)\n    # broadcast batch dim to shape\n    broad_casted_batch_dims = tf.reshape(tf.broadcast_to(tf.expand_dims(tf.range(shape[0]), axis=-1), shape), [1, -1])\n    # transform batch_indices to pair_indices\n    pair_indices = tf.transpose(tf.concat([broad_casted_batch_dims, tf.reshape(batch_indices, [1, -1])], 0))\n    # scatter values to pair indices\n    return tf.scatter_nd(pair_indices, tf.reshape(values, [-1]), shape)\n\n\ndef set_tensor_by_indices_to_value(tensor, indices, value):\n    # create value_tensor since tensor value assignment is not possible in TF\n    value_tensor = tf.zeros_like(tensor) + value\n    return tf.where(indices, value_tensor, tensor)\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, 
sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass TFConv1D(tf.keras.layers.Layer):\n    def __init__(self, nf, nx, initializer_range=0.02, **kwargs):\n        \"\"\" TFConv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__(**kwargs)\n        self.nf = nf\n        self.nx = nx\n        self.initializer_range = initializer_range\n\n    def build(self, input_shape):\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.nx, self.nf], initializer=get_initializer(self.initializer_range)\n        )\n        self.bias = self.add_weight(\"bias\", shape=[1, self.nf], initializer=tf.zeros_initializer())\n\n    def call(self, x):\n        bz, sl = shape_list(x)[:2]\n\n        x = tf.reshape(x, [-1, self.nx])\n        x = tf.matmul(x, self.weight) + self.bias\n\n        x = tf.reshape(x, [bz, sl, self.nf])\n\n        return x\n\n\nclass TFSharedEmbeddings(tf.keras.layers.Layer):\n    \"\"\"Construct shared token embeddings.\n    \"\"\"\n\n    def __init__(self, vocab_size, hidden_size, initializer_range=None, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.initializer_range = hidden_size ** -0.5 if initializer_range is None else initializer_range\n\n    def build(self, input_shape):\n        \"\"\"Build shared token embedding layer\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        self.weight = self.add_weight(\n            \"weight\", shape=[self.vocab_size, self.hidden_size], initializer=get_initializer(self.initializer_range)\n        )\n        super().build(input_shape)\n\n    def call(self, inputs, mode=\"embedding\"):\n        \"\"\"Get token embeddings of inputs.\n        Args:\n            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)\n            mode: string, a valid value is one of \"embedding\" and \"linear\".\n        Returns:\n            outputs: (1) If mode == \"embedding\", output embedding tensor, float32 with\n                
shape [batch_size, length, embedding_size]; (2) mode == \"linear\", output\n                linear tensor, float32 with shape [batch_size, length, vocab_size].\n        Raises:\n            ValueError: if mode is not valid.\n\n        Shared weights logic adapted from\n            https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24\n        \"\"\"\n        if mode == \"embedding\":\n            return self._embedding(inputs)\n        elif mode == \"linear\":\n            return self._linear(inputs)\n        else:\n            raise ValueError(\"mode {} is not valid.\".format(mode))\n\n    def _embedding(self, input_ids):\n        \"\"\"Applies embedding based on inputs tensor.\"\"\"\n        return tf.gather(self.weight, input_ids)\n\n    def _linear(self, inputs):\n        \"\"\"Computes logits by running inputs through a linear layer.\n            Args:\n                inputs: A float32 tensor with shape [..., hidden_size]\n            Returns:\n                float32 tensor with shape [..., vocab_size].\n        \"\"\"\n        first_dims = shape_list(inputs)[:-1]\n\n        x = tf.reshape(inputs, [-1, self.hidden_size])\n        logits = tf.matmul(x, self.weight, transpose_b=True)\n\n        return tf.reshape(logits, first_dims + [self.vocab_size])\n\n\nclass TFSequenceSummary(tf.keras.layers.Layer):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.\n            summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config, initializer_range=0.02, **kwargs):\n        super().__init__(**kwargs)\n\n        self.summary_type = config.summary_type if hasattr(config, \"summary_use_proj\") else \"last\"\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. 
https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.has_summary = hasattr(config, \"summary_use_proj\") and config.summary_use_proj\n        if self.has_summary:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = tf.keras.layers.Dense(\n                num_classes, kernel_initializer=get_initializer(initializer_range), name=\"summary\"\n            )\n\n        self.has_activation = hasattr(config, \"summary_activation\") and config.summary_activation == \"tanh\"\n        if self.has_activation:\n            self.activation = tf.keras.activations.tanh\n\n        self.has_first_dropout = hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0\n        if self.has_first_dropout:\n            self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)\n\n        self.has_last_dropout = hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0\n        if self.has_last_dropout:\n            self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)\n\n    def call(self, inputs, training=False):\n        \"\"\" hidden_states: float Tensor in shape [bsz, seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if not isinstance(inputs, (dict, tuple, list)):\n            hidden_states = inputs\n            cls_index = None\n        elif isinstance(inputs, (tuple, list)):\n            hidden_states = inputs[0]\n            cls_index = inputs[1] if len(inputs) > 1 else None\n            assert len(inputs) <= 2, \"Too many inputs.\"\n        else:\n            hidden_states = inputs.get(\"hidden_states\")\n            cls_index = inputs.get(\"cls_index\", None)\n\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = tf.reduce_mean(hidden_states, axis=1)\n        elif self.summary_type == \"cls_index\":\n            hidden_shape = shape_list(hidden_states)  # e.g. 
[batch, num choices, seq length, hidden dims]\n            if cls_index is None:\n                cls_index = tf.fill(\n                    hidden_shape[:-2], hidden_shape[-2] - 1\n                )  # A tensor full of shape [batch] or [batch, num choices] full of sequence length\n            cls_shape = shape_list(cls_index)\n            if len(cls_shape) <= len(hidden_shape) - 2:\n                cls_index = cls_index[..., tf.newaxis]\n            # else:\n            # cls_index = cls_index[..., tf.newaxis]\n            # cls_index = cls_index.expand((-1,) * (cls_index.dim()-1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = tf.gather(hidden_states, cls_index, batch_dims=len(hidden_shape) - 2)\n            output = tf.squeeze(\n                output, axis=len(hidden_shape) - 2\n            )  # shape of output: (batch, num choices, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        if self.has_first_dropout:\n            output = self.first_dropout(output, training=training)\n\n        if self.has_summary:\n            output = self.summary(output)\n\n        if self.has_activation:\n            output = self.activation(output)\n\n        if self.has_last_dropout:\n            output = self.last_dropout(output, training=training)\n\n        return output\n\n\ndef shape_list(x):\n    \"\"\"Deal with dynamic shape in tensorflow cleanly.\"\"\"\n    static = x.shape.as_list()\n    dynamic = tf.shape(x)\n    return [dynamic[i] if s is None else s for i, s in enumerate(static)]\n\n\ndef get_initializer(initializer_range=0.02):\n    \"\"\"Creates a `tf.initializers.truncated_normal` with the given range.\n    Args:\n        initializer_range: float, initializer range for stddev.\n    Returns:\n        TruncatedNormal initializer with stddev = `initializer_range`.\n    \"\"\"\n    return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import TFPreTrainedModel, TFSequenceSummary, TFSharedEmbeddings, get_initializer, shape_list\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = tf.constant(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = tf.constant(np.cos(position_enc[:, 1::2]))\n\n\ndef gelu(x):\n    \"\"\" Gaussian Error Linear Unit.\n    Original Implementation of the gelu activation function in Google Bert repo when initially created.\n        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):\n        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))\n    return x * cdf\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None, dtype=tf.float32):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    bs = shape_list(lengths)[0]\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        # assert lengths.max().item() <= slen\n        alen = tf.range(slen)\n        mask = tf.math.less(alen, lengths[:, tf.newaxis])\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    if causal:\n        attn_mask = tf.less_equal(\n            tf.tile(alen[tf.newaxis, tf.newaxis, :], (bs, slen, 1)), alen[tf.newaxis, :, tf.newaxis]\n        )\n    else:\n        attn_mask = mask\n\n    # sanity check\n    # assert shape_list(mask) == [bs, slen]\n    tf.debugging.assert_equal(shape_list(mask), [bs, slen])\n    assert causal is False or shape_list(attn_mask) == [bs, slen, slen]\n\n    mask = tf.cast(mask, dtype=dtype)\n    attn_mask = tf.cast(attn_mask, dtype=dtype)\n\n    return mask, attn_mask\n\n\nclass TFMultiHeadAttention(tf.keras.layers.Layer):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config, **kwargs):\n        super().__init__(**kwargs)\n      
  self.layer_id = next(TFMultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"q_lin\")\n        self.k_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"k_lin\")\n        self.v_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"v_lin\")\n        self.out_lin = tf.keras.layers.Dense(dim, kernel_initializer=get_initializer(config.init_std), name=\"out_lin\")\n        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def call(self, inputs, training=False):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        input, mask, kv, cache, head_mask = inputs\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = shape_list(input)\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = shape_list(kv)[1]\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if len(shape_list(mask)) == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, dim_per_head)), perm=(0, 2, 1, 3))\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.n_heads * dim_per_head))\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = tf.concat([k_, k], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = tf.concat([v_, v], axis=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = tf.matmul(q, k, transpose_b=True)  # (bs, n_heads, qlen, klen)\n        mask = tf.reshape(mask, mask_reshape)  # (bs, n_heads, qlen, klen)\n        # scores.masked_fill_(mask, -float('inf'))                            # (bs, n_heads, qlen, klen)\n        scores = scores - 1e30 * (1.0 - mask)\n\n        weights = tf.nn.softmax(scores, axis=-1)  # (bs, n_heads, qlen, klen)\n        weights = self.dropout(weights, training=training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n       
 if head_mask is not None:\n            weights = weights * head_mask\n\n        context = tf.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # (bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TFTransformerFFN(tf.keras.layers.Layer):\n    def __init__(self, in_dim, dim_hidden, out_dim, config, **kwargs):\n        super().__init__(**kwargs)\n        self.lin1 = tf.keras.layers.Dense(dim_hidden, kernel_initializer=get_initializer(config.init_std), name=\"lin1\")\n        self.lin2 = tf.keras.layers.Dense(out_dim, kernel_initializer=get_initializer(config.init_std), name=\"lin2\")\n        self.act = tf.keras.layers.Activation(gelu) if config.gelu_activation else tf.keras.activations.relu\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, input, training=False):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = self.dropout(x, training=training)\n        return x\n\n\nclass TFXLMMainLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        self.attention_dropout = tf.keras.layers.Dropout(config.attention_dropout)\n\n        self.position_embeddings = tf.keras.layers.Embedding(\n            config.max_position_embeddings,\n            self.dim,\n            embeddings_initializer=get_initializer(config.embed_init_std),\n            name=\"position_embeddings\",\n        )\n        if config.sinusoidal_embeddings:\n            raise NotImplementedError\n            # create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = tf.keras.layers.Embedding(\n                self.n_langs,\n                self.dim,\n                embeddings_initializer=get_initializer(config.embed_init_std),\n                name=\"lang_embeddings\",\n            )\n        self.embeddings = TFSharedEmbeddings(\n            
self.n_words, self.dim, initializer_range=config.embed_init_std, name=\"embeddings\"\n        )  # padding_idx=self.pad_index)\n        self.layer_norm_emb = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm_emb\")\n\n        # transformer layers\n        self.attentions = []\n        self.layer_norm1 = []\n        self.ffns = []\n        self.layer_norm2 = []\n        # if self.is_decoder:\n        #     self.layer_norm15 = []\n        #     self.encoder_attn = []\n\n        for i in range(self.n_layers):\n            self.attentions.append(\n                TFMultiHeadAttention(self.n_heads, self.dim, config=config, name=\"attentions_._{}\".format(i))\n            )\n            self.layer_norm1.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm1_._{}\".format(i))\n            )\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(\n                TFTransformerFFN(self.dim, self.hidden_dim, self.dim, config=config, name=\"ffns_._{}\".format(i))\n            )\n            self.layer_norm2.append(\n                tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm2_._{}\".format(i))\n            )\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        raise NotImplementedError\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        training=False,\n    ):  # removed: src_enc=None, src_len=None\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            langs = inputs[2] if len(inputs) > 2 else langs\n            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids\n            position_ids = inputs[4] if len(inputs) > 4 else position_ids\n            lengths = inputs[5] if len(inputs) > 5 else lengths\n            cache = inputs[6] if len(inputs) > 6 else cache\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            langs = inputs.get(\"langs\", langs)\n            token_type_ids = 
inputs.get(\"token_type_ids\", token_type_ids)\n            position_ids = inputs.get(\"position_ids\", position_ids)\n            lengths = inputs.get(\"lengths\", lengths)\n            cache = inputs.get(\"cache\", cache)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            assert len(inputs) <= 9, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            bs, slen = shape_list(input_ids)\n        elif inputs_embeds is not None:\n            bs, slen = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = tf.reduce_sum(tf.cast(tf.not_equal(input_ids, self.pad_index), dtype=tf.int32), axis=1)\n            else:\n                lengths = tf.convert_to_tensor([slen] * bs, tf.int32)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        # assert shape_list(lengths)[0] == bs\n        tf.debugging.assert_equal(shape_list(lengths)[0], bs)\n        # assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        # position_ids\n        if position_ids is None:\n            position_ids = tf.expand_dims(tf.range(slen), axis=0)\n        else:\n            # assert shape_list(position_ids) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(position_ids), [bs, slen])\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            # assert shape_list(langs) == [bs, slen]  # (slen, bs)\n            tf.debugging.assert_equal(shape_list(langs), [bs, slen])\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x qlen x klen]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layers\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = 
self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = self.dropout(tensor, training=training)\n        tensor = tensor * mask[..., tf.newaxis]\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i]([tensor, attn_mask, None, cache, head_mask[i]], training=training)\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = self.dropout(attn, training=training)\n            tensor = tensor + attn\n            tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor = tensor * mask[..., tf.newaxis]\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass TFXLMPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    base_model_prefix = \"transformer\"\n\n    @property\n    def dummy_inputs(self):\n        # Sometimes XLM has language embeddings so don't forget to build them as well if needed\n        inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? 
<../glossary.html#attention-mask>`__\n        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, tf.Tensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``tf.Tensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\nclass TFXLMPredLayer(tf.keras.layers.Layer):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config, input_embeddings, **kwargs):\n      
  super().__init__(**kwargs)\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        if config.asm is False:\n            self.input_embeddings = input_embeddings\n        else:\n            raise NotImplementedError\n            # self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n            #     in_features=dim,\n            #     n_classes=config.n_words,\n            #     cutoffs=config.asm_cutoffs,\n            #     div_value=config.asm_div_value,\n            #     head_bias=True,  # default is False\n            # )\n\n    def build(self, input_shape):\n        # The output weights are the same as the input embeddings, but there is an output-only bias for each token.\n        self.bias = self.add_weight(shape=(self.n_words,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMWithLMHeadModel(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.pred_layer = TFXLMPredLayer(config, self.transformer.embeddings, name=\"pred_layer_._proj\")\n\n    def get_output_embeddings(self):\n        return self.pred_layer.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = inputs.shape[0]\n        mask_token = tf.ones((effective_batch_size, 1), dtype=tf.int32) * mask_token_id\n        inputs = tf.concat([inputs, mask_token], axis=1)\n\n        if lang_id is not None:\n            langs = tf.ones_like(inputs) * lang_id\n        else:\n            langs = None\n        return {\"inputs\": inputs, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to 
compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMWithLMHeadModel\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output)\n        outputs = (outputs,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForSequenceClassification(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(config, initializer_range=config.init_std, name=\"sequence_summary\")\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForSequenceClassification\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        labels = tf.constant([1])[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = 
self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLMMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.init_std), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLMTokenizer, TFXLMForQuestionAnsweringSimple\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = TFXLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # start_logits, end_logits, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0  XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_tf_roberta import (\n    TFRobertaForMaskedLM,\n    TFRobertaForSequenceClassification,\n    TFRobertaForTokenClassification,\n    TFRobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    .. note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaModel(TFRobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForMaskedLM(TFRobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForMaskedLM`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForSequenceClassification(TFRobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForSequenceClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass TFXLMRobertaForTokenClassification(TFRobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.TFRobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_tf_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" TF 2.0 XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_tf_utils import (\n    TFPreTrainedModel,\n    TFSequenceSummary,\n    TFSharedEmbeddings,\n    get_initializer,\n    keras_serializable,\n    shape_list,\n)\nfrom .tokenization_utils import BatchEncoding\n\n\nlogger = logging.getLogger(__name__)\n\nTF_XLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef gelu(x):\n    \"\"\" Implementation of the gelu activation function.\n        XLNet is using OpenAI GPT's gelu\n        Also see https://arxiv.org/abs/1606.08415\n    \"\"\"\n    cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))\n    return x * cdf\n\n\ndef swish(x):\n    return x * tf.sigmoid(x)\n\n\nACT2FN = {\n    \"gelu\": tf.keras.layers.Activation(gelu),\n    \"relu\": tf.keras.activations.relu,\n    \"swish\": tf.keras.layers.Activation(swish),\n}\n\n\nclass TFXLNetRelativeAttention(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n        self.initializer_range = config.initializer_range\n\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.q = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"q\"\n        )\n        self.k = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"k\"\n        )\n        self.v = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"v\"\n        )\n        self.o = self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"o\"\n        )\n        self.r = 
self.add_weight(\n            shape=(self.d_model, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"r\"\n        )\n        self.r_r_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_r_bias\"\n        )\n        self.r_s_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_s_bias\"\n        )\n        self.r_w_bias = self.add_weight(\n            shape=(self.n_head, self.d_head), initializer=\"zeros\", trainable=True, name=\"r_w_bias\"\n        )\n        self.seg_embed = self.add_weight(\n            shape=(2, self.n_head, self.d_head), initializer=initializer, trainable=True, name=\"seg_embed\"\n        )\n        super().build(input_shape)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    def rel_shift(self, x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = shape_list(x)\n\n        x = tf.reshape(x, (x_size[1], x_size[0], x_size[2], x_size[3]))\n        x = x[1:, ...]\n        x = tf.reshape(x, (x_size[0], x_size[1] - 1, x_size[2], x_size[3]))\n        x = x[:, 0:klen, :, :]\n        # x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    def rel_attn_core(self, inputs, training=False):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        q_head, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask, head_mask = inputs\n\n        # content based attention score\n        ac = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = tf.einsum(\"ibnd,jbnd->ijbn\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift(bd, klen=shape_list(ac)[1])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = tf.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = tf.einsum(\"ijbs,ibns->ijbn\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == tf.float16:\n                attn_score = attn_score - 65500 * attn_mask\n            else:\n                attn_score = attn_score - 1e30 * attn_mask\n\n        # attention probability\n        attn_prob = tf.nn.softmax(attn_score, axis=1)\n\n        attn_prob = self.dropout(attn_prob, training=training)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # attention output\n        attn_vec = tf.einsum(\"ijbn,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, attn_prob\n\n        return attn_vec\n\n    def post_attention(self, inputs, residual=True, training=False):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to `d_model`)\n        h, attn_vec = inputs\n\n        attn_out = tf.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out, training=training)\n\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def call(self, inputs, training=False):\n        (h, g, 
attn_mask_h, attn_mask_g, r, seg_mat, mems, target_mapping, head_mask) = inputs\n\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec_h], training=training)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = tf.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = tf.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = tf.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    [q_head_g, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_g, head_mask], training=training\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention([g, attn_vec_g], training=training)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and len(shape_list(mems)) > 1:\n                cat = tf.concat([mems, h], axis=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = tf.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = tf.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = tf.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                [q_head_h, k_head_h, v_head_h, k_head_r, seg_mat, attn_mask_h, head_mask], training=training\n            )\n\n            if self.output_attentions:\n                attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention([h, attn_vec], training=training)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + 
(attn_prob,)\n        return outputs\n\n\nclass TFXLNetFeedForward(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name=\"layer_norm\")\n        self.layer_1 = tf.keras.layers.Dense(\n            config.d_inner, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_1\"\n        )\n        self.layer_2 = tf.keras.layers.Dense(\n            config.d_model, kernel_initializer=get_initializer(config.initializer_range), name=\"layer_2\"\n        )\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def call(self, inp, training=False):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_2(output)\n        output = self.dropout(output, training=training)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass TFXLNetLayer(tf.keras.layers.Layer):\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.rel_attn = TFXLNetRelativeAttention(config, name=\"rel_attn\")\n        self.ff = TFXLNetFeedForward(config, name=\"ff\")\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def call(self, inputs, training=False):\n        outputs = self.rel_attn(inputs, training=training)\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g, training=training)\n        output_h = self.ff(output_h, training=training)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass TFXLNetLMHead(tf.keras.layers.Layer):\n    def __init__(self, config, input_embeddings, **kwargs):\n        super().__init__(**kwargs)\n        self.vocab_size = config.vocab_size\n        # The output weights are the same as the input embeddings, but there is\n        # an output-only bias for each token.\n        self.input_embeddings = input_embeddings\n\n    def build(self, input_shape):\n        self.bias = self.add_weight(shape=(self.vocab_size,), initializer=\"zeros\", trainable=True, name=\"bias\")\n        super().build(input_shape)\n\n    def call(self, hidden_states):\n        hidden_states = self.input_embeddings(hidden_states, mode=\"linear\")\n        hidden_states = hidden_states + self.bias\n        return hidden_states\n\n\n@keras_serializable\nclass TFXLNetMainLayer(tf.keras.layers.Layer):\n    config_class = XLNetConfig\n\n    def __init__(self, config, **kwargs):\n        super().__init__(**kwargs)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n        self.use_bfloat16 = config.use_bfloat16\n        self.initializer_range = config.initializer_range\n\n        self.word_embedding = 
TFSharedEmbeddings(\n            config.vocab_size, config.d_model, initializer_range=config.initializer_range, name=\"word_embedding\"\n        )\n        self.layer = [TFXLNetLayer(config, name=\"layer_._{}\".format(i)) for i in range(config.n_layer)]\n        self.dropout = tf.keras.layers.Dropout(config.dropout)\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def build(self, input_shape):\n        initializer = get_initializer(self.initializer_range)\n        self.mask_emb = self.add_weight(\n            shape=(1, 1, self.d_model), initializer=initializer, trainable=True, name=\"mask_emb\"\n        )\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        raise NotImplementedError\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen, dtype=tf.float32):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: TODO Lysandre didn't fill\n            mlen: TODO Lysandre didn't fill\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = tf.ones([qlen, qlen], dtype=dtype)\n        mask_u = tf.matrix_band_part(attn_mask, 0, -1)\n        mask_dia = tf.matrix_band_part(attn_mask, 0, 0)\n        attn_mask_pad = tf.zeros([qlen, mlen], dtype=dtype)\n        ret = tf.concat([attn_mask_pad, mask_u - mask_dia], 1)\n        if self.same_length:\n            mask_l = tf.matrix_band_part(attn_mask, -1, 0)\n            ret = tf.concat([ret[:, :qlen] + mask_l - mask_dia, ret[:, qlen:]], 1)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        \"\"\"cache hidden states into memory.\"\"\"\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :]\n\n        return tf.stop_gradient(new_mem)\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = tf.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = tf.concat([tf.sin(sinusoid_inp), tf.cos(sinusoid_inp)], axis=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = tf.tile(pos_emb, [1, bsz, 1])\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None, dtype=None):\n        \"\"\"create relative positional encoding.\"\"\"\n        freq_seq = tf.range(0, self.d_model, 2.0)\n        if dtype is not None and dtype != tf.float32:\n            freq_seq = tf.cast(freq_seq, dtype=dtype)\n        inv_freq = 1 / (10000 ** (freq_seq / self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` 
{}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            bwd_pos_seq = tf.range(-beg, -end, 1.0)\n\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n                bwd_pos_seq = tf.cast(bwd_pos_seq, dtype=dtype)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n                bwd_pos_seq = tf.clip_by_value(bwd_pos_seq, -self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                # With bi_data, the batch size should be divisible by 2.\n                assert bsz % 2 == 0\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = tf.concat([fwd_pos_emb, bwd_pos_emb], axis=1)\n        else:\n            fwd_pos_seq = tf.range(beg, end, -1.0)\n            if dtype is not None and dtype != tf.float32:\n                fwd_pos_seq = tf.cast(fwd_pos_seq, dtype=dtype)\n            if self.clamp_len > 0:\n                fwd_pos_seq = tf.clip_by_value(fwd_pos_seq, -self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        return pos_emb\n\n    def call(\n        self,\n        inputs,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        training=False,\n    ):\n        if isinstance(inputs, (tuple, list)):\n            input_ids = inputs[0]\n            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask\n            mems = inputs[2] if len(inputs) > 2 else mems\n            perm_mask = inputs[3] if len(inputs) > 3 else perm_mask\n            target_mapping = inputs[4] if len(inputs) > 4 else target_mapping\n            token_type_ids = inputs[5] if len(inputs) > 5 else token_type_ids\n            input_mask = inputs[6] if len(inputs) > 6 else input_mask\n            head_mask = inputs[7] if len(inputs) > 7 else head_mask\n            inputs_embeds = inputs[8] if len(inputs) > 8 else inputs_embeds\n            use_cache = inputs[9] if len(inputs) > 9 else use_cache\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        elif isinstance(inputs, (dict, BatchEncoding)):\n            input_ids = inputs.get(\"input_ids\")\n            attention_mask = inputs.get(\"attention_mask\", attention_mask)\n            mems = inputs.get(\"mems\", mems)\n            perm_mask = inputs.get(\"perm_mask\", perm_mask)\n            target_mapping = inputs.get(\"target_mapping\", target_mapping)\n            token_type_ids = inputs.get(\"token_type_ids\", token_type_ids)\n            input_mask = inputs.get(\"input_mask\", input_mask)\n            head_mask = inputs.get(\"head_mask\", head_mask)\n            inputs_embeds = inputs.get(\"inputs_embeds\", inputs_embeds)\n            use_cache = inputs.get(\"use_cache\", use_cache)\n            assert len(inputs) <= 10, \"Too many inputs.\"\n        else:\n            input_ids = inputs\n\n        # the original code for XLNet 
uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = tf.transpose(input_ids, perm=(1, 0))\n            qlen, bsz = shape_list(input_ids)[:2]\n        elif inputs_embeds is not None:\n            inputs_embeds = tf.transpose(inputs_embeds, perm=(1, 0, 2))\n            qlen, bsz = shape_list(inputs_embeds)[:2]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = tf.transpose(token_type_ids, perm=(1, 0)) if token_type_ids is not None else None\n        input_mask = tf.transpose(input_mask, perm=(1, 0)) if input_mask is not None else None\n        attention_mask = tf.transpose(attention_mask, perm=(1, 0)) if attention_mask is not None else None\n        perm_mask = tf.transpose(perm_mask, perm=(1, 2, 0)) if perm_mask is not None else None\n        target_mapping = tf.transpose(target_mapping, perm=(1, 2, 0)) if target_mapping is not None else None\n\n        mlen = shape_list(mems[0])[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = tf.bfloat16 if self.use_bfloat16 else tf.float32\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        assert input_mask is None or attention_mask is None, (\n            \"You can only use one of input_mask (uses 1 for padding) \"\n            \"or attention_mask (uses 0 for padding, added for compatbility with BERT). 
Please choose one.\"\n        )\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - tf.cast(attention_mask, dtype=dtype_float)\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = tf.zeros([shape_list(data_mask)[0], mlen, bsz], dtype=dtype_float)\n                data_mask = tf.concat([mems_mask, data_mask], axis=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = tf.cast(attn_mask > 0, dtype=dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -tf.eye(qlen, dtype=dtype_float)\n            if mlen > 0:\n                non_tgt_mask = tf.concat([tf.zeros([qlen, mlen], dtype=dtype_float), non_tgt_mask], axis=-1)\n            non_tgt_mask = tf.cast((attn_mask + non_tgt_mask[:, :, None, None]) > 0, dtype=dtype_float)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k, training=training)\n        if target_mapping is not None:\n            word_emb_q = tf.tile(self.mask_emb, [shape_list(target_mapping)[0], bsz, 1])\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q, training=training)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = tf.zeros([mlen, bsz], dtype=tf.int32)\n                cat_ids = tf.concat([mem_pad, token_type_ids], 0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = tf.cast(tf.logical_not(tf.equal(token_type_ids[:, None], cat_ids[None, :])), tf.int32)\n            seg_mat = tf.one_hot(seg_mat, 2, dtype=dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz, dtype=dtype_float)\n        pos_emb = self.dropout(pos_emb, training=training)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            raise NotImplementedError\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n     
   if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            # cache new mems\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                [output_h, output_g, non_tgt_mask, attn_mask, pos_emb, seg_mat, mems[i], target_mapping, head_mask[i]],\n                training=training,\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h, training=training)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (tf.transpose(output, perm=(1, 0, 2)),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(tf.transpose(hs, perm=(1, 0, 2)) for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\nclass TFXLNetPreTrainedModel(TFPreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    base_model_prefix = \"transformer\"\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    .. 
note::\n\n        TF 2.0 models accepts two formats as inputs:\n\n            - having all inputs as keyword arguments (like PyTorch models), or\n            - having all inputs as a list, tuple or dict in the first positional arguments.\n\n        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having\n        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.\n\n        If you choose this second option, there are three possibilities you can use to gather all the input Tensors\n        in the first positional argument :\n\n        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`\n        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:\n          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`\n        - a dictionary with one or several input Tensors associated to the input names given in the docstring:\n          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.XLNetTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputing raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetModel.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        outputs = self.transformer(inputs, **kwargs)\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetLMHeadModel(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.lm_loss = TFXLNetLMHead(config, self.transformer.word_embedding, name=\"lm_loss\")\n\n    def get_output_embeddings(self):\n        return self.lm_loss.input_embeddings\n\n    def prepare_inputs_for_generation(self, inputs, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = inputs.shape[0]\n        dummy_token = tf.zeros((effective_batch_size, 1), dtype=tf.int32)\n        inputs = tf.concat([inputs, dummy_token], axis=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = inputs.shape[1]\n        perm_mask = tf.zeros((effective_batch_size, sequence_length, sequence_length - 1), dtype=tf.float32)\n        perm_mask_seq_end = tf.ones((effective_batch_size, sequence_length, 1), dtype=tf.float32)\n        perm_mask = tf.concat([perm_mask, perm_mask_seq_end], axis=-1)\n\n        # We'll only predict the last token\n        target_mapping = tf.zeros((effective_batch_size, 1, sequence_length - 1), dtype=tf.float32)\n        target_mapping_seq_end = tf.ones((effective_batch_size, 1, 1), dtype=tf.float32)\n        target_mapping = tf.concat([target_mapping, target_mapping_seq_end], axis=-1)\n\n        inputs = {\n            \"inputs\": inputs,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        prediction_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        import numpy as np\n        from transformers1 import XLNetTokenizer, TFXLNetLMHeadModel\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=True))[None, :]  # We will predict the masked token\n        perm_mask = np.zeros((1, input_ids.shape[1], input_ids.shape[1]))\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = np.zeros((1, 1, input_ids.shape[1]))  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n        outputs = model(input_ids, perm_mask=tf.constant(perm_mask, dtype=tf.float32), target_mapping=tf.constant(target_mapping, dtype=tf.float32))\n\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        hidden_state = transformer_outputs[0]\n        logits = self.lm_loss(hidden_state)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.sequence_summary = TFSequenceSummary(\n            config, initializer_range=config.initializer_range, name=\"sequence_summary\"\n        )\n        self.logits_proj = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"logits_proj\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForSequenceClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        logits = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForTokenClassification(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.num_labels = config.num_labels\n\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.classifier = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"classifier\"\n        )\n\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Return:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        logits (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForTokenClassification\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = TFXLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\"))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        scores = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n        output = transformer_outputs[0]\n\n        logits = self.classifier(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        return outputs  # return logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__(config, *inputs, **kwargs)\n        self.transformer = TFXLNetMainLayer(config, name=\"transformer\")\n        self.qa_outputs = tf.keras.layers.Dense(\n            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name=\"qa_outputs\"\n        )\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)\n    def call(self, inputs, **kwargs):\n        r\"\"\"\n    Returns:\n        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`tf.Tensor` or :obj:`Numpy array` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        import tensorflow as tf\n        from transformers1 import XLNetTokenizer, TFXLNetForQuestionAnsweringSimple\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = TFXLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n        input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n        outputs = model(input_ids)\n        start_scores, end_scores = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(inputs, **kwargs)\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = tf.split(logits, 2, axis=-1)\n        start_logits = tf.squeeze(start_logits, axis=-1)\n        end_logits = tf.squeeze(end_logits, axis=-1)\n\n        outputs = (start_logits, end_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n   
     return outputs  # start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n# @add_start_docstrings(\"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n#     the hidden-states output to compute `span start logits` and `span end logits`). \"\"\",\n#     XLNET_START_DOCSTRING, XLNET_INPUTS_DOCSTRING)\n# class TFXLNetForQuestionAnswering(TFXLNetPreTrainedModel):\n#     r\"\"\"\n#     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n#         **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n#         **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top)``\n#             Indices for the top config.start_n_top start token possibilities (beam-search).\n#         **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n#             Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n#         **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n#             ``tf.Tensor`` of shape ``(batch_size,)``\n#             Log probabilities for the ``is_impossible`` label of the answers.\n#         **mems**:\n#             list of ``tf.Tensor`` (one for each layer):\n#             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n#             if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.\n#             See details in the docstring of the `mems` input above.\n#         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)\n#             list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)\n#             of shape ``(batch_size, sequence_length, hidden_size)``:\n#             Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n#         **attentions**: (`optional`, returned when ``config.output_attentions=True``)\n#             list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:\n#             Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.\n\n#     Examples::\n\n#         # For example purposes. 
Not runnable.\n#         tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n#         model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')\n#         input_ids = tf.constant(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True))[None, :]  # Batch size 1\n#         start_positions = tf.constant([1])\n#         end_positions = tf.constant([3])\n#         outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n#         loss, start_scores, end_scores = outputs[:2]\n\n#     \"\"\"\n#     def __init__(self, config, *inputs, **kwargs):\n#         super().__init__(config, *inputs, **kwargs)\n#         self.start_n_top = config.start_n_top\n#         self.end_n_top = config.end_n_top\n\n#         self.transformer = TFXLNetMainLayer(config, name='transformer')\n#         self.start_logits = TFPoolerStartLogits(config, name='start_logits')\n#         self.end_logits = TFPoolerEndLogits(config, name='end_logits')\n#         self.answer_class = TFPoolerAnswerClass(config, name='answer_class')\n\n#     def call(self, inputs, training=False):\n#         transformer_outputs = self.transformer(inputs, training=training)\n#         hidden_states = transformer_outputs[0]\n#         start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n#         outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n#         if start_positions is not None and end_positions is not None:\n#             # If we are on multi-GPU, let's remove the dimension added by batch splitting\n#             for x in (start_positions, end_positions, cls_index, is_impossible):\n#                 if x is not None and x.dim() > 1:\n#                     x.squeeze_(-1)\n\n#             # during training, compute the end logits based on the ground truth of the start position\n#             end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n#             loss_fct = CrossEntropyLoss()\n#             start_loss = loss_fct(start_logits, start_positions)\n#             end_loss = loss_fct(end_logits, end_positions)\n#             total_loss = (start_loss + end_loss) / 2\n\n#             if cls_index is not None and is_impossible is not None:\n#                 # Predict answerability from the representation of CLS and START\n#                 cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n#                 loss_fct_cls = nn.BCEWithLogitsLoss()\n#                 cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n#                 # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n#                 total_loss += cls_loss * 0.5\n\n#             outputs = (total_loss,) + outputs\n\n#         else:\n#             # during inference, compute the end logits based on beam search\n#             bsz, slen, hsz = hidden_states.size()\n#             start_log_probs = F.softmax(start_logits, dim=-1) # shape (bsz, slen)\n\n#             start_top_log_probs, start_top_index = torch.topk(start_log_probs, self.start_n_top, dim=-1) # shape (bsz, start_n_top)\n#             start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz) # shape (bsz, start_n_top, hsz)\n#             start_states = torch.gather(hidden_states, -2, start_top_index_exp) # shape (bsz, start_n_top, hsz)\n#             start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1) # shape (bsz, 
slen, start_n_top, hsz)\n\n#             hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(start_states) # shape (bsz, slen, start_n_top, hsz)\n#             p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n#             end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n#             end_log_probs = F.softmax(end_logits, dim=1) # shape (bsz, slen, start_n_top)\n\n#             end_top_log_probs, end_top_index = torch.topk(end_log_probs, self.end_n_top, dim=1) # shape (bsz, end_n_top, start_n_top)\n#             end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n#             end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n#             start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)  # get the representation of START as weighted sum of hidden states\n#             cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)  # Shape (batch size,): one single `cls_logits` for each sample\n\n#             outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n#         # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n#         # or (if labels are provided) (total_loss,)\n#         return outputs\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n    In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py\n\"\"\"\n\n\nimport logging\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .configuration_transfo_xl import TransfoXLConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_transfo_xl_utilities import ProjectedAdaptiveLogSoftmax\nfrom .modeling_utils import PreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\nTRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"transfo-xl-wt103\",\n    # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl\n]\n\n\ndef build_tf_to_pytorch_map(model, config):\n    \"\"\" A map of modules from TF to PyTorch.\n        This time I use a map to keep the PyTorch model as identical to the original PyTorch model as possible.\n    \"\"\"\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        # We are loading in a TransfoXLLMHeadModel => we will load also the Adaptive Softmax\n        tf_to_pt_map.update(\n            {\n                \"transformer/adaptive_softmax/cutoff_0/cluster_W\": model.crit.cluster_weight,\n                \"transformer/adaptive_softmax/cutoff_0/cluster_b\": model.crit.cluster_bias,\n            }\n        )\n        for i, (out_l, proj_l, tie_proj) in enumerate(\n            zip(model.crit.out_layers, model.crit.out_projs, config.tie_projs)\n        ):\n            layer_str = \"transformer/adaptive_softmax/cutoff_%d/\" % i\n            if config.tie_weight:\n                tf_to_pt_map.update({layer_str + \"b\": out_l.bias})\n            else:\n                raise NotImplementedError\n                # I don't think this is implemented in the TF code\n                tf_to_pt_map.update({layer_str + \"lookup_table\": out_l.weight, layer_str + \"b\": out_l.bias})\n            if not tie_proj:\n                tf_to_pt_map.update({layer_str + \"proj\": proj_l})\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings\n    for i, (embed_l, proj_l) in enumerate(zip(model.word_emb.emb_layers, model.word_emb.emb_projs)):\n        layer_str = \"transformer/adaptive_embed/cutoff_%d/\" % i\n        tf_to_pt_map.update({layer_str + \"lookup_table\": embed_l.weight, layer_str + \"proj_W\": proj_l})\n\n    # Transformer blocks\n    for i, b in enumerate(model.layers):\n        layer_str = \"transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.dec_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": 
b.dec_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.dec_attn.o_net.weight,\n                layer_str + \"rel_attn/qkv/kernel\": b.dec_attn.qkv_net.weight,\n                layer_str + \"rel_attn/r/kernel\": b.dec_attn.r_net.weight,\n                layer_str + \"ff/LayerNorm/gamma\": b.pos_ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.pos_ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.pos_ff.CoreNet[0].weight,\n                layer_str + \"ff/layer_1/bias\": b.pos_ff.CoreNet[0].bias,\n                layer_str + \"ff/layer_2/kernel\": b.pos_ff.CoreNet[3].weight,\n                layer_str + \"ff/layer_2/bias\": b.pos_ff.CoreNet[3].bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        for b in model.layers:\n            r_r_list.append(b.dec_attn.r_r_bias)\n            r_w_list.append(b.dec_attn.r_w_bias)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n    tf_to_pt_map.update({\"transformer/r_r_bias\": r_r_list, \"transformer/r_w_bias\": r_w_list})\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_transfo_xl(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_to_pytorch_map(model, config)\n\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    for name, pointer in tf_to_pt_map.items():\n        assert name in tf_weights\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name or \"proj\" in name:\n            array = np.transpose(array)\n        if (\"r_r_bias\" in name or \"r_w_bias\" in name) and len(pointer) > 1:\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    
logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nclass PositionalEmbedding(nn.Module):\n    def __init__(self, demb):\n        super().__init__()\n\n        self.demb = demb\n\n        inv_freq = 1 / (10000 ** (torch.arange(0.0, demb, 2.0) / demb))\n        self.register_buffer(\"inv_freq\", inv_freq)\n\n    def forward(self, pos_seq, bsz=None):\n        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)\n        pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)\n\n        if bsz is not None:\n            return pos_emb[:, None, :].expand(-1, bsz, -1)\n        else:\n            return pos_emb[:, None, :]\n\n\nclass PositionwiseFF(nn.Module):\n    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5):\n        super().__init__()\n\n        self.d_model = d_model\n        self.d_inner = d_inner\n        self.dropout = dropout\n\n        self.CoreNet = nn.Sequential(\n            nn.Linear(d_model, d_inner),\n            nn.ReLU(inplace=True),\n            nn.Dropout(dropout),\n            nn.Linear(d_inner, d_model),\n            nn.Dropout(dropout),\n        )\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.pre_lnorm = pre_lnorm\n\n    def forward(self, inp):\n        if self.pre_lnorm:\n            # layer normalization + positionwise feed-forward\n            core_out = self.CoreNet(self.layer_norm(inp))\n\n            # residual connection\n            output = core_out + inp\n        else:\n            # positionwise feed-forward\n            core_out = self.CoreNet(inp)\n\n            # residual connection + layer normalization\n            output = self.layer_norm(inp + core_out)\n\n        return output\n\n\nclass RelPartialLearnableMultiHeadAttn(nn.Module):\n    def __init__(\n        self,\n        n_head,\n        d_model,\n        d_head,\n        dropout,\n        dropatt=0,\n        tgt_len=None,\n        ext_len=None,\n        mem_len=None,\n        pre_lnorm=False,\n        r_r_bias=None,\n        r_w_bias=None,\n        output_attentions=False,\n        layer_norm_epsilon=1e-5,\n    ):\n        super().__init__()\n\n        self.output_attentions = output_attentions\n        self.n_head = n_head\n        self.d_model = d_model\n        self.d_head = d_head\n        self.dropout = dropout\n\n        self.qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False)\n\n        self.drop = nn.Dropout(dropout)\n        self.dropatt = nn.Dropout(dropatt)\n        self.o_net = nn.Linear(n_head * d_head, d_model, bias=False)\n\n        self.layer_norm = nn.LayerNorm(d_model, eps=layer_norm_epsilon)\n\n        self.scale = 1 / (d_head ** 0.5)\n\n        self.pre_lnorm = pre_lnorm\n\n        if r_r_bias is None or r_w_bias is None:  # Biases are not shared\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        else:\n            self.r_r_bias = r_r_bias\n            self.r_w_bias = r_w_bias\n\n        self.r_net = nn.Linear(self.d_model, self.n_head * self.d_head, bias=False)\n\n    def _rel_shift(self, x):\n        zero_pad_shape = (x.size(0), 1) + x.size()[2:]\n        zero_pad = torch.zeros(zero_pad_shape, device=x.device, dtype=x.dtype)\n        x_padded = torch.cat([zero_pad, x], dim=1)\n\n        x_padded_shape = (x.size(1) + 1, x.size(0)) + x.size()[2:]\n        x_padded = 
x_padded.view(*x_padded_shape)\n\n        x = x_padded[1:].view_as(x)\n\n        return x\n\n    def forward(self, w, r, attn_mask=None, mems=None, head_mask=None):\n        qlen, rlen, bsz = w.size(0), r.size(0), w.size(1)\n\n        if mems is not None:\n            cat = torch.cat([mems, w], 0)\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(cat))\n            else:\n                w_heads = self.qkv_net(cat)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n            w_head_q = w_head_q[-qlen:]\n        else:\n            if self.pre_lnorm:\n                w_heads = self.qkv_net(self.layer_norm(w))\n            else:\n                w_heads = self.qkv_net(w)\n            r_head_k = self.r_net(r)\n\n            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)\n\n        klen = w_head_k.size(0)\n\n        w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n        w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head\n\n        r_head_k = r_head_k.view(rlen, self.n_head, self.d_head)  # qlen x n_head x d_head\n\n        # compute attention score\n        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head\n        AC = torch.einsum(\"ibnd,jbnd->ijbn\", (rw_head_q, w_head_k))  # qlen x klen x bsz x n_head\n\n        rr_head_q = w_head_q + self.r_r_bias\n        BD = torch.einsum(\"ibnd,jnd->ijbn\", (rr_head_q, r_head_k))  # qlen x klen x bsz x n_head\n        BD = self._rel_shift(BD)\n\n        # [qlen x klen x bsz x n_head]\n        attn_score = AC + BD\n        attn_score.mul_(self.scale)\n\n        # compute attention probability\n        if attn_mask is not None and torch.sum(attn_mask).item():\n            attn_mask = attn_mask == 1  # Switch to bool\n            if attn_mask.dim() == 2:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = (\n                        attn_score.float().masked_fill(attn_mask[None, :, :, None], -65000).type_as(attn_score)\n                    )\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[None, :, :, None], -1e30).type_as(attn_score)\n            elif attn_mask.dim() == 3:\n                if next(self.parameters()).dtype == torch.float16:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -65000).type_as(attn_score)\n                else:\n                    attn_score = attn_score.float().masked_fill(attn_mask[:, :, :, None], -1e30).type_as(attn_score)\n\n        # [qlen x klen x bsz x n_head]\n        attn_prob = F.softmax(attn_score, dim=1)\n        attn_prob = self.dropatt(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * head_mask\n\n        # compute attention vector\n        attn_vec = torch.einsum(\"ijbn,jbnd->ibnd\", (attn_prob, w_head_v))\n\n        # [qlen x bsz x n_head x d_head]\n        attn_vec = attn_vec.contiguous().view(attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head)\n\n        # linear projection\n        attn_out = self.o_net(attn_vec)\n        attn_out = self.drop(attn_out)\n\n        if self.pre_lnorm:\n            # residual connection\n            outputs = [w + attn_out]\n        
else:\n            # residual connection + layer normalization\n            outputs = [self.layer_norm(w + attn_out)]\n\n        if self.output_attentions:\n            outputs.append(attn_prob)\n\n        return outputs\n\n\nclass RelPartialLearnableDecoderLayer(nn.Module):\n    def __init__(self, n_head, d_model, d_head, d_inner, dropout, layer_norm_epsilon=1e-5, **kwargs):\n        super().__init__()\n\n        self.dec_attn = RelPartialLearnableMultiHeadAttn(\n            n_head, d_model, d_head, dropout, layer_norm_epsilon=layer_norm_epsilon, **kwargs\n        )\n        self.pos_ff = PositionwiseFF(\n            d_model, d_inner, dropout, pre_lnorm=kwargs.get(\"pre_lnorm\"), layer_norm_epsilon=layer_norm_epsilon\n        )\n\n    def forward(self, dec_inp, r, dec_attn_mask=None, mems=None, head_mask=None):\n\n        attn_outputs = self.dec_attn(dec_inp, r, attn_mask=dec_attn_mask, mems=mems, head_mask=head_mask)\n        ff_output = self.pos_ff(attn_outputs[0])\n\n        outputs = [ff_output] + attn_outputs[1:]\n\n        return outputs\n\n\nclass AdaptiveEmbedding(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, sample_softmax=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n\n        self.cutoffs = cutoffs + [n_token]\n        self.div_val = div_val\n        self.d_proj = d_proj\n\n        self.emb_scale = d_proj ** 0.5\n\n        self.cutoff_ends = [0] + self.cutoffs\n\n        self.emb_layers = nn.ModuleList()\n        self.emb_projs = nn.ParameterList()\n        if div_val == 1:\n            self.emb_layers.append(nn.Embedding(n_token, d_embed, sparse=sample_softmax > 0))\n            if d_proj != d_embed:\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n                self.emb_layers.append(nn.Embedding(r_idx - l_idx, d_emb_i))\n                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n    def forward(self, inp):\n        if self.div_val == 1:\n            embed = self.emb_layers[0](inp)\n            if self.d_proj != self.d_embed:\n                embed = F.linear(embed, self.emb_projs[0])\n        else:\n            param = next(self.parameters())\n            inp_flat = inp.view(-1)\n            emb_flat = torch.zeros([inp_flat.size(0), self.d_proj], dtype=param.dtype, device=param.device)\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n\n                mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx)\n                indices_i = mask_i.nonzero().squeeze()\n\n                if indices_i.numel() == 0:\n                    continue\n\n                inp_i = inp_flat.index_select(0, indices_i) - l_idx\n                emb_i = self.emb_layers[i](inp_i)\n                emb_i = F.linear(emb_i, self.emb_projs[i])\n\n                emb_flat.index_copy_(0, indices_i, emb_i)\n\n            embed_shape = inp.size() + (self.d_proj,)\n            embed = emb_flat.view(embed_shape)\n\n        embed.mul_(self.emb_scale)\n\n        return embed\n\n\nclass TransfoXLPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    
config_class = TransfoXLConfig\n    load_tf_weights = load_tf_weights_in_transfo_xl\n    base_model_prefix = \"transformer\"\n\n    def _init_weight(self, weight):\n        if self.config.init == \"uniform\":\n            nn.init.uniform_(weight, -self.config.init_range, self.config.init_range)\n        elif self.config.init == \"normal\":\n            nn.init.normal_(weight, 0.0, self.config.init_std)\n\n    def _init_bias(self, bias):\n        nn.init.constant_(bias, 0.0)\n\n    def _init_weights(self, m):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        classname = m.__class__.__name__\n        if classname.find(\"Linear\") != -1:\n            if hasattr(m, \"weight\") and m.weight is not None:\n                self._init_weight(m.weight)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        elif classname.find(\"AdaptiveEmbedding\") != -1:\n            if hasattr(m, \"emb_projs\"):\n                for i in range(len(m.emb_projs)):\n                    if m.emb_projs[i] is not None:\n                        nn.init.normal_(m.emb_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"Embedding\") != -1:\n            if hasattr(m, \"weight\"):\n                self._init_weight(m.weight)\n        elif classname.find(\"ProjectedAdaptiveLogSoftmax\") != -1:\n            if hasattr(m, \"cluster_weight\") and m.cluster_weight is not None:\n                self._init_weight(m.cluster_weight)\n            if hasattr(m, \"cluster_bias\") and m.cluster_bias is not None:\n                self._init_bias(m.cluster_bias)\n            if hasattr(m, \"out_projs\"):\n                for i in range(len(m.out_projs)):\n                    if m.out_projs[i] is not None:\n                        nn.init.normal_(m.out_projs[i], 0.0, self.config.proj_init_std)\n        elif classname.find(\"LayerNorm\") != -1:\n            if hasattr(m, \"weight\"):\n                nn.init.normal_(m.weight, 1.0, self.config.init_std)\n            if hasattr(m, \"bias\") and m.bias is not None:\n                self._init_bias(m.bias)\n        else:\n            if hasattr(m, \"r_emb\"):\n                self._init_weight(m.r_emb)\n            if hasattr(m, \"r_w_bias\"):\n                self._init_weight(m.r_w_bias)\n            if hasattr(m, \"r_r_bias\"):\n                self._init_weight(m.r_r_bias)\n            if hasattr(m, \"r_bias\"):\n                self._init_bias(m.r_bias)\n\n\nTRANSFO_XL_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.TransfoXLConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nTRANSFO_XL_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.TransfoXLTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            
:func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.n_token = config.vocab_size\n\n        self.d_embed = config.d_embed\n        self.d_model = config.d_model\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n\n        self.word_emb = AdaptiveEmbedding(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.drop = nn.Dropout(config.dropout)\n\n        self.n_layer = config.n_layer\n\n        self.tgt_len = config.tgt_len\n        self.mem_len = config.mem_len\n        self.ext_len = config.ext_len\n        self.max_klen = config.tgt_len + config.ext_len + config.mem_len\n\n        self.attn_type = config.attn_type\n\n        if not config.untie_r:\n            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n\n        self.layers = nn.ModuleList()\n        if config.attn_type == 0:  # the default attention\n            for i in range(config.n_layer):\n                self.layers.append(\n                    RelPartialLearnableDecoderLayer(\n                        config.n_head,\n                        config.d_model,\n                        config.d_head,\n                        config.d_inner,\n                        config.dropout,\n                        tgt_len=config.tgt_len,\n                        ext_len=config.ext_len,\n                        mem_len=config.mem_len,\n                        dropatt=config.dropatt,\n                        pre_lnorm=config.pre_lnorm,\n                        r_w_bias=None if config.untie_r else self.r_w_bias,\n                        r_r_bias=None if config.untie_r else self.r_r_bias,\n                     
   output_attentions=self.output_attentions,\n                        layer_norm_epsilon=config.layer_norm_epsilon,\n                    )\n                )\n        else:  # learnable embeddings and absolute embeddings are not used in our pretrained checkpoints\n            raise NotImplementedError  # Removed them to avoid maintaining dead code\n\n        self.same_length = config.same_length\n        self.clamp_len = config.clamp_len\n\n        if self.attn_type == 0:  # default attention\n            self.pos_emb = PositionalEmbedding(self.d_model)\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_emb\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_emb = new_embeddings\n\n    def backward_compatible(self):\n        self.sample_softmax = -1\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.tgt_len = tgt_len\n        self.mem_len = mem_len\n        self.ext_len = ext_len\n\n    def _prune_heads(self, heads):\n        logger.info(\"Head pruning is not implemented for Transformer-XL model\")\n        pass\n\n    def init_mems(self, bsz):\n        if self.mem_len > 0:\n            mems = []\n            param = next(self.parameters())\n            for i in range(self.n_layer):\n                empty = torch.zeros(self.mem_len, bsz, self.config.d_model, dtype=param.dtype, device=param.device)\n                mems.append(empty)\n\n            return mems\n        else:\n            return None\n\n    def _update_mems(self, hids, mems, mlen, qlen):\n        # does not deal with None\n        if mems is None:\n            return None\n\n        # mems is not None\n        assert len(hids) == len(mems), \"len(hids) != len(mems)\"\n\n        # There are `mlen + qlen` steps that can be cached into mems\n        # For the next step, the last `ext_len` of the `qlen` tokens\n        # will be used as the extended context. Hence, we only cache\n        # the tokens from `mlen + qlen - self.ext_len - self.mem_len`\n        # to `mlen + qlen - self.ext_len`.\n        with torch.no_grad():\n            new_mems = []\n            end_idx = mlen + max(0, qlen - 0 - self.ext_len)\n            beg_idx = max(0, end_idx - self.mem_len)\n            for i in range(len(hids)):\n\n                cat = torch.cat([mems[i], hids[i]], dim=0)\n                new_mems.append(cat[beg_idx:end_idx].detach())\n\n        return new_mems\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states, mems = outputs[:2]\n\n        \"\"\"\n        # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library\n        # so we transpose here from shape [bsz, len] to shape [len, bsz]\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.size()\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if mems is None:\n            mems = self.init_mems(bsz)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        if inputs_embeds is not None:\n            word_emb = inputs_embeds\n        else:\n            word_emb = self.word_emb(input_ids)\n\n        mlen = mems[0].size(0) if mems is not None else 0\n        klen = mlen + qlen\n        if self.same_length:\n            all_ones = 
word_emb.new_ones((qlen, klen), dtype=torch.uint8)\n            mask_len = klen - self.mem_len\n            if mask_len > 0:\n                mask_shift_len = qlen - mask_len\n            else:\n                mask_shift_len = qlen\n            dec_attn_mask = (torch.triu(all_ones, 1 + mlen) + torch.tril(all_ones, -mask_shift_len))[:, :, None]  # -1\n        else:\n            dec_attn_mask = torch.triu(word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1 + mlen)[\n                :, :, None\n            ]\n\n        hids = []\n        attentions = []\n        if self.attn_type == 0:  # default\n            pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, dtype=word_emb.dtype)\n            if self.clamp_len > 0:\n                pos_seq.clamp_(max=self.clamp_len)\n            pos_emb = self.pos_emb(pos_seq)\n\n            core_out = self.drop(word_emb)\n            pos_emb = self.drop(pos_emb)\n\n            for i, layer in enumerate(self.layers):\n                hids.append(core_out)\n                mems_i = None if mems is None else mems[i]\n                layer_outputs = layer(\n                    core_out, pos_emb, dec_attn_mask=dec_attn_mask, mems=mems_i, head_mask=head_mask[i]\n                )\n                core_out = layer_outputs[0]\n                if self.output_attentions:\n                    attentions.append(layer_outputs[1])\n        else:  # learnable embeddings and absolute embeddings\n            raise NotImplementedError  # Removed these to avoid maintaining dead code - They are not used in our pretrained checkpoint\n\n        core_out = self.drop(core_out)\n\n        new_mems = self._update_mems(hids, mems, mlen, qlen)\n\n        # We transpose back here to shape [bsz, len, hidden_dim]\n        outputs = [core_out.transpose(0, 1).contiguous(), new_mems]\n        if self.output_hidden_states:\n            # Add last layer and transpose to library standard shape [bsz, len, hidden_dim]\n            hids.append(core_out)\n            hids = list(t.transpose(0, 1).contiguous() for t in hids)\n            outputs.append(hids)\n        if self.output_attentions:\n            # Transpose to library standard shape [bsz, n_heads, query_seq_len, key_seq_len]\n            attentions = list(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs.append(attentions)\n\n        return outputs  # last hidden state, new_mems, (all hidden states), (all attentions)\n\n\n@add_start_docstrings(\n    \"\"\"The Transformer-XL Model with a language modeling head on top\n    (adaptive softmax with weights tied to the adaptive input embeddings)\"\"\",\n    TRANSFO_XL_START_DOCSTRING,\n)\nclass TransfoXLLMHeadModel(TransfoXLPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = TransfoXLModel(config)\n        self.sample_softmax = config.sample_softmax\n\n        assert (\n            self.sample_softmax <= 0\n        ), \"Sampling from the softmax is not implemented yet. 
Please look at issue: #3310: https://github.com/huggingface/transformers/issues/3310\"\n\n        self.crit = ProjectedAdaptiveLogSoftmax(\n            config.vocab_size, config.d_embed, config.d_model, config.cutoffs, div_val=config.div_val\n        )\n\n        self.init_weights()\n\n    def tie_weights(self):\n        \"\"\"\n        Run this to be sure output and input (adaptive) softmax weights are tied\n        \"\"\"\n\n        if self.config.tie_weight:\n            for i in range(len(self.crit.out_layers)):\n                self._tie_or_clone_weights(self.crit.out_layers[i], self.transformer.word_emb.emb_layers[i])\n        if self.config.tie_projs:\n            for i, tie_proj in enumerate(self.config.tie_projs):\n                if tie_proj and self.config.div_val == 1 and self.config.d_model != self.config.d_embed:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[0].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[0]\n                elif tie_proj and self.config.div_val != 1:\n                    if self.config.torchscript:\n                        self.crit.out_projs[i] = nn.Parameter(self.transformer.word_emb.emb_projs[i].clone())\n                    else:\n                        self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[i]\n\n    def reset_length(self, tgt_len, ext_len, mem_len):\n        self.transformer.reset_length(tgt_len, ext_len, mem_len)\n\n    def init_mems(self, bsz):\n        return self.transformer.init_mems(bsz)\n\n    @add_start_docstrings_to_callable(TRANSFO_XL_INPUTS_DOCSTRING)\n    def forward(self, input_ids=None, mems=None, head_mask=None, inputs_embeds=None, labels=None):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.TransfoXLConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(batch_size, sequence_length-1)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import TransfoXLTokenizer, TransfoXLLMHeadModel\n        import torch\n\n        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')\n        model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        prediction_scores, mems = outputs[:2]\n\n        \"\"\"\n        if input_ids is not None:\n            bsz, tgt_len = input_ids.size(0), input_ids.size(1)\n        elif inputs_embeds is not None:\n            bsz, tgt_len = inputs_embeds.size(0), inputs_embeds.size(1)\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        transformer_outputs = self.transformer(input_ids, mems=mems, head_mask=head_mask, inputs_embeds=inputs_embeds)\n\n        last_hidden = transformer_outputs[0]\n        pred_hid = last_hidden[:, -tgt_len:]\n        outputs = transformer_outputs[1:]\n\n        softmax_output = self.crit(pred_hid, labels)\n        if labels is None:\n            softmax_output = softmax_output.view(bsz, tgt_len, -1)\n            outputs = [softmax_output] + outputs\n        else:\n            softmax_output = softmax_output.view(bsz, tgt_len - 1)\n            outputs = [softmax_output, None] + outputs\n\n        return outputs  # (loss), logits or None if labels is not None (speed up adaptive softmax), new_mems, (all hidden states), (all attentions)\n\n    def get_output_embeddings(self):\n        \"\"\" Double-check if you are using adaptive softmax.\n        \"\"\"\n        if self.sample_softmax > 0:\n            return self.out_layer\n        else:\n            return self.crit.out_layers[-1]\n\n    def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):\n        inputs = {\"input_ids\": input_ids}\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_transfo_xl_utilities.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Utilities for PyTorch Transformer XL model.\n    Directly adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\n# CUDA_MAJOR = int(torch.version.cuda.split('.')[0])\n# CUDA_MINOR = int(torch.version.cuda.split('.')[1])\n\n\nclass ProjectedAdaptiveLogSoftmax(nn.Module):\n    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, keep_order=False):\n        super().__init__()\n\n        self.n_token = n_token\n        self.d_embed = d_embed\n        self.d_proj = d_proj\n\n        self.cutoffs = cutoffs + [n_token]\n        self.cutoff_ends = [0] + self.cutoffs\n        self.div_val = div_val\n\n        self.shortlist_size = self.cutoffs[0]\n        self.n_clusters = len(self.cutoffs) - 1\n        self.head_size = self.shortlist_size + self.n_clusters\n\n        if self.n_clusters > 0:\n            self.cluster_weight = nn.Parameter(torch.zeros(self.n_clusters, self.d_embed))\n            self.cluster_bias = nn.Parameter(torch.zeros(self.n_clusters))\n\n        self.out_layers = nn.ModuleList()\n        self.out_projs = nn.ParameterList()\n\n        if div_val == 1:\n            for i in range(len(self.cutoffs)):\n                if d_proj != d_embed:\n                    self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))\n                else:\n                    self.out_projs.append(None)\n\n            self.out_layers.append(nn.Linear(d_embed, n_token))\n        else:\n            for i in range(len(self.cutoffs)):\n                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                d_emb_i = d_embed // (div_val ** i)\n\n                self.out_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))\n\n                self.out_layers.append(nn.Linear(d_emb_i, r_idx - l_idx))\n\n        self.keep_order = keep_order\n\n    def _compute_logit(self, hidden, weight, bias, proj):\n        if proj is None:\n            logit = F.linear(hidden, weight, bias=bias)\n        else:\n            # if CUDA_MAJOR <= 9 and CUDA_MINOR <= 1:\n            proj_hid = F.linear(hidden, proj.t().contiguous())\n            logit = F.linear(proj_hid, weight, bias=bias)\n            # else:\n            #     logit = torch.einsum('bd,de,ev->bv', (hidden, proj, weight.t()))\n            #     if bias is not None:\n            #         logit = logit + bias\n\n        return logit\n\n    def forward(self, hidden, labels=None, keep_order=False):\n        \"\"\"\n            Params:\n                hidden :: [len*bsz x d_proj]\n                labels :: [len*bsz]\n            Return:\n                if labels is None:\n                    out :: [len*bsz x n_tokens] log probabilities of tokens over 
the vocabulary\n                else:\n                    out :: [(len-1)*bsz] Negative log likelihood\n            We could replace this implementation by the native PyTorch one\n            if their's had an option to set bias on all clusters in the native one.\n            here: https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138\n        \"\"\"\n\n        if labels is not None:\n            # Shift so that tokens < n predict n\n            hidden = hidden[..., :-1, :].contiguous()\n            labels = labels[..., 1:].contiguous()\n            hidden = hidden.view(-1, hidden.size(-1))\n            labels = labels.view(-1)\n            if hidden.size(0) != labels.size(0):\n                raise RuntimeError(\"Input and labels should have the same size \" \"in the batch dimension.\")\n        else:\n            hidden = hidden.view(-1, hidden.size(-1))\n\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            if labels is not None:\n                out = -F.log_softmax(logit, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)\n            else:\n                out = F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            if labels is None:\n                out = hidden.new_empty((head_logit.size(0), self.n_token))\n            else:\n                out = torch.zeros_like(labels, dtype=hidden.dtype, device=hidden.device)\n\n            offset = 0\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if labels is not None:\n                    mask_i = (labels >= l_idx) & (labels < r_idx)\n                    indices_i = mask_i.nonzero().squeeze()\n\n                    if indices_i.numel() == 0:\n                        continue\n\n                    target_i = labels.index_select(0, indices_i) - l_idx\n                    head_logprob_i = head_logprob.index_select(0, indices_i)\n                    hidden_i = hidden.index_select(0, indices_i)\n                else:\n                    hidden_i = hidden\n\n                if i == 0:\n                    if labels is not None:\n                        logprob_i = head_logprob_i.gather(1, target_i[:, None]).squeeze(1)\n                    else:\n                        out[:, : self.cutoffs[0]] = 
head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden_i, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1  # No probability for the head cluster\n                    if labels is not None:\n                        logprob_i = head_logprob_i[:, cluster_prob_idx] + tail_logprob_i.gather(\n                            1, target_i[:, None]\n                        ).squeeze(1)\n                    else:\n                        logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                        out[:, l_idx:r_idx] = logprob_i\n\n                if labels is not None:\n                    if (hasattr(self, \"keep_order\") and self.keep_order) or keep_order:\n                        out.index_copy_(0, indices_i, -logprob_i)\n                    else:\n                        out[offset : offset + logprob_i.size(0)].copy_(-logprob_i)\n                    offset += logprob_i.size(0)\n\n        return out\n\n    def log_prob(self, hidden):\n        r\"\"\" Computes log probabilities for all :math:`n\\_classes`\n        From: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/adaptive.py\n        Args:\n            hidden (Tensor): a minibatch of examples\n        Returns:\n            log-probabilities of for each class :math:`c`\n            in range :math:`0 <= c <= n\\_classes`, where :math:`n\\_classes` is a\n            parameter passed to ``AdaptiveLogSoftmaxWithLoss`` constructor.\n        Shape:\n            - Input: :math:`(N, in\\_features)`\n            - Output: :math:`(N, n\\_classes)`\n        \"\"\"\n        if self.n_clusters == 0:\n            logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])\n            return F.log_softmax(logit, dim=-1)\n        else:\n            # construct weights and biases\n            weights, biases = [], []\n            for i in range(len(self.cutoffs)):\n                if self.div_val == 1:\n                    l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]\n                    weight_i = self.out_layers[0].weight[l_idx:r_idx]\n                    bias_i = self.out_layers[0].bias[l_idx:r_idx]\n                else:\n                    weight_i = self.out_layers[i].weight\n                    bias_i = self.out_layers[i].bias\n\n                if i == 0:\n                    weight_i = torch.cat([weight_i, self.cluster_weight], dim=0)\n                    bias_i = torch.cat([bias_i, self.cluster_bias], dim=0)\n\n                weights.append(weight_i)\n                biases.append(bias_i)\n\n            head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]\n            head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)\n\n            out = hidden.new_empty((head_logit.size(0), self.n_token))\n            head_logprob = F.log_softmax(head_logit, dim=1)\n\n            cutoff_values = [0] + self.cutoffs\n            for i in range(len(cutoff_values) - 1):\n                start_idx, stop_idx = cutoff_values[i], cutoff_values[i + 1]\n\n                if i == 0:\n                    out[:, : self.cutoffs[0]] = head_logprob[:, : self.cutoffs[0]]\n                else:\n                    weight_i, bias_i, proj_i = weights[i], 
biases[i], self.out_projs[i]\n\n                    tail_logit_i = self._compute_logit(hidden, weight_i, bias_i, proj_i)\n                    tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)\n\n                    # Index of this cluster's token in the head softmax (same indexing as in forward above)\n                    cluster_prob_idx = self.cutoffs[0] + i - 1\n                    logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i\n                    out[:, start_idx:stop_idx] = logprob_i\n\n            return out\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_utils.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors, Facebook AI Research authors and The HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport inspect\nimport logging\nimport os\nfrom typing import Callable, Dict, Iterable, List, Optional, Tuple\n\nimport torch\nfrom torch import Tensor, device, dtype, nn\nfrom torch.nn import CrossEntropyLoss\nfrom torch.nn import functional as F\n\nfrom .activations import get_activation\nfrom .configuration_utils import PretrainedConfig\nfrom .file_utils import (\n    DUMMY_INPUTS,\n    TF2_WEIGHTS_NAME,\n    TF_WEIGHTS_NAME,\n    WEIGHTS_NAME,\n    cached_path,\n    hf_bucket_url,\n    is_remote_url,\n)\n\n\nlogger = logging.getLogger(__name__)\n\n\ntry:\n    from torch.nn import Identity\nexcept ImportError:\n    # Older PyTorch compatibility\n    class Identity(nn.Module):\n        r\"\"\"A placeholder identity operator that is argument-insensitive.\n        \"\"\"\n\n        def __init__(self, *args, **kwargs):\n            super().__init__()\n\n        def forward(self, input):\n            return input\n\n\nclass ModuleUtilsMixin:\n    \"\"\"\n    A few utilities for torch.nn.Modules, to be used as a mixin.\n    \"\"\"\n\n    def num_parameters(self, only_trainable: bool = False) -> int:\n        \"\"\"\n        Get number of (optionally, trainable) parameters in the module.\n        \"\"\"\n        params = filter(lambda x: x.requires_grad, self.parameters()) if only_trainable else self.parameters()\n        return sum(p.numel() for p in params)\n\n    @staticmethod\n    def _hook_rss_memory_pre_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_pre_forward = mem.rss\n        return None\n\n    @staticmethod\n    def _hook_rss_memory_post_forward(module, *args, **kwargs):\n        try:\n            import psutil\n        except (ImportError):\n            raise ImportError(\"You need to install psutil (pip install psutil) to use memory tracing.\")\n\n        process = psutil.Process(os.getpid())\n        mem = process.memory_info()\n        module.mem_rss_post_forward = mem.rss\n        mem_rss_diff = module.mem_rss_post_forward - module.mem_rss_pre_forward\n        module.mem_rss_diff = mem_rss_diff + (module.mem_rss_diff if hasattr(module, \"mem_rss_diff\") else 0)\n        return None\n\n    def add_memory_hooks(self):\n        \"\"\" Add a memory hook before and after each sub-module forward pass to record increase in memory consumption.\n            Increase in memory consumption is stored in a `mem_rss_diff` attribute for each module and can be reset to zero with `model.reset_memory_hooks_state()`\n        \"\"\"\n        for module in 
self.modules():\n            module.register_forward_pre_hook(self._hook_rss_memory_pre_forward)\n            module.register_forward_hook(self._hook_rss_memory_post_forward)\n        self.reset_memory_hooks_state()\n\n    def reset_memory_hooks_state(self):\n        for module in self.modules():\n            module.mem_rss_diff = 0\n            module.mem_rss_post_forward = 0\n            module.mem_rss_pre_forward = 0\n\n    @property\n    def device(self) -> device:\n        \"\"\"\n        Get torch.device from module, assuming that the whole module has one device.\n        \"\"\"\n        try:\n            return next(self.parameters()).device\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].device\n\n    @property\n    def dtype(self) -> dtype:\n        \"\"\"\n        Get torch.dtype from module, assuming that the whole module has one dtype.\n        \"\"\"\n        try:\n            return next(self.parameters()).dtype\n        except StopIteration:\n            # For nn.DataParallel compatibility in PyTorch 1.5\n\n            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:\n                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]\n                return tuples\n\n            gen = self._named_members(get_members_fn=find_tensor_attributes)\n            first_tuple = next(gen)\n            return first_tuple[1].dtype\n\n    def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:\n        \"\"\"type: torch.Tensor -> torch.Tensor\"\"\"\n        if encoder_attention_mask.dim() == 3:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]\n        if encoder_attention_mask.dim() == 2:\n            encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]\n        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition\n        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow\n        # /transformer/transformer_layers.py#L270\n        # encoder_extended_attention_mask = (encoder_extended_attention_mask ==\n        # encoder_extended_attention_mask.transpose(-1, -2))\n        encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n\n        if self.dtype == torch.float16:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e4\n        elif self.dtype == torch.float32:\n            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9\n        else:\n            raise ValueError(\n                \"{} not recognized. 
`dtype` should be set to either `torch.float32` or `torch.float16`\".format(\n                    self.dtype\n                )\n            )\n\n        return encoder_extended_attention_mask\n\n    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple, device: device) -> Tensor:\n        \"\"\"Makes broadcastable attention mask and causal mask so that future and maked tokens are ignored.\n\n        Arguments:\n            attention_mask: torch.Tensor with 1 indicating tokens to ATTEND to\n            input_shape: tuple, shape of input_ids\n            device: torch.Device, usually self.device\n\n        Returns:\n            torch.Tensor with dtype of attention_mask.dtype\n        \"\"\"\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        if attention_mask.dim() == 3:\n            extended_attention_mask = attention_mask[:, None, :, :]\n        elif attention_mask.dim() == 2:\n            # Provided a padding mask of dimensions [batch_size, seq_length]\n            # - if the model is a decoder, apply a causal mask in addition to the padding mask\n            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]\n            if self.config.is_decoder:\n                batch_size, seq_length = input_shape\n                seq_ids = torch.arange(seq_length, device=device)\n                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]\n                # causal and attention masks must have same type with pytorch version < 1.3\n                causal_mask = causal_mask.to(attention_mask.dtype)\n                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n            else:\n                extended_attention_mask = attention_mask[:, None, None, :]\n        else:\n            raise ValueError(\n                \"Wrong shape for input_ids (shape {}) or attention_mask (shape {})\".format(\n                    input_shape, attention_mask.shape\n                )\n            )\n\n        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for\n        # masked positions, this operation will create a tensor which is 0.0 for\n        # positions we want to attend and -10000.0 for masked positions.\n        # Since we are adding it to the raw scores before the softmax, this is\n        # effectively the same as removing these entirely.\n        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility\n        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0\n        return extended_attention_mask\n\n    def get_head_mask(self, head_mask: Tensor, num_hidden_layers: int, is_attention_chunked: bool = False) -> Tensor:\n        \"\"\"\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        attention_probs has shape bsz x n_heads x N x N\n        Arguments:\n            head_mask: torch.Tensor or None: has shape [num_heads] or [num_hidden_layers x num_heads]\n            num_hidden_layers: int\n        Returns:\n             Tensor of shape shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n             or list with [None] for each layer\n        \"\"\"\n        if head_mask is not None:\n            head_mask = 
self._convert_head_mask_to_5d(head_mask, num_hidden_layers)\n            if is_attention_chunked is True:\n                head_mask = head_mask.unsqueeze(-1)\n        else:\n            head_mask = [None] * num_hidden_layers\n\n        return head_mask\n\n    def _convert_head_mask_to_5d(self, head_mask, num_hidden_layers):\n        \"\"\"-> [num_hidden_layers x batch x num_heads x seq_length x seq_length]\"\"\"\n        if head_mask.dim() == 1:\n            head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)\n            head_mask = head_mask.expand(num_hidden_layers, -1, -1, -1, -1)\n        elif head_mask.dim() == 2:\n            head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer\n        assert head_mask.dim() == 5, f\"head_mask.dim != 5, instead {head_mask.dim()}\"\n        head_mask = head_mask.to(dtype=self.dtype)  # switch to fload if need + fp16 compatibility\n        return head_mask\n\n\nclass PreTrainedModel(nn.Module, ModuleUtilsMixin):\n    r\"\"\" Base class for all models.\n\n        :class:`~transformers1.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models\n        as well as a few methods common to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.\n\n        Class attributes (overridden by derived classes):\n            - ``config_class``: a class derived from :class:`~transformers1.PretrainedConfig` to use as configuration class for this model architecture.\n            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:\n\n                - ``model``: an instance of the relevant subclass of :class:`~transformers1.PreTrainedModel`,\n                - ``config``: an instance of the relevant subclass of :class:`~transformers1.PretrainedConfig`,\n                - ``path``: a path (string) to the TensorFlow checkpoint.\n\n            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.\n    \"\"\"\n    config_class = None\n    base_model_prefix = \"\"\n\n    @property\n    def dummy_inputs(self):\n        \"\"\" Dummy inputs to do a forward pass in the network.\n\n        Returns:\n            torch.Tensor with dummy inputs\n        \"\"\"\n        return {\"input_ids\": torch.tensor(DUMMY_INPUTS)}\n\n    def __init__(self, config, *inputs, **kwargs):\n        super().__init__()\n        if not isinstance(config, PretrainedConfig):\n            raise ValueError(\n                \"Parameter config in `{}(config)` should be an instance of class `PretrainedConfig`. 
\"\n                \"To create a model from a pretrained model use \"\n                \"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(\n                    self.__class__.__name__, self.__class__.__name__\n                )\n            )\n        # Save config in model\n        self.config = config\n\n    @property\n    def base_model(self):\n        return getattr(self, self.base_model_prefix, self)\n\n    def get_input_embeddings(self):\n        \"\"\"\n        Returns the model's input embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            return base_model.get_input_embeddings()\n        else:\n            raise NotImplementedError\n\n    def set_input_embeddings(self, value: nn.Module):\n        \"\"\"\n        Set model's input embeddings\n\n        Args:\n            value (:obj:`nn.Module`):\n                A module mapping vocabulary to hidden states.\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)\n        if base_model is not self:\n            base_model.set_input_embeddings(value)\n        else:\n            raise NotImplementedError\n\n    def get_output_embeddings(self):\n        \"\"\"\n        Returns the model's output embeddings.\n\n        Returns:\n            :obj:`nn.Module`:\n                A torch module mapping hidden states to vocabulary.\n        \"\"\"\n        return None  # Overwrite for models with output embeddings\n\n    def tie_weights(self):\n        \"\"\"\n        Tie the weights between the input embeddings and the output embeddings.\n        If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning\n        the weights instead.\n        \"\"\"\n        output_embeddings = self.get_output_embeddings()\n        if output_embeddings is not None:\n            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())\n\n    def _tie_or_clone_weights(self, output_embeddings, input_embeddings):\n        \"\"\" Tie or clone module weights depending of whether we are using TorchScript or not\n        \"\"\"\n        if self.config.torchscript:\n            output_embeddings.weight = nn.Parameter(input_embeddings.weight.clone())\n        else:\n            output_embeddings.weight = input_embeddings.weight\n\n        if getattr(output_embeddings, \"bias\", None) is not None:\n            output_embeddings.bias.data = torch.nn.functional.pad(\n                output_embeddings.bias.data,\n                (0, output_embeddings.weight.shape[0] - output_embeddings.bias.shape[0],),\n                \"constant\",\n                0,\n            )\n        if hasattr(output_embeddings, \"out_features\") and hasattr(input_embeddings, \"num_embeddings\"):\n            output_embeddings.out_features = input_embeddings.num_embeddings\n\n    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None):\n        \"\"\" Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.\n        Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.\n\n        Arguments:\n\n            new_num_tokens: (`optional`) int:\n                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. 
Reducing the size will remove vectors from the end.\n                If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model.\n\n        Return: ``torch.nn.Embeddings``\n            Pointer to the input tokens Embeddings Module of the model\n        \"\"\"\n        base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed\n        model_embeds = base_model._resize_token_embeddings(new_num_tokens)\n        if new_num_tokens is None:\n            return model_embeds\n\n        # Update base model and current model config\n        self.config.vocab_size = new_num_tokens\n        base_model.vocab_size = new_num_tokens\n\n        # Tie weights again if needed\n        self.tie_weights()\n\n        return model_embeds\n\n    def _resize_token_embeddings(self, new_num_tokens):\n        old_embeddings = self.get_input_embeddings()\n        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)\n        self.set_input_embeddings(new_embeddings)\n        return self.get_input_embeddings()\n\n    def _get_resized_embeddings(\n        self, old_embeddings: torch.nn.Embedding, new_num_tokens: Optional[int] = None\n    ) -> torch.nn.Embedding:\n        \"\"\" Build a resized Embedding Module from a provided token Embedding Module.\n            Increasing the size will add newly initialized vectors at the end\n            Reducing the size will remove vectors from the end\n\n        Args:\n            old_embeddings: ``torch.nn.Embedding``\n                Old embeddings to be resized.\n            new_num_tokens: (`optional`) int\n                New number of tokens in the embedding matrix.\n                Increasing the size will add newly initialized vectors at the end\n                Reducing the size will remove vectors from the end\n                If not provided or None: return the provided token Embedding Module.\n        Return: ``torch.nn.Embedding``\n            Pointer to the resized Embedding Module or the old Embedding Module if new_num_tokens is None\n        \"\"\"\n        if new_num_tokens is None:\n            return old_embeddings\n\n        old_num_tokens, old_embedding_dim = old_embeddings.weight.size()\n        if old_num_tokens == new_num_tokens:\n            return old_embeddings\n\n        # Build new embeddings\n        new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)\n        new_embeddings.to(old_embeddings.weight.device)\n\n        # initialize all new embeddings (in particular added tokens)\n        self._init_weights(new_embeddings)\n\n        # Copy token embeddings from the previous weights\n        num_tokens_to_copy = min(old_num_tokens, new_num_tokens)\n        new_embeddings.weight.data[:num_tokens_to_copy, :] = old_embeddings.weight.data[:num_tokens_to_copy, :]\n\n        return new_embeddings\n\n    def init_weights(self):\n        \"\"\" Initialize and prunes weights if needed. 
\"\"\"\n        # Initialize weights\n        self.apply(self._init_weights)\n\n        # Prune heads if needed\n        if self.config.pruned_heads:\n            self.prune_heads(self.config.pruned_heads)\n\n        # Tie weights if needed\n        self.tie_weights()\n\n    def prune_heads(self, heads_to_prune: Dict):\n        \"\"\" Prunes heads of the base model.\n\n            Arguments:\n\n                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).\n                E.g. {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.\n        \"\"\"\n        # save new sets of pruned heads as union of previously stored pruned heads and newly pruned heads\n        for layer, heads in heads_to_prune.items():\n            union_heads = set(self.config.pruned_heads.get(layer, [])) | set(heads)\n            self.config.pruned_heads[layer] = list(union_heads)  # Unfortunately we have to store it as list for JSON\n\n        self.base_model._prune_heads(heads_to_prune)\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save a model and its configuration file to a directory, so that it\n            can be re-loaded using the `:func:`~transformers1.PreTrainedModel.from_pretrained`` class method.\n\n            Arguments:\n                save_directory: directory to which to save.\n        \"\"\"\n        assert os.path.isdir(\n            save_directory\n        ), \"Saving path should be a directory where the model and configuration can be saved\"\n\n        # Only save the model itself if we are using distributed training\n        model_to_save = self.module if hasattr(self, \"module\") else self\n\n        # Attach architecture to the config\n        model_to_save.config.architectures = [model_to_save.__class__.__name__]\n\n        # If we save using the predefined names, we can load using `from_pretrained`\n        output_model_file = os.path.join(save_directory, WEIGHTS_NAME)\n\n        if getattr(self.config, \"xla_device\", False):\n            import torch_xla.core.xla_model as xm\n\n            if xm.is_master_ordinal():\n                # Save configuration file\n                model_to_save.config.save_pretrained(save_directory)\n            # xm.save takes care of saving only from master\n            xm.save(model_to_save.state_dict(), output_model_file)\n        else:\n            model_to_save.config.save_pretrained(save_directory)\n            torch.save(model_to_save.state_dict(), output_model_file)\n\n        logger.info(\"Model weights saved in {}\".format(output_model_file))\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):\n        r\"\"\"Instantiate a pretrained pytorch model from a pre-trained model configuration.\n\n        The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)\n        To train the model, you should first set it back in training mode with ``model.train()``\n\n        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.\n        It is up to you to train those weights with a downstream fine-tuning task.\n\n        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.\n\n        Parameters:\n            
pretrained_model_name_or_path: either:\n              - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.\n              - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n              - a path to a `directory` containing model weights saved using :func:`~transformers1.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.\n              - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.\n              - None if you are both providing the configuration and state dictionary (resp. with keyword arguments ``config`` and ``state_dict``)\n\n            model_args: (`optional`) Sequence of positional arguments:\n                All remaning positional arguments will be passed to the underlying model's ``__init__`` method\n\n            config: (`optional`) one of:\n                - an instance of a class derived from :class:`~transformers1.PretrainedConfig`, or\n                - a string valid as input to :func:`~transformers1.PretrainedConfig.from_pretrained()`\n                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:\n                    - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or\n                    - the model was saved using :func:`~transformers1.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.\n                    - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.\n\n            state_dict: (`optional`) dict:\n                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.\n                This option can be used if you want to create a model from a pretrained configuration but load your own weights.\n                In this case though, you should check if using :func:`~transformers1.PreTrainedModel.save_pretrained` and :func:`~transformers1.PreTrainedModel.from_pretrained` is not a simpler option.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded pre-trained model\n                configuration should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. 
Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            output_loading_info: (`optional`) boolean:\n                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.\n\n            kwargs: (`optional`) Remaining dictionary of keyword arguments:\n                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:\n\n                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)\n                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers1.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.\n\n        Examples::\n\n            # For example purposes. Not runnable.\n            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.\n            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. 
model was saved using `save_pretrained('./test/saved_model/')`\n            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading\n            assert model.config.output_attention == True\n            # Loading from a TF checkpoint file instead of a PyTorch model (slower)\n            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')\n            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        state_dict = kwargs.pop(\"state_dict\", None)\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        from_tf = kwargs.pop(\"from_tf\", False)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        output_loading_info = kwargs.pop(\"output_loading_info\", False)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n        use_cdn = kwargs.pop(\"use_cdn\", True)\n\n        # Load config if we don't provide a configuration\n        if not isinstance(config, PretrainedConfig):\n            config_path = config if config is not None else pretrained_model_name_or_path\n            config, model_kwargs = cls.config_class.from_pretrained(\n                config_path,\n                *model_args,\n                cache_dir=cache_dir,\n                return_unused_kwargs=True,\n                force_download=force_download,\n                resume_download=resume_download,\n                proxies=proxies,\n                local_files_only=local_files_only,\n                **kwargs,\n            )\n        else:\n            model_kwargs = kwargs\n\n        # Load model\n        if pretrained_model_name_or_path is not None:\n            if os.path.isdir(pretrained_model_name_or_path):\n                if from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")):\n                    # Load from a TF 1.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + \".index\")\n                elif from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):\n                    # Load from a TF 2.0 checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)\n                elif os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):\n                    # Load from a PyTorch checkpoint\n                    archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)\n                else:\n                    raise EnvironmentError(\n                        \"Error no file named {} found in directory {} or `from_tf` set to False\".format(\n                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + \".index\"],\n                            pretrained_model_name_or_path,\n                        )\n                    )\n            elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                archive_file = pretrained_model_name_or_path\n            elif os.path.isfile(pretrained_model_name_or_path + \".index\"):\n                assert (\n                    from_tf\n                ), \"We found a TensorFlow checkpoint at {}, please set from_tf to True to load from 
this checkpoint\".format(\n                    pretrained_model_name_or_path + \".index\"\n                )\n                archive_file = pretrained_model_name_or_path + \".index\"\n            else:\n                archive_file = hf_bucket_url(\n                    pretrained_model_name_or_path,\n                    filename=(TF2_WEIGHTS_NAME if from_tf else WEIGHTS_NAME),\n                    use_cdn=use_cdn,\n                )\n\n            try:\n                # Load from URL or cache if already cached\n                resolved_archive_file = cached_path(\n                    archive_file,\n                    cache_dir=cache_dir,\n                    force_download=force_download,\n                    proxies=proxies,\n                    resume_download=resume_download,\n                    local_files_only=local_files_only,\n                )\n                if resolved_archive_file is None:\n                    raise EnvironmentError\n            except EnvironmentError:\n                msg = (\n                    f\"Can't load weights for '{pretrained_model_name_or_path}'. Make sure that:\\n\\n\"\n                    f\"- '{pretrained_model_name_or_path}' is a correct model identifier listed on 'https://huggingface.co/models'\\n\\n\"\n                    f\"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {WEIGHTS_NAME}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME}.\\n\\n\"\n                )\n                raise EnvironmentError(msg)\n\n            if resolved_archive_file == archive_file:\n                logger.info(\"loading weights file {}\".format(archive_file))\n            else:\n                logger.info(\"loading weights file {} from cache at {}\".format(archive_file, resolved_archive_file))\n        else:\n            resolved_archive_file = None\n\n        # Instantiate model.\n        model = cls(config, *model_args, **model_kwargs)\n\n        if state_dict is None and not from_tf:\n            try:\n                state_dict = torch.load(resolved_archive_file, map_location=\"cpu\")\n            except Exception:\n                raise OSError(\n                    \"Unable to load weights from pytorch checkpoint file. \"\n                    \"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. \"\n                )\n\n        missing_keys = []\n        unexpected_keys = []\n        error_msgs = []\n\n        if from_tf:\n            if resolved_archive_file.endswith(\".index\"):\n                # Load from a TensorFlow 1.X checkpoint - provided by original authors\n                model = cls.load_tf_weights(model, config, resolved_archive_file[:-6])  # Remove the '.index'\n            else:\n                # Load from our TensorFlow 2.0 checkpoints\n                try:\n                    from transformers import load_tf2_checkpoint_in_pytorch_model\n\n                    model = load_tf2_checkpoint_in_pytorch_model(model, resolved_archive_file, allow_missing_keys=True)\n                except ImportError:\n                    logger.error(\n                        \"Loading a TensorFlow model in PyTorch, requires both PyTorch and TensorFlow to be installed. 
Please see \"\n                        \"https://pytorch.org/ and https://www.tensorflow.org/install/ for installation instructions.\"\n                    )\n                    raise\n        else:\n            # Convert old format to new format if needed from a PyTorch state_dict\n            old_keys = []\n            new_keys = []\n            for key in state_dict.keys():\n                new_key = None\n                if \"gamma\" in key:\n                    new_key = key.replace(\"gamma\", \"weight\")\n                if \"beta\" in key:\n                    new_key = key.replace(\"beta\", \"bias\")\n                if new_key:\n                    old_keys.append(key)\n                    new_keys.append(new_key)\n            for old_key, new_key in zip(old_keys, new_keys):\n                state_dict[new_key] = state_dict.pop(old_key)\n\n            # copy state_dict so _load_from_state_dict can modify it\n            metadata = getattr(state_dict, \"_metadata\", None)\n            state_dict = state_dict.copy()\n            if metadata is not None:\n                state_dict._metadata = metadata\n\n            # PyTorch's `_load_from_state_dict` does not copy parameters in a module's descendants\n            # so we need to apply the function recursively.\n            def load(module: nn.Module, prefix=\"\"):\n                local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})\n                module._load_from_state_dict(\n                    state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs,\n                )\n                for name, child in module._modules.items():\n                    if child is not None:\n                        load(child, prefix + name + \".\")\n\n            # Make sure we are able to load base models as well as derived models (with heads)\n            start_prefix = \"\"\n            model_to_load = model\n            has_prefix_module = any(s.startswith(cls.base_model_prefix) for s in state_dict.keys())\n            if not hasattr(model, cls.base_model_prefix) and has_prefix_module:\n                start_prefix = cls.base_model_prefix + \".\"\n            if hasattr(model, cls.base_model_prefix) and not has_prefix_module:\n                model_to_load = getattr(model, cls.base_model_prefix)\n\n            load(model_to_load, prefix=start_prefix)\n\n            if model.__class__.__name__ != model_to_load.__class__.__name__:\n                base_model_state_dict = model_to_load.state_dict().keys()\n                head_model_state_dict_without_base_prefix = [\n                    key.split(cls.base_model_prefix + \".\")[-1] for key in model.state_dict().keys()\n                ]\n\n                missing_keys.extend(head_model_state_dict_without_base_prefix - base_model_state_dict)\n\n            if len(missing_keys) > 0:\n                logger.info(\n                    \"Weights of {} not initialized from pretrained model: {}\".format(\n                        model.__class__.__name__, missing_keys\n                    )\n                )\n            if len(unexpected_keys) > 0:\n                logger.info(\n                    \"Weights from pretrained model not used in {}: {}\".format(\n                        model.__class__.__name__, unexpected_keys\n                    )\n                )\n            if len(error_msgs) > 0:\n                raise RuntimeError(\n                    \"Error(s) in loading state_dict for {}:\\n\\t{}\".format(\n                        
model.__class__.__name__, \"\\n\\t\".join(error_msgs)\n                    )\n                )\n        model.tie_weights()  # make sure token embedding weights are still tied if needed\n\n        # Set model in evaluation mode to deactivate DropOut modules by default\n        model.eval()\n\n        if output_loading_info:\n            loading_info = {\n                \"missing_keys\": missing_keys,\n                \"unexpected_keys\": unexpected_keys,\n                \"error_msgs\": error_msgs,\n            }\n            return model, loading_info\n\n        if hasattr(config, \"xla_device\") and config.xla_device:\n            import torch_xla.core.xla_model as xm\n\n            model = xm.send_cpu_data_to_device(model, xm.xla_device())\n            model.to(xm.xla_device())\n\n        return model\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        return {\"input_ids\": input_ids}\n\n    def prepare_logits_for_generation(self, logits, **kwargs):\n        return logits\n\n    def _use_cache(self, outputs, use_cache):\n        \"\"\"During generation, decide whether to pass the `past` variable to the next forward pass.\"\"\"\n        if len(outputs) <= 1 or use_cache is False:\n            return False\n        if hasattr(self.config, \"mem_len\") and self.config.mem_len == 0:\n            return False\n        return True\n\n    def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):\n        \"\"\"repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). \"\"\"\n        for i in range(batch_size * num_beams):\n            for previous_token in set(prev_output_tokens[i].tolist()):\n                # if score < 0 then repetition penalty has to multiplied to reduce the previous token probability\n                if lprobs[i, previous_token] < 0:\n                    lprobs[i, previous_token] *= repetition_penalty\n                else:\n                    lprobs[i, previous_token] /= repetition_penalty\n\n    @torch.no_grad()\n    def generate(\n        self,\n        input_ids: Optional[torch.LongTensor] = None,\n        max_length: Optional[int] = None,\n        min_length: Optional[int] = None,\n        do_sample: Optional[bool] = None,\n        early_stopping: Optional[bool] = None,\n        num_beams: Optional[int] = None,\n        temperature: Optional[float] = None,\n        top_k: Optional[int] = None,\n        top_p: Optional[float] = None,\n        repetition_penalty: Optional[float] = None,\n        bad_words_ids: Optional[Iterable[int]] = None,\n        bos_token_id: Optional[int] = None,\n        pad_token_id: Optional[int] = None,\n        eos_token_id: Optional[int] = None,\n        length_penalty: Optional[float] = None,\n        no_repeat_ngram_size: Optional[int] = None,\n        num_return_sequences: Optional[int] = None,\n        attention_mask: Optional[torch.LongTensor] = None,\n        decoder_start_token_id: Optional[int] = None,\n        use_cache: Optional[bool] = None,\n        **model_specific_kwargs\n    ) -> torch.LongTensor:\n        r\"\"\" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.\n\n        Adapted in part from `Facebook's XLM beam search code`_.\n\n        .. 
_`Facebook's XLM beam search code`:\n           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529\n\n\n        Parameters:\n\n            input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`\n                The sequence used as a prompt for the generation. If `None` the method initializes\n                it as an empty `torch.LongTensor` of shape `(1,)`.\n\n            max_length: (`optional`) int\n                The max length of the sequence to be generated.  Between `min_length` and infinity. Default to 20.\n\n            min_length: (`optional`) int\n                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.\n\n            do_sample: (`optional`) bool\n                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            early_stopping: (`optional`) bool\n                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.\n\n            num_beams: (`optional`) int\n                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.\n\n            temperature: (`optional`) float\n                The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.\n\n            top_k: (`optional`) int\n                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.\n\n            top_p: (`optional`) float\n                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.\n\n            repetition_penalty: (`optional`) float\n                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.\n\n            pad_token_id: (`optional`) int\n                Padding token. Default to specicic model pad_token_id or None if it does not exist.\n\n            bos_token_id: (`optional`) int\n                BOS token. Defaults to `bos_token_id` as defined in the models config.\n\n            eos_token_id: (`optional`) int\n                EOS token. Defaults to `eos_token_id` as defined in the models config.\n\n            length_penalty: (`optional`) float\n                Exponential penalty to the length. Default to 1.\n\n            no_repeat_ngram_size: (`optional`) int\n                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.\n            bad_words_ids: (`optional`) list of lists of int\n                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.\n\n            num_return_sequences: (`optional`) int\n                The number of independently computed returned sequences for each element in the batch. 
Default to 1.\n\n            attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`\n                Mask to avoid performing attention on padding token indices.\n                Mask values selected in ``[0, 1]``:\n                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n                Defaults to `None`.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n\n            decoder_start_token_id=None: (`optional`) int\n                If an encoder-decoder model starts decoding with a different token than BOS.\n                Defaults to `None` and is changed to `BOS` later.\n\n            use_cache: (`optional`) bool\n                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.\n\n            model_specific_kwargs: (`optional`) dict\n                Additional model specific kwargs will be forwarded to the `forward` function of the model.\n\n        Return:\n\n            output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`\n                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`\n\n        Examples::\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            outputs = model.generate(max_length=40)  # do greedy decoding\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'The dog'\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling\n            for i in range(3): #  3 output sequences were generated\n                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.\n            input_context = 'Legal My neighbor is'  # \"Legal\" is one of the control codes for ctrl\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode 
input context\n            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences\n            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))\n\n            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer\n            model = AutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.\n            input_context = 'My cute dog'  # \"Legal\" is one of the control codes for ctrl\n            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]\n            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context\n            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated\n        \"\"\"\n\n        # We cannot generate if the model does not have a LM head\n        if self.get_output_embeddings() is None:\n            raise AttributeError(\n                \"You tried to generate sequences with a model that does not have a LM Head.\"\n                \"Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )\"\n            )\n\n        max_length = max_length if max_length is not None else self.config.max_length\n        min_length = min_length if min_length is not None else self.config.min_length\n        do_sample = do_sample if do_sample is not None else self.config.do_sample\n        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n        num_beams = num_beams if num_beams is not None else self.config.num_beams\n        temperature = temperature if temperature is not None else self.config.temperature\n        top_k = top_k if top_k is not None else self.config.top_k\n        top_p = top_p if top_p is not None else self.config.top_p\n        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty\n        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id\n        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id\n        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id\n        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty\n        no_repeat_ngram_size = (\n            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size\n        )\n        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids\n        num_return_sequences = (\n            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences\n        )\n        decoder_start_token_id = (\n            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id\n        )\n\n        if input_ids is not None:\n            batch_size = input_ids.shape[0]  # overriden by the input batch_size\n        else:\n            batch_size = 1\n\n        assert isinstance(max_length, int) and max_length > 0, 
\"`max_length` should be a strictly positive integer.\"\n        assert isinstance(min_length, int) and min_length >= 0, \"`min_length` should be a positive integer.\"\n        assert isinstance(do_sample, bool), \"`do_sample` should be a boolean.\"\n        assert isinstance(early_stopping, bool), \"`early_stopping` should be a boolean.\"\n        assert isinstance(use_cache, bool), \"`use_cache` should be a boolean.\"\n        assert isinstance(num_beams, int) and num_beams > 0, \"`num_beams` should be a strictly positive integer.\"\n        assert temperature > 0, \"`temperature` should be strictly positive.\"\n        assert isinstance(top_k, int) and top_k >= 0, \"`top_k` should be a positive integer.\"\n        assert 0 <= top_p <= 1, \"`top_p` should be between 0 and 1.\"\n        assert repetition_penalty >= 1.0, \"`repetition_penalty` should be >= 1.\"\n        assert input_ids is not None or (\n            isinstance(bos_token_id, int) and bos_token_id >= 0\n        ), \"If input_ids is not defined, `bos_token_id` should be a positive integer.\"\n        assert pad_token_id is None or (\n            isinstance(pad_token_id, int) and (pad_token_id >= 0)\n        ), \"`pad_token_id` should be a positive integer.\"\n        assert (eos_token_id is None) or (\n            isinstance(eos_token_id, int) and (eos_token_id >= 0)\n        ), \"`eos_token_id` should be a positive integer.\"\n        assert length_penalty > 0, \"`length_penalty` should be strictly positive.\"\n        assert (\n            isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0\n        ), \"`no_repeat_ngram_size` should be a positive integer.\"\n        assert (\n            isinstance(num_return_sequences, int) and num_return_sequences > 0\n        ), \"`num_return_sequences` should be a strictly positive integer.\"\n        assert (\n            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)\n        ), \"`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated\"\n\n        if input_ids is None:\n            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (\n                \"you should either supply a context to complete as `input_ids` input \"\n                \"or a `bos_token_id` (integer >= 0) as a first token to start the generation.\"\n            )\n            input_ids = torch.full(\n                (batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,\n            )\n        else:\n            assert input_ids.dim() == 2, \"Input prompt should be of shape (batch_size, sequence length).\"\n\n        # not allow to duplicate outputs when greedy decoding\n        if do_sample is False:\n            if num_beams == 1:\n                # no_beam_search greedy generation conditions\n                assert (\n                    num_return_sequences == 1\n                ), \"Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1\"\n\n            else:\n                # beam_search greedy generation conditions\n                assert (\n                    num_beams >= num_return_sequences\n                ), \"Greedy beam search decoding cannot return more sequences than it has beams. 
Please set num_beams >= num_return_sequences\"\n\n        # create attention mask if necessary\n        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140\n        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):\n            attention_mask = input_ids.ne(pad_token_id).long()\n        elif attention_mask is None:\n            attention_mask = input_ids.new_ones(input_ids.shape)\n\n        # set pad_token_id to eos_token_id if not set. Important that this is done after\n        # attention_mask is created\n        if pad_token_id is None and eos_token_id is not None:\n            logger.warning(\n                \"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence\".format(eos_token_id)\n            )\n            pad_token_id = eos_token_id\n\n        # current position and vocab size\n        if hasattr(self.config, \"vocab_size\"):\n            vocab_size = self.config.vocab_size\n        elif (\n            self.config.is_encoder_decoder\n            and hasattr(self.config, \"decoder\")\n            and hasattr(self.config.decoder, \"vocab_size\")\n        ):\n            vocab_size = self.config.decoder.vocab_size\n\n        # set effective batch size and effective batch multiplier according to do_sample\n        if do_sample:\n            effective_batch_size = batch_size * num_return_sequences\n            effective_batch_mult = num_return_sequences\n        else:\n            effective_batch_size = batch_size\n            effective_batch_mult = 1\n\n        if self.config.is_encoder_decoder:\n            if decoder_start_token_id is None:\n                decoder_start_token_id = bos_token_id\n\n            assert (\n                decoder_start_token_id is not None\n            ), \"decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation\"\n            assert hasattr(self, \"get_encoder\"), \"{} should have a 'get_encoder' function defined\".format(self)\n            assert callable(self.get_encoder), \"{} should be a method\".format(self.get_encoder)\n\n            # get encoder and store encoder outputs\n            encoder = self.get_encoder()\n\n            encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)\n\n        # Expand input ids if num_beams > 1 or num_return_sequences > 1\n        if num_return_sequences > 1 or num_beams > 1:\n            input_ids_len = input_ids.shape[-1]\n            input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)\n            attention_mask = attention_mask.unsqueeze(1).expand(\n                batch_size, effective_batch_mult * num_beams, input_ids_len\n            )\n\n            input_ids = input_ids.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n            attention_mask = attention_mask.contiguous().view(\n                effective_batch_size * num_beams, input_ids_len\n            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)\n\n        if self.config.is_encoder_decoder:\n            # create empty decoder_input_ids\n            input_ids = torch.full(\n                (effective_batch_size * num_beams, 1),\n                decoder_start_token_id,\n                dtype=torch.long,\n                device=next(self.parameters()).device,\n            )\n            cur_len = 
1\n\n            assert (\n                batch_size == encoder_outputs[0].shape[0]\n            ), f\"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} \"\n\n            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)\n            expanded_batch_idxs = (\n                torch.arange(batch_size)\n                .view(-1, 1)\n                .repeat(1, num_beams * effective_batch_mult)\n                .view(-1)\n                .to(input_ids.device)\n            )\n            # expand encoder_outputs\n            encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])\n\n        else:\n            encoder_outputs = None\n            cur_len = input_ids.shape[-1]\n\n        if num_beams > 1:\n            output = self._generate_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                early_stopping=early_stopping,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                num_return_sequences=num_return_sequences,\n                length_penalty=length_penalty,\n                num_beams=num_beams,\n                vocab_size=vocab_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n        else:\n            output = self._generate_no_beam_search(\n                input_ids,\n                cur_len=cur_len,\n                max_length=max_length,\n                min_length=min_length,\n                do_sample=do_sample,\n                temperature=temperature,\n                top_k=top_k,\n                top_p=top_p,\n                repetition_penalty=repetition_penalty,\n                no_repeat_ngram_size=no_repeat_ngram_size,\n                bad_words_ids=bad_words_ids,\n                bos_token_id=bos_token_id,\n                pad_token_id=pad_token_id,\n                decoder_start_token_id=decoder_start_token_id,\n                eos_token_id=eos_token_id,\n                batch_size=effective_batch_size,\n                encoder_outputs=encoder_outputs,\n                attention_mask=attention_mask,\n                use_cache=use_cache,\n                model_specific_kwargs=model_specific_kwargs,\n            )\n\n        return output\n\n    def _generate_no_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        
model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example without beam search (num_beams == 1).\n            All returned sequences are generated independently.\n        \"\"\"\n        # length of generated sentences / unfinished sentences\n        unfinished_sents = input_ids.new(batch_size).fill_(1)\n        sent_lengths = input_ids.new(batch_size).fill_(max_length)\n\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n\n            outputs = self(**model_inputs)\n            next_token_logits = outputs[0][:, -1, :]\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(next_token_logits, batch_size, 1, input_ids, repetition_penalty)\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for batch_idx in range(batch_size):\n                    next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float(\"inf\")\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                next_token_logits[:, eos_token_id] = -float(\"inf\")\n\n            if do_sample:\n                # Temperature (higher temperature => more likely to sample low probability tokens)\n                if temperature != 1.0:\n                    next_token_logits = next_token_logits / temperature\n                # Top-p/top-k filtering\n                next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)\n                # Sample\n                probs = F.softmax(next_token_logits, dim=-1)\n                next_token = torch.multinomial(probs, num_samples=1).squeeze(1)\n            else:\n                # Greedy decoding\n                next_token = torch.argmax(next_token_logits, dim=-1)\n\n            # update generations and finished sentences\n            if eos_token_id is not None:\n                # pad finished sentences if eos_token_id exists\n                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)\n            else:\n                tokens_to_add = next_token\n\n            # add token and increase length by one\n            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)\n            cur_len = cur_len + 
1\n\n            if eos_token_id is not None:\n                eos_in_sents = tokens_to_add == eos_token_id\n                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length\n                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()\n                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)\n                # unfinished_sents is set to zero if eos in sentence\n                unfinished_sents.mul_((~eos_in_sents).long())\n\n            # stop when there is a </s> in each sentence, or if we exceed the maximul length\n            if unfinished_sents.max() == 0:\n                break\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # if there are different sentences lengths in the batch, some batches have to be padded\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined if batches have different lengths\"\n            # finished sents are filled with pad_token\n            decoded = input_ids.new(batch_size, sent_lengths.max().item()).fill_(pad_token_id)\n        else:\n            decoded = input_ids\n\n        for hypo_idx, hypo in enumerate(input_ids):\n            decoded[hypo_idx, : sent_lengths[hypo_idx]] = hypo[: sent_lengths[hypo_idx]]\n\n        return decoded\n\n    def _generate_beam_search(\n        self,\n        input_ids,\n        cur_len,\n        max_length,\n        min_length,\n        do_sample,\n        early_stopping,\n        temperature,\n        top_k,\n        top_p,\n        repetition_penalty,\n        no_repeat_ngram_size,\n        bad_words_ids,\n        bos_token_id,\n        pad_token_id,\n        eos_token_id,\n        decoder_start_token_id,\n        batch_size,\n        num_return_sequences,\n        length_penalty,\n        num_beams,\n        vocab_size,\n        encoder_outputs,\n        attention_mask,\n        use_cache,\n        model_specific_kwargs,\n    ):\n        \"\"\" Generate sequences for each example with beam search.\n        \"\"\"\n\n        # generated hypotheses\n        generated_hyps = [\n            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)\n            for _ in range(batch_size)\n        ]\n\n        # scores for each sentence in the beam\n        beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)\n\n        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times\n        if do_sample is False:\n            beam_scores[:, 1:] = -1e9\n        beam_scores = beam_scores.view(-1)  # shape (batch_size * num_beams,)\n\n        # cache compute states\n        past = encoder_outputs  # defined for encoder-decoder models, None for decoder-only models\n\n        # done sentences\n        done = [False for _ in range(batch_size)]\n\n        while cur_len < max_length:\n            model_inputs = self.prepare_inputs_for_generation(\n                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs\n            )\n            outputs = 
self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)\n            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)\n\n            # if model has past, then set the past variable to speed up decoding\n            if self._use_cache(outputs, use_cache):\n                past = outputs[1]\n\n            # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)\n            if repetition_penalty != 1.0:\n                self.enforce_repetition_penalty_(\n                    next_token_logits, batch_size, num_beams, input_ids, repetition_penalty,\n                )\n\n            if temperature != 1.0:\n                next_token_logits = next_token_logits / temperature\n\n            if self.config.is_encoder_decoder and do_sample is False:\n                # TODO (PVP) still a bit hacky here - there might be a better solution\n                next_token_logits = self.prepare_logits_for_generation(\n                    next_token_logits, cur_len=cur_len, max_length=max_length\n                )\n\n            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)\n\n            # set eos token prob to zero if min_length is not reached\n            if eos_token_id is not None and cur_len < min_length:\n                scores[:, eos_token_id] = -float(\"inf\")\n\n            if no_repeat_ngram_size > 0:\n                # calculate a list of banned tokens to prevent repetitively generating the same ngrams\n                num_batch_hypotheses = batch_size * num_beams\n                # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345\n                banned_batch_tokens = calc_banned_ngram_tokens(\n                    input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len\n                )\n                for i, banned_tokens in enumerate(banned_batch_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            if bad_words_ids is not None:\n                # calculate a list of banned tokens according to bad words\n                banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)\n\n                for i, banned_tokens in enumerate(banned_tokens):\n                    scores[i, banned_tokens] = -float(\"inf\")\n\n            assert scores.shape == (batch_size * num_beams, vocab_size), \"Shapes of scores: {} != {}\".format(\n                scores.shape, (batch_size * num_beams, vocab_size)\n            )\n\n            if do_sample:\n                _scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n                # Top-p/top-k filtering\n                _scores = top_k_top_p_filtering(\n                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2\n                )  # (batch_size * num_beams, vocab_size)\n                # re-organize to group the beam together to sample from all beam_idxs\n                _scores = _scores.contiguous().view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)\n                probs = F.softmax(_scores, dim=-1)\n                next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)  # (batch_size, num_beams * 2)\n                # Compute next scores\n                
next_scores = torch.gather(_scores, -1, next_tokens)  # (batch_size, num_beams * 2)\n                # sort the sampled vector to make sure that the first num_beams samples are the best\n                next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)\n                next_tokens = torch.gather(next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)\n\n            else:\n                next_scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)\n\n                # re-organize to group the beam together (we are keeping top hypothesis accross beams)\n                next_scores = next_scores.view(\n                    batch_size, num_beams * vocab_size\n                )  # (batch_size, num_beams * vocab_size)\n\n                next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)\n\n            assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)\n\n            # next batch beam content\n            next_batch_beam = []\n\n            # for each sentence\n            for batch_idx in range(batch_size):\n\n                # if we are done with this sentence\n                if done[batch_idx]:\n                    assert (\n                        len(generated_hyps[batch_idx]) >= num_beams\n                    ), \"Batch can only be done if at least {} beams have been generated\".format(num_beams)\n                    assert (\n                        eos_token_id is not None and pad_token_id is not None\n                    ), \"generated beams >= num_beams -> eos_token_id and pad_token have to be defined\"\n                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch\n                    continue\n\n                # next sentence beam content\n                next_sent_beam = []\n\n                # next tokens for this sentence\n                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(\n                    zip(next_tokens[batch_idx], next_scores[batch_idx])\n                ):\n                    # get beam and token IDs\n                    beam_id = beam_token_id // vocab_size\n                    token_id = beam_token_id % vocab_size\n\n                    effective_beam_id = batch_idx * num_beams + beam_id\n                    # add to generated hypotheses if end of sentence or last iteration\n                    if (eos_token_id is not None) and (token_id.item() == eos_token_id):\n                        # if beam_token does not belong to top num_beams tokens, it should not be added\n                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams\n                        if is_beam_token_worse_than_top_num_beams:\n                            continue\n                        generated_hyps[batch_idx].add(\n                            input_ids[effective_beam_id].clone(), beam_token_score.item(),\n                        )\n                    else:\n                        # add next predicted token if it is not eos_token\n                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))\n\n                    # the beam for next step is full\n                    if len(next_sent_beam) == num_beams:\n                        break\n\n                # Check if were done so that we can save a pad step if all(done)\n                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(\n 
                   next_scores[batch_idx].max().item(), cur_len=cur_len\n                )\n\n                # update next beam content\n                assert len(next_sent_beam) == num_beams, \"Beam should always be full\"\n                next_batch_beam.extend(next_sent_beam)\n                assert len(next_batch_beam) == num_beams * (batch_idx + 1)\n\n            # stop when we are done with each sentence\n            if all(done):\n                break\n\n            # sanity check / prepare next batch\n            assert len(next_batch_beam) == batch_size * num_beams\n            beam_scores = beam_scores.new([x[0] for x in next_batch_beam])\n            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])\n            beam_idx = input_ids.new([x[2] for x in next_batch_beam])\n\n            # re-order batch and update current length\n            input_ids = input_ids[beam_idx, :]\n            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)\n            cur_len = cur_len + 1\n\n            # re-order internal states\n            if past is not None:\n                past = self._reorder_cache(past, beam_idx)\n\n            # extend attention_mask for new generated input if only decoder\n            if self.config.is_encoder_decoder is False:\n                attention_mask = torch.cat(\n                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1\n                )\n\n        # finalize all open beam hypotheses and end to generated hypotheses\n        for batch_idx in range(batch_size):\n            if done[batch_idx]:\n                continue\n\n            # test that beam scores match previously calculated scores if not eos and batch_idx not done\n            if eos_token_id is not None and all(\n                (token_id % vocab_size).item() is not eos_token_id for token_id in next_tokens[batch_idx]\n            ):\n                assert torch.all(\n                    next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]\n                ), \"If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}\".format(\n                    next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],\n                )\n\n            # need to add best num_beams hypotheses to generated hyps\n            for beam_id in range(num_beams):\n                effective_beam_id = batch_idx * num_beams + beam_id\n                final_score = beam_scores[effective_beam_id].item()\n                final_tokens = input_ids[effective_beam_id]\n                generated_hyps[batch_idx].add(final_tokens, final_score)\n\n        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch\n        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences\n        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences\n\n        # select the best hypotheses\n        sent_lengths = input_ids.new(output_batch_size)\n        best = []\n\n        # retrieve best hypotheses\n        for i, hypotheses in enumerate(generated_hyps):\n            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])\n            for j in range(output_num_return_sequences_per_batch):\n                effective_batch_idx = output_num_return_sequences_per_batch * i + j\n                best_hyp = sorted_hyps.pop()[1]\n                
sent_lengths[effective_batch_idx] = len(best_hyp)\n                best.append(best_hyp)\n\n        # shorter batches are filled with pad_token\n        if sent_lengths.min().item() != sent_lengths.max().item():\n            assert pad_token_id is not None, \"`Pad_token_id` has to be defined\"\n            sent_max_len = min(sent_lengths.max().item() + 1, max_length)\n            decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)\n\n            # fill with hypothesis and eos_token_id if necessary\n            for i, hypo in enumerate(best):\n                decoded[i, : sent_lengths[i]] = hypo\n                if sent_lengths[i] < max_length:\n                    decoded[i, sent_lengths[i]] = eos_token_id\n        else:\n            # none of the hypotheses have an eos_token\n            assert (len(hypo) == max_length for hypo in best)\n            decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)\n\n        return decoded\n\n    @staticmethod\n    def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:\n        return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)\n\n\ndef calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:\n    \"\"\"Copied from fairseq for no_repeat_ngram in beam_search\"\"\"\n    if cur_len + 1 < no_repeat_ngram_size:\n        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet\n        return [[] for _ in range(num_hypos)]\n    generated_ngrams = [{} for _ in range(num_hypos)]\n    for idx in range(num_hypos):\n        gen_tokens = prev_input_ids[idx].tolist()\n        generated_ngram = generated_ngrams[idx]\n        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):\n            prev_ngram_tuple = tuple(ngram[:-1])\n            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]\n\n    def _get_generated_ngrams(hypo_idx):\n        # Before decoding the next token, prevent decoding of ngrams that have already appeared\n        start_idx = cur_len + 1 - no_repeat_ngram_size\n        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())\n        return generated_ngrams[hypo_idx].get(ngram_idx, [])\n\n    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]\n    return banned_tokens\n\n\ndef calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:\n    banned_tokens = []\n\n    def _tokens_match(prev_tokens, tokens):\n        if len(tokens) == 0:\n            # if bad word tokens is just one token always ban it\n            return True\n        if len(tokens) > len(prev_input_ids):\n            # if bad word tokens are longer then prev input_ids they can't be equal\n            return False\n\n        if prev_tokens[-len(tokens) :] == tokens:\n            # if tokens match\n            return True\n        else:\n            return False\n\n    for prev_input_ids_slice in prev_input_ids:\n        banned_tokens_slice = []\n\n        for banned_token_seq in bad_words_ids:\n            assert len(banned_token_seq) > 0, \"Banned words token sequences {} cannot have an empty list\".format(\n                bad_words_ids\n            )\n\n            if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:\n                # if tokens do not match continue\n                continue\n\n            
banned_tokens_slice.append(banned_token_seq[-1])\n\n        banned_tokens.append(banned_tokens_slice)\n\n    return banned_tokens\n\n\ndef top_k_top_p_filtering(\n    logits: Tensor,\n    top_k: int = 0,\n    top_p: float = 1.0,\n    filter_value: float = -float(\"Inf\"),\n    min_tokens_to_keep: int = 1,\n) -> Tensor:\n    \"\"\" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering\n        Args:\n            logits: logits distribution shape (batch size, vocabulary size)\n            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).\n            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).\n                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)\n            Make sure we keep at least min_tokens_to_keep per batch example in the output\n        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317\n    \"\"\"\n    if top_k > 0:\n        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check\n        # Remove all tokens with a probability less than the last token of the top-k\n        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]\n        logits[indices_to_remove] = filter_value\n\n    if top_p < 1.0:\n        sorted_logits, sorted_indices = torch.sort(logits, descending=True)\n        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\n\n        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)\n        sorted_indices_to_remove = cumulative_probs > top_p\n        if min_tokens_to_keep > 1:\n            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)\n            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0\n        # Shift the indices to the right to keep also the first token above the threshold\n        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()\n        sorted_indices_to_remove[..., 0] = 0\n\n        # scatter sorted tensors to original indexing\n        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)\n        logits[indices_to_remove] = filter_value\n    return logits\n\n\nclass BeamHypotheses(object):\n    def __init__(self, num_beams, max_length, length_penalty, early_stopping):\n        \"\"\"\n        Initialize n-best list of hypotheses.\n        \"\"\"\n        self.max_length = max_length - 1  # ignoring bos_token\n        self.length_penalty = length_penalty\n        self.early_stopping = early_stopping\n        self.num_beams = num_beams\n        self.beams = []\n        self.worst_score = 1e9\n\n    def __len__(self):\n        \"\"\"\n        Number of hypotheses in the list.\n        \"\"\"\n        return len(self.beams)\n\n    def add(self, hyp, sum_logprobs):\n        \"\"\"\n        Add a new hypothesis to the list.\n        \"\"\"\n        score = sum_logprobs / len(hyp) ** self.length_penalty\n        if len(self) < self.num_beams or score > self.worst_score:\n            self.beams.append((score, hyp))\n            if len(self) > self.num_beams:\n                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])\n                del self.beams[sorted_scores[0][1]]\n                self.worst_score = sorted_scores[1][0]\n            else:\n                self.worst_score = min(score, self.worst_score)\n\n    def is_done(self, 
best_sum_logprobs, cur_len=None):\n        \"\"\"\n        If there are enough hypotheses and that none of the hypotheses being generated\n        can become better than the worst one in the heap, then we are done with this sentence.\n        \"\"\"\n\n        if len(self) < self.num_beams:\n            return False\n        elif self.early_stopping:\n            return True\n        else:\n            if cur_len is None:\n                cur_len = self.max_length\n            cur_score = best_sum_logprobs / cur_len ** self.length_penalty\n            ret = self.worst_score >= cur_score\n            return ret\n\n\nclass Conv1D(nn.Module):\n    def __init__(self, nf, nx):\n        \"\"\" Conv1D layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)\n            Basically works like a Linear layer but the weights are transposed\n        \"\"\"\n        super().__init__()\n        self.nf = nf\n        w = torch.empty(nx, nf)\n        nn.init.normal_(w, std=0.02)\n        self.weight = nn.Parameter(w)\n        self.bias = nn.Parameter(torch.zeros(nf))\n\n    def forward(self, x):\n        size_out = x.size()[:-1] + (self.nf,)\n        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)\n        x = x.view(*size_out)\n        return x\n\n\nclass PoolerStartLogits(nn.Module):\n    \"\"\" Compute SQuAD start_logits from sequence hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, p_mask=None):\n        \"\"\" Args:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape `(batch_size, seq_len)`\n                invalid position mask such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        x = self.dense(hidden_states).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerEndLogits(nn.Module):\n    \"\"\" Compute SQuAD end_logits from sequence hidden states and start token hidden state.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        self.dense_1 = nn.Linear(config.hidden_size, 1)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, p_mask=None):\n        \"\"\" Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to hidden_states\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span:\n            **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n                Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n                1.0 means token should be masked.\n        \"\"\"\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of 
start_states, start_positions should be not None\"\n        if start_positions is not None:\n            slen, hsz = hidden_states.shape[-2:]\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions)  # shape (bsz, 1, hsz)\n            start_states = start_states.expand(-1, slen, -1)  # shape (bsz, slen, hsz)\n\n        x = self.dense_0(torch.cat([hidden_states, start_states], dim=-1))\n        x = self.activation(x)\n        x = self.LayerNorm(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        if p_mask is not None:\n            if next(self.parameters()).dtype == torch.float16:\n                x = x * (1 - p_mask) - 65500 * p_mask\n            else:\n                x = x * (1 - p_mask) - 1e30 * p_mask\n\n        return x\n\n\nclass PoolerAnswerClass(nn.Module):\n    \"\"\" Compute SQuAD 2.0 answer class from classification and start tokens hidden states. \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.dense_0 = nn.Linear(config.hidden_size * 2, config.hidden_size)\n        self.activation = nn.Tanh()\n        self.dense_1 = nn.Linear(config.hidden_size, 1, bias=False)\n\n    def forward(self, hidden_states, start_states=None, start_positions=None, cls_index=None):\n        \"\"\"\n        Args:\n            One of ``start_states``, ``start_positions`` should be not None.\n            If both are set, ``start_positions`` overrides ``start_states``.\n\n            **start_states**: ``torch.LongTensor`` of shape identical to ``hidden_states``.\n                hidden states of the first tokens for the labeled span.\n            **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n                position of the first token for the labeled span.\n            **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n                position of the CLS token. 
If None, take the last token.\n\n            note(Original repo):\n                no dependency on end_feature so that we can obtain one single `cls_logits`\n                for each sample\n        \"\"\"\n        hsz = hidden_states.shape[-1]\n        assert (\n            start_states is not None or start_positions is not None\n        ), \"One of start_states, start_positions should be not None\"\n        if start_positions is not None:\n            start_positions = start_positions[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            start_states = hidden_states.gather(-2, start_positions).squeeze(-2)  # shape (bsz, hsz)\n\n        if cls_index is not None:\n            cls_index = cls_index[:, None, None].expand(-1, -1, hsz)  # shape (bsz, 1, hsz)\n            cls_token_state = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, hsz)\n        else:\n            cls_token_state = hidden_states[:, -1, :]  # shape (bsz, hsz)\n\n        x = self.dense_0(torch.cat([start_states, cls_token_state], dim=-1))\n        x = self.activation(x)\n        x = self.dense_1(x).squeeze(-1)\n\n        return x\n\n\nclass SQuADHead(nn.Module):\n    r\"\"\" A SQuAD head inspired by XLNet.\n\n    Parameters:\n        config (:class:`~transformers.XLNetConfig`): Model configuration class with all the parameters of the model.\n\n    Inputs:\n        **hidden_states**: ``torch.FloatTensor`` of shape ``(batch_size, seq_len, hidden_size)``\n            hidden states of sequence tokens\n        **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the first token for the labeled span.\n        **end_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            position of the last token for the labeled span.\n        **cls_index**: torch.LongTensor of shape ``(batch_size,)``\n            position of the CLS token. 
If None, take the last token.\n        **is_impossible**: ``torch.LongTensor`` of shape ``(batch_size,)``\n            Whether the question has a possible answer in the paragraph or not.\n        **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``\n            Mask of invalid position such as query and special symbols (PAD, SEP, CLS)\n            1.0 means token should be masked.\n\n    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:\n        **loss**: (`optional`, returned if both ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        **start_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        **start_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        **end_top_log_probs**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **end_top_index**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)\n            ``torch.FloatTensor`` of shape ``(batch_size,)``\n            Log probabilities for the ``is_impossible`` label of the answers.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n    def forward(\n        self, hidden_states, start_positions=None, end_positions=None, cls_index=None, is_impossible=None, p_mask=None,\n    ):\n        outputs = ()\n\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = 
loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n                total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\"blh,bl->bh\", hidden_states, start_log_probs)\n            cls_logits = self.answer_class(hidden_states, start_states=start_states, cls_index=cls_index)\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits,) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n\n\nclass SequenceSummary(nn.Module):\n    r\"\"\" Compute a single vector summary of a sequence hidden states according to various possibilities:\n        Args of the config class:\n            summary_type:\n                - 'last' => [default] take the last token hidden state (like XLNet)\n                - 'first' => take the first token hidden state (like Bert)\n                - 'mean' => take the mean of all tokens hidden states\n                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)\n                - 'attn' => Not implemented now, use multi-head attention\n            summary_use_proj: Add a projection after the vector extraction\n            summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to 
hidden_size). Default: False.\n            summary_activation: 'tanh' or another string => add an activation to the output, Other => no activation. Default\n            summary_first_dropout: Add a dropout before the projection and activation\n            summary_last_dropout: Add a dropout after the projection and activation\n    \"\"\"\n\n    def __init__(self, config: PretrainedConfig):\n        super().__init__()\n\n        self.summary_type = getattr(config, \"summary_type\", \"last\")\n        if self.summary_type == \"attn\":\n            # We should use a standard multi-head attention module with absolute positional embedding for that.\n            # Cf. https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276\n            # We can probably just use the multi-head attention module of PyTorch >=1.1.0\n            raise NotImplementedError\n\n        self.summary = Identity()\n        if hasattr(config, \"summary_use_proj\") and config.summary_use_proj:\n            if hasattr(config, \"summary_proj_to_labels\") and config.summary_proj_to_labels and config.num_labels > 0:\n                num_classes = config.num_labels\n            else:\n                num_classes = config.hidden_size\n            self.summary = nn.Linear(config.hidden_size, num_classes)\n\n        activation_string = getattr(config, \"summary_activation\", None)\n        self.activation: Callable = (get_activation(activation_string) if activation_string else Identity())\n\n        self.first_dropout = Identity()\n        if hasattr(config, \"summary_first_dropout\") and config.summary_first_dropout > 0:\n            self.first_dropout = nn.Dropout(config.summary_first_dropout)\n\n        self.last_dropout = Identity()\n        if hasattr(config, \"summary_last_dropout\") and config.summary_last_dropout > 0:\n            self.last_dropout = nn.Dropout(config.summary_last_dropout)\n\n    def forward(self, hidden_states, cls_index=None):\n        \"\"\" hidden_states: float Tensor in shape [bsz, ..., seq_len, hidden_size], the hidden-states of the last layer.\n            cls_index: [optional] position of the classification token if summary_type == 'cls_index',\n                shape (bsz,) or more generally (bsz, ...) where ... 
are optional leading dimensions of hidden_states.\n                if summary_type == 'cls_index' and cls_index is None:\n                    we take the last token of the sequence as classification token\n        \"\"\"\n        if self.summary_type == \"last\":\n            output = hidden_states[:, -1]\n        elif self.summary_type == \"first\":\n            output = hidden_states[:, 0]\n        elif self.summary_type == \"mean\":\n            output = hidden_states.mean(dim=1)\n        elif self.summary_type == \"cls_index\":\n            if cls_index is None:\n                cls_index = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2] - 1, dtype=torch.long,)\n            else:\n                cls_index = cls_index.unsqueeze(-1).unsqueeze(-1)\n                cls_index = cls_index.expand((-1,) * (cls_index.dim() - 1) + (hidden_states.size(-1),))\n            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states\n            output = hidden_states.gather(-2, cls_index).squeeze(-2)  # shape (bsz, XX, hidden_size)\n        elif self.summary_type == \"attn\":\n            raise NotImplementedError\n\n        output = self.first_dropout(output)\n        output = self.summary(output)\n        output = self.activation(output)\n        output = self.last_dropout(output)\n\n        return output\n\n\ndef create_position_ids_from_input_ids(input_ids, padding_idx):\n    \"\"\" Replace non-padding symbols with their position numbers. Position numbers begin at\n    padding_idx+1. Padding symbols are ignored. This is modified from fairseq's\n    `utils.make_positions`.\n\n    :param torch.Tensor x:\n    :return torch.Tensor:\n    \"\"\"\n    # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.\n    mask = input_ids.ne(padding_idx).int()\n    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask\n    return incremental_indices.long() + padding_idx\n\n\ndef prune_linear_layer(layer, index, dim=0):\n    \"\"\" Prune a linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if layer.bias is not None:\n        if dim == 1:\n            b = layer.bias.clone().detach()\n        else:\n            b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    if layer.bias is not None:\n        new_layer.bias.requires_grad = False\n        new_layer.bias.copy_(b.contiguous())\n        new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_conv1d_layer(layer, index, dim=1):\n    \"\"\" Prune a Conv1D layer (a model parameters) to keep only entries in index.\n        A Conv1D work as a Linear layer (see e.g. 
BERT) but the weights are transposed.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    index = index.to(layer.weight.device)\n    W = layer.weight.index_select(dim, index).clone().detach()\n    if dim == 0:\n        b = layer.bias.clone().detach()\n    else:\n        b = layer.bias[index].clone().detach()\n    new_size = list(layer.weight.size())\n    new_size[dim] = len(index)\n    new_layer = Conv1D(new_size[1], new_size[0]).to(layer.weight.device)\n    new_layer.weight.requires_grad = False\n    new_layer.weight.copy_(W.contiguous())\n    new_layer.weight.requires_grad = True\n    new_layer.bias.requires_grad = False\n    new_layer.bias.copy_(b.contiguous())\n    new_layer.bias.requires_grad = True\n    return new_layer\n\n\ndef prune_layer(layer, index, dim=None):\n    \"\"\" Prune a Conv1D or nn.Linear layer (a model parameters) to keep only entries in index.\n        Return the pruned layer as a new layer with requires_grad=True.\n        Used to remove heads.\n    \"\"\"\n    if isinstance(layer, nn.Linear):\n        return prune_linear_layer(layer, index, dim=0 if dim is None else dim)\n    elif isinstance(layer, Conv1D):\n        return prune_conv1d_layer(layer, index, dim=1 if dim is None else dim)\n    else:\n        raise ValueError(\"Can't prune layer of class {}\".format(layer.__class__))\n\n\ndef apply_chunking_to_forward(\n    chunk_size: int, chunk_dim: int, forward_fn: Callable[..., torch.Tensor], *input_tensors\n) -> torch.Tensor:\n    \"\"\"\n    This function chunks the `input_tensors` into smaller input tensor parts of size `chunk_size` over the dimension `chunk_dim`.\n    It then applies a layer `forward_fn` to each chunk independently to save memory.\n    If the `forward_fn` is independent across the `chunk_dim` this function will yield the\n    same result as not applying it.\n\n    Args:\n        chunk_size: int - the chunk size of a chunked tensor. 
`num_chunks` = `len(input_tensors[0]) / chunk_size`\n        chunk_dim: int - the dimension over which the input_tensors should be chunked\n        forward_fn: fn - the forward fn of the model\n        input_tensors: tuple(torch.Tensor) - the input tensors of `forward_fn` which are chunked\n    Returns:\n        a Tensor with the same shape the forward_fn would have given if applied\n\n\n    Examples::\n\n        # rename the usual forward() fn to forward_chunk()\n        def forward_chunk(self, hidden_states):\n            hidden_states = self.decoder(hidden_states)\n            return hidden_states\n\n        # implement a chunked forward function\n        def forward(self, hidden_states):\n            return apply_chunking_to_forward(self.chunk_size_lm_head, self.seq_len_dim, self.forward_chunk, hidden_states)\n    \"\"\"\n\n    assert len(input_tensors) > 0, \"{} has to be a tuple/list of tensors\".format(input_tensors)\n    tensor_shape = input_tensors[0].shape\n    assert all(\n        input_tensor.shape == tensor_shape for input_tensor in input_tensors\n    ), \"All input tensors have to be of the same shape\"\n\n    # inspect.signature exists since Python 3.5 and is a Python method -> no problem with backward compatibility\n    num_args_in_forward_chunk_fn = len(inspect.signature(forward_fn).parameters)\n    assert num_args_in_forward_chunk_fn == len(\n        input_tensors\n    ), \"forward_chunk_fn expects {} arguments, but only {} input tensors are given\".format(\n        num_args_in_forward_chunk_fn, len(input_tensors)\n    )\n\n    if chunk_size > 0:\n        assert (\n            input_tensors[0].shape[chunk_dim] % chunk_size == 0\n        ), \"The dimension to be chunked {} has to be a multiple of the chunk size {}\".format(\n            input_tensors[0].shape[chunk_dim], chunk_size\n        )\n\n        num_chunks = input_tensors[0].shape[chunk_dim] // chunk_size\n\n        # chunk input tensor into tuples\n        input_tensors_chunks = tuple(input_tensor.chunk(num_chunks, dim=chunk_dim) for input_tensor in input_tensors)\n        # apply forward fn to every tuple\n        output_chunks = tuple(forward_fn(*input_tensors_chunk) for input_tensors_chunk in zip(*input_tensors_chunks))\n        # concatenate output at same dimension\n        return torch.cat(output_chunks, dim=chunk_dim)\n\n    return forward_fn(*input_tensors)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019-present, Facebook, Inc and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLM model.\n\"\"\"\n\n\nimport itertools\nimport logging\nimport math\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu\nfrom .configuration_xlm import XLMConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PreTrainedModel, SequenceSummary, SQuADHead, prune_linear_layer\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-mlm-en-2048\",\n    \"xlm-mlm-ende-1024\",\n    \"xlm-mlm-enfr-1024\",\n    \"xlm-mlm-enro-1024\",\n    \"xlm-mlm-tlm-xnli15-1024\",\n    \"xlm-mlm-xnli15-1024\",\n    \"xlm-clm-enfr-1024\",\n    \"xlm-clm-ende-1024\",\n    \"xlm-mlm-17-1280\",\n    \"xlm-mlm-100-1280\",\n    # See all XLM models at https://huggingface.co/models?filter=xlm\n]\n\n\ndef create_sinusoidal_embeddings(n_pos, dim, out):\n    position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)])\n    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))\n    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))\n    out.detach_()\n    out.requires_grad = False\n\n\ndef get_masks(slen, lengths, causal, padding_mask=None):\n    \"\"\"\n    Generate hidden states mask, and optionally an attention mask.\n    \"\"\"\n    alen = torch.arange(slen, dtype=torch.long, device=lengths.device)\n    if padding_mask is not None:\n        mask = padding_mask\n    else:\n        assert lengths.max().item() <= slen\n        mask = alen < lengths[:, None]\n\n    # attention mask is the same as mask, or triangular inferior attention (causal)\n    bs = lengths.size(0)\n    if causal:\n        attn_mask = alen[None, None, :].repeat(bs, slen, 1) <= alen[None, :, None]\n    else:\n        attn_mask = mask\n\n    # sanity check\n    assert mask.size() == (bs, slen)\n    assert causal is False or attn_mask.size() == (bs, slen, slen)\n\n    return mask, attn_mask\n\n\nclass MultiHeadAttention(nn.Module):\n\n    NEW_ID = itertools.count()\n\n    def __init__(self, n_heads, dim, config):\n        super().__init__()\n        self.layer_id = next(MultiHeadAttention.NEW_ID)\n        self.output_attentions = config.output_attentions\n        self.dim = dim\n        self.n_heads = n_heads\n        self.dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0\n\n        self.q_lin = nn.Linear(dim, dim)\n        self.k_lin = nn.Linear(dim, dim)\n        self.v_lin = nn.Linear(dim, dim)\n        self.out_lin = nn.Linear(dim, dim)\n        self.pruned_heads = set()\n\n    def prune_heads(self, heads):\n        attention_head_size = self.dim // self.n_heads\n        if len(heads) == 0:\n            return\n        mask = torch.ones(self.n_heads, 
attention_head_size)\n        heads = set(heads) - self.pruned_heads\n        for head in heads:\n            head -= sum(1 if h < head else 0 for h in self.pruned_heads)\n            mask[head] = 0\n        mask = mask.view(-1).contiguous().eq(1)\n        index = torch.arange(len(mask))[mask].long()\n        # Prune linear layers\n        self.q_lin = prune_linear_layer(self.q_lin, index)\n        self.k_lin = prune_linear_layer(self.k_lin, index)\n        self.v_lin = prune_linear_layer(self.v_lin, index)\n        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)\n        # Update hyper params\n        self.n_heads = self.n_heads - len(heads)\n        self.dim = attention_head_size * self.n_heads\n        self.pruned_heads = self.pruned_heads.union(heads)\n\n    def forward(self, input, mask, kv=None, cache=None, head_mask=None):\n        \"\"\"\n        Self-attention (if kv is None) or attention over source sentence (provided by kv).\n        \"\"\"\n        # Input is (bs, qlen, dim)\n        # Mask is (bs, klen) (non-causal) or (bs, klen, klen)\n        bs, qlen, dim = input.size()\n        if kv is None:\n            klen = qlen if cache is None else cache[\"slen\"] + qlen\n        else:\n            klen = kv.size(1)\n        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)\n        n_heads = self.n_heads\n        dim_per_head = self.dim // n_heads\n        mask_reshape = (bs, 1, qlen, klen) if mask.dim() == 3 else (bs, 1, 1, klen)\n\n        def shape(x):\n            \"\"\"  projection \"\"\"\n            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)\n\n        def unshape(x):\n            \"\"\"  compute context \"\"\"\n            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)\n\n        q = shape(self.q_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        if kv is None:\n            k = shape(self.k_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(input))  # (bs, n_heads, qlen, dim_per_head)\n        elif cache is None or self.layer_id not in cache:\n            k = v = kv\n            k = shape(self.k_lin(k))  # (bs, n_heads, qlen, dim_per_head)\n            v = shape(self.v_lin(v))  # (bs, n_heads, qlen, dim_per_head)\n\n        if cache is not None:\n            if self.layer_id in cache:\n                if kv is None:\n                    k_, v_ = cache[self.layer_id]\n                    k = torch.cat([k_, k], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                    v = torch.cat([v_, v], dim=2)  # (bs, n_heads, klen, dim_per_head)\n                else:\n                    k, v = cache[self.layer_id]\n            cache[self.layer_id] = (k, v)\n\n        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)\n        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)\n        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)\n        scores.masked_fill_(mask, -float(\"inf\"))  # (bs, n_heads, qlen, klen)\n\n        weights = F.softmax(scores.float(), dim=-1).type_as(scores)  # (bs, n_heads, qlen, klen)\n        weights = F.dropout(weights, p=self.dropout, training=self.training)  # (bs, n_heads, qlen, klen)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            weights = weights * head_mask\n\n        context = torch.matmul(weights, v)  # (bs, n_heads, qlen, dim_per_head)\n        context = unshape(context)  # 
(bs, qlen, dim)\n\n        outputs = (self.out_lin(context),)\n        if self.output_attentions:\n            outputs = outputs + (weights,)\n        return outputs\n\n\nclass TransformerFFN(nn.Module):\n    def __init__(self, in_dim, dim_hidden, out_dim, config):\n        super().__init__()\n        self.dropout = config.dropout\n        self.lin1 = nn.Linear(in_dim, dim_hidden)\n        self.lin2 = nn.Linear(dim_hidden, out_dim)\n        self.act = gelu if config.gelu_activation else F.relu\n\n    def forward(self, input):\n        x = self.lin1(input)\n        x = self.act(x)\n        x = self.lin2(x)\n        x = F.dropout(x, p=self.dropout, training=self.training)\n        return x\n\n\nclass XLMPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLMConfig\n    load_tf_weights = None\n    base_model_prefix = \"transformer\"\n\n    def __init__(self, *inputs, **kwargs):\n        super().__init__(*inputs, **kwargs)\n\n    @property\n    def dummy_inputs(self):\n        inputs_list = torch.tensor([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n        attns_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        if self.config.use_lang_emb and self.config.n_langs > 1:\n            langs_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])\n        else:\n            langs_list = None\n        return {\"input_ids\": inputs_list, \"attention_mask\": attns_list, \"langs\": langs_list}\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights. \"\"\"\n        if isinstance(module, nn.Embedding):\n            if self.config is not None and self.config.embed_init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.embed_init_std)\n        if isinstance(module, nn.Linear):\n            if self.config is not None and self.config.init_std is not None:\n                nn.init.normal_(module.weight, mean=0, std=self.config.init_std)\n                if hasattr(module, \"bias\") and module.bias is not None:\n                    nn.init.constant_(module.bias, 0.0)\n        if isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n\nXLM_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLM_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? 
<../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        langs (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            A parallel sequence of tokens to be used to indicate the language of each token in the input.\n            Indices are languages ids which can be obtained from the language names by using two conversion mappings\n            provided in the configuration of the model (only provided for multilingual models).\n            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and\n            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).\n\n            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Indices of positions of each input sequence tokens in the position embeddings.\n            Selected in the range ``[0, config.max_position_embeddings - 1]``.\n\n            `What are position IDs? <../glossary.html#position-ids>`_\n        lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Length of each sentence that can be used to avoid performing attention on padding token indices.\n            You can also use `attention_mask` for the same result (see above), kept here for compatbility.\n            Indices selected in ``[0, ..., input_ids.size(-1)]``:\n        cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`, defaults to :obj:`None`):\n            dictionary with ``torch.FloatTensor`` that contains pre-computed\n            hidden-states (key and values in the attention blocks) as computed by the model\n            (see `cache` output below). 
Can be used to speed up sequential decoding.\n            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_START_DOCSTRING,\n)\nclass XLMModel(XLMPreTrainedModel):\n    def __init__(self, config):  # , dico, is_encoder, with_output):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        # encoder / decoder, output layer\n        self.is_encoder = config.is_encoder\n        self.is_decoder = not config.is_encoder\n        if self.is_decoder:\n            raise NotImplementedError(\"Currently XLM can only be used as an encoder\")\n        # self.with_output = with_output\n        self.causal = config.causal\n\n        # dictionary / languages\n        self.n_langs = config.n_langs\n        self.use_lang_emb = config.use_lang_emb\n        self.n_words = config.n_words\n        self.eos_index = config.eos_index\n        self.pad_index = config.pad_index\n        # self.dico = dico\n        # self.id2lang = config.id2lang\n        # self.lang2id = config.lang2id\n        # assert len(self.dico) == self.n_words\n        # assert len(self.id2lang) == len(self.lang2id) == self.n_langs\n\n        # model parameters\n        self.dim = config.emb_dim  # 512 by default\n        self.hidden_dim = self.dim * 4  # 2048 by default\n        self.n_heads = config.n_heads  # 8 by default\n        self.n_layers = config.n_layers\n        self.dropout = config.dropout\n        self.attention_dropout = config.attention_dropout\n        assert self.dim % self.n_heads == 0, \"transformer dim must be a multiple of n_heads\"\n\n        # embeddings\n        self.position_embeddings = nn.Embedding(config.max_position_embeddings, self.dim)\n        if config.sinusoidal_embeddings:\n            create_sinusoidal_embeddings(config.max_position_embeddings, self.dim, out=self.position_embeddings.weight)\n        if config.n_langs > 1 and config.use_lang_emb:\n            self.lang_embeddings = nn.Embedding(self.n_langs, self.dim)\n        self.embeddings = nn.Embedding(self.n_words, self.dim, padding_idx=self.pad_index)\n        self.layer_norm_emb = nn.LayerNorm(self.dim, eps=config.layer_norm_eps)\n\n        # transformer layers\n        self.attentions = nn.ModuleList()\n        self.layer_norm1 = nn.ModuleList()\n        self.ffns = nn.ModuleList()\n        self.layer_norm2 = nn.ModuleList()\n        # if self.is_decoder:\n        #     self.layer_norm15 = nn.ModuleList()\n        #     
self.encoder_attn = nn.ModuleList()\n\n        for _ in range(self.n_layers):\n            self.attentions.append(MultiHeadAttention(self.n_heads, self.dim, config=config))\n            self.layer_norm1.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            # if self.is_decoder:\n            #     self.layer_norm15.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n            #     self.encoder_attn.append(MultiHeadAttention(self.n_heads, self.dim, dropout=self.attention_dropout))\n            self.ffns.append(TransformerFFN(self.dim, self.hidden_dim, self.dim, config=config))\n            self.layer_norm2.append(nn.LayerNorm(self.dim, eps=config.layer_norm_eps))\n\n        if hasattr(config, \"pruned_heads\"):\n            pruned_heads = config.pruned_heads.copy().items()\n            config.pruned_heads = {}\n            for layer, heads in pruned_heads:\n                if self.attentions[int(layer)].n_heads == config.n_heads:\n                    self.prune_heads({int(layer): list(map(int, heads))})\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.embeddings\n\n    def set_input_embeddings(self, new_embeddings):\n        self.embeddings = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\" Prunes heads of the model.\n            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}\n            See base class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.attentions[layer].prune_heads(heads)\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):\n            Sequence of hidden-states at the output of the last layer of the model.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids)\n        
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        if input_ids is not None:\n            bs, slen = input_ids.size()\n        else:\n            bs, slen = inputs_embeds.size()[:-1]\n\n        if lengths is None:\n            if input_ids is not None:\n                lengths = (input_ids != self.pad_index).sum(dim=1).long()\n            else:\n                lengths = torch.LongTensor([slen] * bs)\n        # mask = input_ids != self.pad_index\n\n        # check inputs\n        assert lengths.size(0) == bs\n        assert lengths.max().item() <= slen\n        # input_ids = input_ids.transpose(0, 1)  # batch size as dimension 0\n        # assert (src_enc is None) == (src_len is None)\n        # if src_enc is not None:\n        #     assert self.is_decoder\n        #     assert src_enc.size(0) == bs\n\n        # generate masks\n        mask, attn_mask = get_masks(slen, lengths, self.causal, padding_mask=attention_mask)\n        # if self.is_decoder and src_enc is not None:\n        #     src_mask = torch.arange(src_len.max(), dtype=torch.long, device=lengths.device) < src_len[:, None]\n\n        device = input_ids.device if input_ids is not None else inputs_embeds.device\n\n        # position_ids\n        if position_ids is None:\n            position_ids = torch.arange(slen, dtype=torch.long, device=device)\n            position_ids = position_ids.unsqueeze(0).expand((bs, slen))\n        else:\n            assert position_ids.size() == (bs, slen)  # (slen, bs)\n            # position_ids = position_ids.transpose(0, 1)\n\n        # langs\n        if langs is not None:\n            assert langs.size() == (bs, slen)  # (slen, bs)\n            # langs = langs.transpose(0, 1)\n\n        # Prepare head mask if needed\n        head_mask = self.get_head_mask(head_mask, self.config.n_layers)\n\n        # do not recompute cached elements\n        if cache is not None and input_ids is not None:\n            _slen = slen - cache[\"slen\"]\n            input_ids = input_ids[:, -_slen:]\n            position_ids = position_ids[:, -_slen:]\n            if langs is not None:\n                langs = langs[:, -_slen:]\n            mask = mask[:, -_slen:]\n            attn_mask = attn_mask[:, -_slen:]\n\n        # embeddings\n        if inputs_embeds is None:\n            inputs_embeds = self.embeddings(input_ids)\n\n        tensor = inputs_embeds + self.position_embeddings(position_ids).expand_as(inputs_embeds)\n        if langs is not None and self.use_lang_emb and self.n_langs > 1:\n            tensor = tensor + self.lang_embeddings(langs)\n        if token_type_ids is not None:\n            tensor = tensor + self.embeddings(token_type_ids)\n        tensor = self.layer_norm_emb(tensor)\n        tensor = F.dropout(tensor, p=self.dropout, training=self.training)\n        tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # transformer layers\n        hidden_states = ()\n        attentions = ()\n        for i in range(self.n_layers):\n            if self.output_hidden_states:\n                hidden_states = hidden_states + (tensor,)\n\n            # self attention\n            attn_outputs = self.attentions[i](tensor, attn_mask, cache=cache, head_mask=head_mask[i])\n            attn = attn_outputs[0]\n            if self.output_attentions:\n                attentions = attentions + (attn_outputs[1],)\n            attn = F.dropout(attn, p=self.dropout, training=self.training)\n            tensor = tensor + attn\n            
tensor = self.layer_norm1[i](tensor)\n\n            # encoder attention (for decoder only)\n            # if self.is_decoder and src_enc is not None:\n            #     attn = self.encoder_attn[i](tensor, src_mask, kv=src_enc, cache=cache)\n            #     attn = F.dropout(attn, p=self.dropout, training=self.training)\n            #     tensor = tensor + attn\n            #     tensor = self.layer_norm15[i](tensor)\n\n            # FFN\n            tensor = tensor + self.ffns[i](tensor)\n            tensor = self.layer_norm2[i](tensor)\n            tensor *= mask.unsqueeze(-1).to(tensor.dtype)\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states = hidden_states + (tensor,)\n\n        # update cache length\n        if cache is not None:\n            cache[\"slen\"] += tensor.size(1)\n\n        # move back sequence length to dimension 0\n        # tensor = tensor.transpose(0, 1)\n\n        outputs = (tensor,)\n        if self.output_hidden_states:\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            outputs = outputs + (attentions,)\n        return outputs  # outputs, (hidden_states), (attentions)\n\n\nclass XLMPredLayer(nn.Module):\n    \"\"\"\n    Prediction layer (cross_entropy or adaptive_softmax).\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.asm = config.asm\n        self.n_words = config.n_words\n        self.pad_index = config.pad_index\n        dim = config.emb_dim\n\n        if config.asm is False:\n            self.proj = nn.Linear(dim, config.n_words, bias=True)\n        else:\n            self.proj = nn.AdaptiveLogSoftmaxWithLoss(\n                in_features=dim,\n                n_classes=config.n_words,\n                cutoffs=config.asm_cutoffs,\n                div_value=config.asm_div_value,\n                head_bias=True,  # default is False\n            )\n\n    def forward(self, x, y=None):\n        \"\"\" Compute the loss, and optionally the scores.\n        \"\"\"\n        outputs = ()\n        if self.asm is False:\n            scores = self.proj(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                loss = F.cross_entropy(scores.view(-1, self.n_words), y.view(-1), reduction=\"elementwise_mean\")\n                outputs = (loss,) + outputs\n        else:\n            scores = self.proj.log_prob(x)\n            outputs = (scores,) + outputs\n            if y is not None:\n                _, loss = self.proj(x, y)\n                outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"The XLM Model transformer with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMWithLMHeadModel(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.transformer = XLMModel(config)\n        self.pred_layer = XLMPredLayer(config)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.pred_layer.proj\n\n    def prepare_inputs_for_generation(self, input_ids, **kwargs):\n        mask_token_id = self.config.mask_token_id\n        lang_id = self.config.lang_id\n\n        effective_batch_size = input_ids.shape[0]\n        mask_token = torch.full((effective_batch_size, 1), mask_token_id, dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, mask_token], dim=1)\n        if lang_id is not None:\n            langs = torch.full_like(input_ids, lang_id)\n        else:\n            langs = None\n        return {\"input_ids\": input_ids, \"langs\": langs}\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for language modeling.\n            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored (masked), the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMWithLMHeadModel\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0) 
 # Batch size 1\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        outputs = self.pred_layer(output, labels)\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForSequenceClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.sequence_summary = SequenceSummary(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.\n            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),\n            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 
import XLMTokenizer, XLMForSequenceClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n        logits = self.sequence_summary(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnsweringSimple(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnsweringSimple.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = 
torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        sequence_output = transformer_outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (\n            start_logits,\n            end_logits,\n        )\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForQuestionAnswering(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLMModel(config)\n        self.qa_outputs = SQuADHead(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        lengths=None,\n        cache=None,\n        head_mask=None,\n        inputs_embeds=None,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 mean token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForQuestionAnswering\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')\n        model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            
position_ids=position_ids,\n            lengths=lengths,\n            cache=cache,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n        )\n\n        output = transformer_outputs[0]\n\n        outputs = self.qa_outputs(\n            output,\n            start_positions=start_positions,\n            end_positions=end_positions,\n            cls_index=cls_index,\n            is_impossible=is_impossible,\n            p_mask=p_mask,\n        )\n\n        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here\n\n        return outputs\n\n\n@add_start_docstrings(\n    \"\"\"XLM Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_START_DOCSTRING,\n)\nclass XLMForTokenClassification(XLMPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLMModel(config)\n        self.dropout = nn.Dropout(config.dropout)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        langs=None,\n        token_type_ids=None,\n        position_ids=None,\n        head_mask=None,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the token classification loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLMConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :\n            Classification loss.\n        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)\n            Classification scores (before SoftMax).\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLMTokenizer, XLMForTokenClassification\n        import torch\n\n        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-100-1280')\n        model = XLMForTokenClassification.from_pretrained('xlm-mlm-100-1280')\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n 
       outputs = model(input_ids, labels=labels)\n        loss, scores = outputs[:2]\n\n        \"\"\"\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            langs=langs,\n            token_type_ids=token_type_ids,\n            position_ids=position_ids,\n            head_mask=head_mask,\n        )\n\n        sequence_output = outputs[0]\n\n        sequence_output = self.dropout(sequence_output)\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # (loss), scores, (hidden_states), (attentions)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch XLM-RoBERTa model. \"\"\"\n\n\nimport logging\n\nfrom .configuration_xlm_roberta import XLMRobertaConfig\nfrom .file_utils import add_start_docstrings\nfrom .modeling_roberta import (\n    RobertaForMaskedLM,\n    RobertaForMultipleChoice,\n    RobertaForSequenceClassification,\n    RobertaForTokenClassification,\n    RobertaModel,\n)\n\n\nlogger = logging.getLogger(__name__)\n\nXLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlm-roberta-base\",\n    \"xlm-roberta-large\",\n    \"xlm-roberta-large-finetuned-conll02-dutch\",\n    \"xlm-roberta-large-finetuned-conll02-spanish\",\n    \"xlm-roberta-large-finetuned-conll03-english\",\n    \"xlm-roberta-large-finetuned-conll03-german\",\n    # See all XLM-RoBERTa models at https://huggingface.co/models?filter=xlm-roberta\n]\n\n\nXLM_ROBERTA_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLMRobertaConfig`): Model configuration class with all the parameters of the\n            model. Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaModel(RobertaModel):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaModel`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a `language modeling` head on top. \"\"\", XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMaskedLM(RobertaForMaskedLM):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMaskedLM`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer\n    on top of the pooled output) e.g. for GLUE tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForSequenceClassification(RobertaForSequenceClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForSequenceClassification`. 
Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForMultipleChoice(RobertaForMultipleChoice):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForMultipleChoice`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n\n\n@add_start_docstrings(\n    \"\"\"XLM-RoBERTa Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. \"\"\",\n    XLM_ROBERTA_START_DOCSTRING,\n)\nclass XLMRobertaForTokenClassification(RobertaForTokenClassification):\n    \"\"\"\n    This class overrides :class:`~transformers1.RobertaForTokenClassification`. Please check the\n    superclass for the appropriate documentation alongside usage examples.\n    \"\"\"\n\n    config_class = XLMRobertaConfig\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/modeling_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" PyTorch XLNet model.\n\"\"\"\n\n\nimport logging\n\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss, MSELoss\nfrom torch.nn import functional as F\n\nfrom .activations import gelu_new, swish\nfrom .configuration_xlnet import XLNetConfig\nfrom .file_utils import add_start_docstrings, add_start_docstrings_to_callable\nfrom .modeling_utils import PoolerAnswerClass, PoolerEndLogits, PoolerStartLogits, PreTrainedModel, SequenceSummary\n\n\nlogger = logging.getLogger(__name__)\n\nXLNET_PRETRAINED_MODEL_ARCHIVE_LIST = [\n    \"xlnet-base-cased\",\n    \"xlnet-large-cased\",\n    # See all XLNet models at https://huggingface.co/models?filter=xlnet\n]\n\n\ndef build_tf_xlnet_to_pytorch_map(model, config, tf_weights=None):\n    \"\"\" A map of modules from TF to PyTorch.\n        I use a map to keep the PyTorch model as\n        identical to the original PyTorch model as possible.\n    \"\"\"\n\n    tf_to_pt_map = {}\n\n    if hasattr(model, \"transformer\"):\n        if hasattr(model, \"lm_loss\"):\n            # We will load also the output bias\n            tf_to_pt_map[\"model/lm_loss/bias\"] = model.lm_loss.bias\n        if hasattr(model, \"sequence_summary\") and \"model/sequnece_summary/summary/kernel\" in tf_weights:\n            # We will load also the sequence summary\n            tf_to_pt_map[\"model/sequnece_summary/summary/kernel\"] = model.sequence_summary.summary.weight\n            tf_to_pt_map[\"model/sequnece_summary/summary/bias\"] = model.sequence_summary.summary.bias\n        if (\n            hasattr(model, \"logits_proj\")\n            and config.finetuning_task is not None\n            and \"model/regression_{}/logit/kernel\".format(config.finetuning_task) in tf_weights\n        ):\n            tf_to_pt_map[\"model/regression_{}/logit/kernel\".format(config.finetuning_task)] = model.logits_proj.weight\n            tf_to_pt_map[\"model/regression_{}/logit/bias\".format(config.finetuning_task)] = model.logits_proj.bias\n\n        # Now load the rest of the transformer\n        model = model.transformer\n\n    # Embeddings and output\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/word_embedding/lookup_table\": model.word_embedding.weight,\n            \"model/transformer/mask_emb/mask_emb\": model.mask_emb,\n        }\n    )\n\n    # Transformer blocks\n    for i, b in enumerate(model.layer):\n        layer_str = \"model/transformer/layer_%d/\" % i\n        tf_to_pt_map.update(\n            {\n                layer_str + \"rel_attn/LayerNorm/gamma\": b.rel_attn.layer_norm.weight,\n                layer_str + \"rel_attn/LayerNorm/beta\": b.rel_attn.layer_norm.bias,\n                layer_str + \"rel_attn/o/kernel\": b.rel_attn.o,\n                layer_str + 
\"rel_attn/q/kernel\": b.rel_attn.q,\n                layer_str + \"rel_attn/k/kernel\": b.rel_attn.k,\n                layer_str + \"rel_attn/r/kernel\": b.rel_attn.r,\n                layer_str + \"rel_attn/v/kernel\": b.rel_attn.v,\n                layer_str + \"ff/LayerNorm/gamma\": b.ff.layer_norm.weight,\n                layer_str + \"ff/LayerNorm/beta\": b.ff.layer_norm.bias,\n                layer_str + \"ff/layer_1/kernel\": b.ff.layer_1.weight,\n                layer_str + \"ff/layer_1/bias\": b.ff.layer_1.bias,\n                layer_str + \"ff/layer_2/kernel\": b.ff.layer_2.weight,\n                layer_str + \"ff/layer_2/bias\": b.ff.layer_2.bias,\n            }\n        )\n\n    # Relative positioning biases\n    if config.untie_r:\n        r_r_list = []\n        r_w_list = []\n        r_s_list = []\n        seg_embed_list = []\n        for b in model.layer:\n            r_r_list.append(b.rel_attn.r_r_bias)\n            r_w_list.append(b.rel_attn.r_w_bias)\n            r_s_list.append(b.rel_attn.r_s_bias)\n            seg_embed_list.append(b.rel_attn.seg_embed)\n    else:\n        r_r_list = [model.r_r_bias]\n        r_w_list = [model.r_w_bias]\n        r_s_list = [model.r_s_bias]\n        seg_embed_list = [model.seg_embed]\n    tf_to_pt_map.update(\n        {\n            \"model/transformer/r_r_bias\": r_r_list,\n            \"model/transformer/r_w_bias\": r_w_list,\n            \"model/transformer/r_s_bias\": r_s_list,\n            \"model/transformer/seg_embed\": seg_embed_list,\n        }\n    )\n    return tf_to_pt_map\n\n\ndef load_tf_weights_in_xlnet(model, config, tf_path):\n    \"\"\" Load tf checkpoints in a pytorch model\n    \"\"\"\n    try:\n        import numpy as np\n        import tensorflow as tf\n    except ImportError:\n        logger.error(\n            \"Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. 
Please see \"\n            \"https://www.tensorflow.org/install/ for installation instructions.\"\n        )\n        raise\n    # Load weights from TF model\n    init_vars = tf.train.list_variables(tf_path)\n    tf_weights = {}\n    for name, shape in init_vars:\n        logger.info(\"Loading TF weight {} with shape {}\".format(name, shape))\n        array = tf.train.load_variable(tf_path, name)\n        tf_weights[name] = array\n\n    # Build TF to PyTorch weights loading map\n    tf_to_pt_map = build_tf_xlnet_to_pytorch_map(model, config, tf_weights)\n\n    for name, pointer in tf_to_pt_map.items():\n        logger.info(\"Importing {}\".format(name))\n        if name not in tf_weights:\n            logger.info(\"{} not in tf pre-trained weights, skipping\".format(name))\n            continue\n        array = tf_weights[name]\n        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v\n        # which are not required for using pretrained model\n        if \"kernel\" in name and (\"ff\" in name or \"summary\" in name or \"logit\" in name):\n            logger.info(\"Transposing\")\n            array = np.transpose(array)\n        if isinstance(pointer, list):\n            # Here we will split the TF weights\n            assert len(pointer) == array.shape[0]\n            for i, p_i in enumerate(pointer):\n                arr_i = array[i, ...]\n                try:\n                    assert p_i.shape == arr_i.shape\n                except AssertionError as e:\n                    e.args += (p_i.shape, arr_i.shape)\n                    raise\n                logger.info(\"Initialize PyTorch weight {} for layer {}\".format(name, i))\n                p_i.data = torch.from_numpy(arr_i)\n        else:\n            try:\n                assert pointer.shape == array.shape\n            except AssertionError as e:\n                e.args += (pointer.shape, array.shape)\n                raise\n            logger.info(\"Initialize PyTorch weight {}\".format(name))\n            pointer.data = torch.from_numpy(array)\n        tf_weights.pop(name, None)\n        tf_weights.pop(name + \"/Adam\", None)\n        tf_weights.pop(name + \"/Adam_1\", None)\n\n    logger.info(\"Weights not copied to PyTorch model: {}\".format(\", \".join(tf_weights.keys())))\n    return model\n\n\nACT2FN = {\"gelu\": gelu_new, \"relu\": torch.nn.functional.relu, \"swish\": swish}\n\n\nXLNetLayerNorm = nn.LayerNorm\n\n\nclass XLNetRelativeAttention(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.output_attentions = config.output_attentions\n\n        if config.d_model % config.n_head != 0:\n            raise ValueError(\n                \"The hidden size (%d) is not a multiple of the number of attention \"\n                \"heads (%d)\" % (config.d_model, config.n_head)\n            )\n\n        self.n_head = config.n_head\n        self.d_head = config.d_head\n        self.d_model = config.d_model\n        self.scale = 1 / (config.d_head ** 0.5)\n\n        self.q = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.k = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.v = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.o = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n        self.r = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))\n\n        self.r_r_bias = 
nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_s_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))\n        self.seg_embed = nn.Parameter(torch.FloatTensor(2, self.n_head, self.d_head))\n\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def prune_heads(self, heads):\n        raise NotImplementedError\n\n    @staticmethod\n    def rel_shift(x, klen=-1):\n        \"\"\"perform relative shift to form the relative attention score.\"\"\"\n        x_size = x.shape\n\n        x = x.reshape(x_size[1], x_size[0], x_size[2], x_size[3])\n        x = x[1:, ...]\n        x = x.reshape(x_size[0], x_size[1] - 1, x_size[2], x_size[3])\n        # x = x[:, 0:klen, :, :]\n        x = torch.index_select(x, 1, torch.arange(klen, device=x.device, dtype=torch.long))\n\n        return x\n\n    @staticmethod\n    def rel_shift_bnij(x, klen=-1):\n        x_size = x.shape\n\n        x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])\n        x = x[:, :, 1:, :]\n        x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3] - 1)\n        # Note: the tensor-slice form was faster in my testing than torch.index_select\n        #       However, tracing doesn't like the nature of the slice, and if klen changes\n        #       during the run then it'll fail, whereas index_select will be fine.\n        x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))\n        # x = x[:, :, :, :klen]\n\n        return x\n\n    def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):\n        \"\"\"Core relative positional attention operations.\"\"\"\n\n        # content based attention score\n        ac = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_w_bias, k_head_h)\n\n        # position based attention score\n        bd = torch.einsum(\"ibnd,jbnd->bnij\", q_head + self.r_r_bias, k_head_r)\n        bd = self.rel_shift_bnij(bd, klen=ac.shape[3])\n\n        # segment based attention score\n        if seg_mat is None:\n            ef = 0\n        else:\n            ef = torch.einsum(\"ibnd,snd->ibns\", q_head + self.r_s_bias, self.seg_embed)\n            ef = torch.einsum(\"ijbs,ibns->bnij\", seg_mat, ef)\n\n        # merge attention scores and perform masking\n        attn_score = (ac + bd + ef) * self.scale\n        if attn_mask is not None:\n            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask\n            if attn_mask.dtype == torch.float16:\n                attn_score = attn_score - 65500 * torch.einsum(\"ijbn->bnij\", attn_mask)\n            else:\n                attn_score = attn_score - 1e30 * torch.einsum(\"ijbn->bnij\", attn_mask)\n\n        # attention probability\n        attn_prob = F.softmax(attn_score, dim=3)\n        attn_prob = self.dropout(attn_prob)\n\n        # Mask heads if we want to\n        if head_mask is not None:\n            attn_prob = attn_prob * torch.einsum(\"ijbn->bnij\", head_mask)\n\n        # attention output\n        attn_vec = torch.einsum(\"bnij,jbnd->ibnd\", attn_prob, v_head_h)\n\n        if self.output_attentions:\n            return attn_vec, torch.einsum(\"bnij->ijbn\", attn_prob)\n\n        return attn_vec\n\n    def post_attention(self, h, attn_vec, residual=True):\n        \"\"\"Post-attention processing.\"\"\"\n        # post-attention projection (back to 
`d_model`)\n        attn_out = torch.einsum(\"ibnd,hnd->ibh\", attn_vec, self.o)\n\n        attn_out = self.dropout(attn_out)\n        if residual:\n            attn_out = attn_out + h\n        output = self.layer_norm(attn_out)\n\n        return output\n\n    def forward(self, h, g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None):\n        if g is not None:\n            # Two-stream attention with relative positional encoding.\n            # content based attention score\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content-based key head\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n\n            # content-based value head\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # position-based key head\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # h-stream\n            # content-stream query head\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n\n            # core attention ops\n            attn_vec_h = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n                attn_vec_h, attn_prob_h = attn_vec_h\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec_h)\n\n            # g-stream\n            # query-stream query head\n            q_head_g = torch.einsum(\"ibh,hnd->ibnd\", g, self.q)\n\n            # core attention ops\n            if target_mapping is not None:\n                q_head_g = torch.einsum(\"mbnd,mlb->lbnd\", q_head_g, target_mapping)\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n                attn_vec_g = torch.einsum(\"lbnd,mlb->mbnd\", attn_vec_g, target_mapping)\n            else:\n                attn_vec_g = self.rel_attn_core(\n                    q_head_g, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_g, head_mask=head_mask\n                )\n\n                if self.output_attentions:\n                    attn_vec_g, attn_prob_g = attn_vec_g\n\n            # post processing\n            output_g = self.post_attention(g, attn_vec_g)\n\n            if self.output_attentions:\n                attn_prob = attn_prob_h, attn_prob_g\n\n        else:\n            # Multi-head attention with relative positional encoding\n            if mems is not None and mems.dim() > 1:\n                cat = torch.cat([mems, h], dim=0)\n            else:\n                cat = h\n\n            # content heads\n            q_head_h = torch.einsum(\"ibh,hnd->ibnd\", h, self.q)\n            k_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.k)\n            v_head_h = torch.einsum(\"ibh,hnd->ibnd\", cat, self.v)\n\n            # positional heads\n            k_head_r = torch.einsum(\"ibh,hnd->ibnd\", r, self.r)\n\n            # core attention ops\n            attn_vec = self.rel_attn_core(\n                q_head_h, k_head_h, v_head_h, k_head_r, seg_mat=seg_mat, attn_mask=attn_mask_h, head_mask=head_mask\n            )\n\n            if self.output_attentions:\n              
  attn_vec, attn_prob = attn_vec\n\n            # post processing\n            output_h = self.post_attention(h, attn_vec)\n            output_g = None\n\n        outputs = (output_h, output_g)\n        if self.output_attentions:\n            outputs = outputs + (attn_prob,)\n        return outputs\n\n\nclass XLNetFeedForward(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)\n        self.layer_1 = nn.Linear(config.d_model, config.d_inner)\n        self.layer_2 = nn.Linear(config.d_inner, config.d_model)\n        self.dropout = nn.Dropout(config.dropout)\n        if isinstance(config.ff_activation, str):\n            self.activation_function = ACT2FN[config.ff_activation]\n        else:\n            self.activation_function = config.ff_activation\n\n    def forward(self, inp):\n        output = inp\n        output = self.layer_1(output)\n        output = self.activation_function(output)\n        output = self.dropout(output)\n        output = self.layer_2(output)\n        output = self.dropout(output)\n        output = self.layer_norm(output + inp)\n        return output\n\n\nclass XLNetLayer(nn.Module):\n    def __init__(self, config):\n        super().__init__()\n        self.rel_attn = XLNetRelativeAttention(config)\n        self.ff = XLNetFeedForward(config)\n        self.dropout = nn.Dropout(config.dropout)\n\n    def forward(\n        self, output_h, output_g, attn_mask_h, attn_mask_g, r, seg_mat, mems=None, target_mapping=None, head_mask=None\n    ):\n        outputs = self.rel_attn(\n            output_h,\n            output_g,\n            attn_mask_h,\n            attn_mask_g,\n            r,\n            seg_mat,\n            mems=mems,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n        )\n        output_h, output_g = outputs[:2]\n\n        if output_g is not None:\n            output_g = self.ff(output_g)\n        output_h = self.ff(output_h)\n\n        outputs = (output_h, output_g) + outputs[2:]  # Add again attentions if there are there\n        return outputs\n\n\nclass XLNetPreTrainedModel(PreTrainedModel):\n    \"\"\" An abstract class to handle weights initialization and\n        a simple interface for downloading and loading pretrained models.\n    \"\"\"\n\n    config_class = XLNetConfig\n    load_tf_weights = load_tf_weights_in_xlnet\n    base_model_prefix = \"transformer\"\n\n    def _init_weights(self, module):\n        \"\"\" Initialize the weights.\n        \"\"\"\n        if isinstance(module, (nn.Linear, nn.Embedding)):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if isinstance(module, nn.Linear) and module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, XLNetLayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        elif isinstance(module, XLNetRelativeAttention):\n            for param in [\n                module.q,\n                module.k,\n                module.v,\n                module.o,\n                module.r,\n                module.r_r_bias,\n                module.r_s_bias,\n                module.r_w_bias,\n                module.seg_embed,\n            ]:\n                param.data.normal_(mean=0.0, 
std=self.config.initializer_range)\n        elif isinstance(module, XLNetModel):\n            module.mask_emb.data.normal_(mean=0.0, std=self.config.initializer_range)\n\n\nXLNET_START_DOCSTRING = r\"\"\"\n\n    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.\n    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general\n    usage and behavior.\n\n    Parameters:\n        config (:class:`~transformers1.XLNetConfig`): Model configuration class with all the parameters of the model.\n            Initializing with a config file does not load the weights associated with the model, only the configuration.\n            Check out the :meth:`~transformers1.PreTrainedModel.from_pretrained` method to load the model weights.\n\"\"\"\n\nXLNET_INPUTS_DOCSTRING = r\"\"\"\n    Args:\n        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):\n            Indices of input sequence tokens in the vocabulary.\n\n            Indices can be obtained using :class:`transformers1.BertTokenizer`.\n            See :func:`transformers1.PreTrainedTokenizer.encode` and\n            :func:`transformers1.PreTrainedTokenizer.encode_plus` for details.\n\n            `What are input IDs? <../glossary.html#input-ids>`__\n        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.\n\n            `What are attention masks? <../glossary.html#attention-mask>`__\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model\n            (see `mems` output below). Can be used to speed up sequential decoding. 
The token ids which have their mems\n            given to this model should not be passed as input ids as they have already been computed.\n            `use_cache` has to be set to `True` to make use of `mems`.\n        perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:\n            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;\n            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.\n            If None, each token attends to all the others (full bidirectional attention).\n            Only used during pretraining (to define factorization order) or for sequential decoding (generation).\n        target_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`, defaults to :obj:`None`):\n            Mask to indicate the output tokens to use.\n            If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.\n            Only used during pretraining for partial prediction or for sequential decoding (generation).\n        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Segment token indices to indicate first and second portions of the inputs.\n            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``\n            corresponds to a `sentence B` token. The classifier token should be represented by a ``2``.\n\n            `What are token type IDs? <../glossary.html#token-type-ids>`_\n        input_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):\n            Mask to avoid performing attention on padding token indices.\n            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.\n            Kept for compatibility with the original code base.\n            You can only uses one of `input_mask` and `attention_mask`\n            Mask values selected in ``[0, 1]``:\n            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.\n        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):\n            Mask to nullify selected heads of the self-attention modules.\n            Mask values selected in ``[0, 1]``:\n            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.\n        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):\n            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.\n            This is useful if you want more control over how to convert `input_ids` indices into associated vectors\n            than the model's internal embedding lookup matrix.\n        use_cache (:obj:`bool`):\n            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). 
Defaults to `True`.\n\"\"\"\n\n\n@add_start_docstrings(\n    \"The bare XLNet Model transformer outputting raw hidden-states without any specific head on top.\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.output_attentions = config.output_attentions\n        self.output_hidden_states = config.output_hidden_states\n\n        self.mem_len = config.mem_len\n        self.reuse_len = config.reuse_len\n        self.d_model = config.d_model\n        self.same_length = config.same_length\n        self.attn_type = config.attn_type\n        self.bi_data = config.bi_data\n        self.clamp_len = config.clamp_len\n        self.n_layer = config.n_layer\n\n        self.word_embedding = nn.Embedding(config.vocab_size, config.d_model)\n        self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model))\n        self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)])\n        self.dropout = nn.Dropout(config.dropout)\n\n        self.init_weights()\n\n    def get_input_embeddings(self):\n        return self.word_embedding\n\n    def set_input_embeddings(self, new_embeddings):\n        self.word_embedding = new_embeddings\n\n    def _prune_heads(self, heads_to_prune):\n        raise NotImplementedError\n\n    def create_mask(self, qlen, mlen):\n        \"\"\"\n        Creates causal attention mask. Float mask where 1.0 indicates masked, 0.0 indicates not-masked.\n\n        Args:\n            qlen: Sequence length\n            mlen: Mask length\n\n        ::\n\n                  same_length=False:      same_length=True:\n                  <mlen > <  qlen >       <mlen > <  qlen >\n               ^ [0 0 0 0 0 1 1 1 1]     [0 0 0 0 0 1 1 1 1]\n                 [0 0 0 0 0 0 1 1 1]     [1 0 0 0 0 0 1 1 1]\n            qlen [0 0 0 0 0 0 0 1 1]     [1 1 0 0 0 0 0 1 1]\n                 [0 0 0 0 0 0 0 0 1]     [1 1 1 0 0 0 0 0 1]\n               v [0 0 0 0 0 0 0 0 0]     [1 1 1 1 0 0 0 0 0]\n\n        \"\"\"\n        attn_mask = torch.ones([qlen, qlen])\n        mask_up = torch.triu(attn_mask, diagonal=1)\n        attn_mask_pad = torch.zeros([qlen, mlen])\n        ret = torch.cat([attn_mask_pad, mask_up], dim=1)\n        if self.same_length:\n            mask_lo = torch.tril(attn_mask, diagonal=-1)\n            ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)\n\n        ret = ret.to(self.device)\n        return ret\n\n    def cache_mem(self, curr_out, prev_mem):\n        # cache hidden states into memory.\n        if self.reuse_len is not None and self.reuse_len > 0:\n            curr_out = curr_out[: self.reuse_len]\n\n        if prev_mem is None:\n            new_mem = curr_out[-self.mem_len :]\n        else:\n            new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len :]\n\n        return new_mem.detach()\n\n    @staticmethod\n    def positional_embedding(pos_seq, inv_freq, bsz=None):\n        sinusoid_inp = torch.einsum(\"i,d->id\", pos_seq, inv_freq)\n        pos_emb = torch.cat([torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)], dim=-1)\n        pos_emb = pos_emb[:, None, :]\n\n        if bsz is not None:\n            pos_emb = pos_emb.expand(-1, bsz, -1)\n\n        return pos_emb\n\n    def relative_positional_encoding(self, qlen, klen, bsz=None):\n        # create relative positional encoding.\n        freq_seq = torch.arange(0, self.d_model, 2.0, dtype=torch.float)\n        inv_freq = 1 / torch.pow(10000, (freq_seq / 
self.d_model))\n\n        if self.attn_type == \"bi\":\n            # beg, end = klen - 1, -qlen\n            beg, end = klen, -qlen\n        elif self.attn_type == \"uni\":\n            # beg, end = klen - 1, -1\n            beg, end = klen, -1\n        else:\n            raise ValueError(\"Unknown `attn_type` {}.\".format(self.attn_type))\n\n        if self.bi_data:\n            fwd_pos_seq = torch.arange(beg, end, -1.0, dtype=torch.float)\n            bwd_pos_seq = torch.arange(-beg, -end, 1.0, dtype=torch.float)\n\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n                bwd_pos_seq = bwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n\n            if bsz is not None:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz // 2)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq, bsz // 2)\n            else:\n                fwd_pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq)\n                bwd_pos_emb = self.positional_embedding(bwd_pos_seq, inv_freq)\n\n            pos_emb = torch.cat([fwd_pos_emb, bwd_pos_emb], dim=1)\n        else:\n            fwd_pos_seq = torch.arange(beg, end, -1.0)\n            if self.clamp_len > 0:\n                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)\n            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)\n\n        pos_emb = pos_emb.to(self.device)\n        return pos_emb\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n    ):\n        r\"\"\"\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, hidden_size)`):\n            Sequence of hidden-states at the last layer of the model.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `mems` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetModel.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=False)).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids)\n        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple\n\n        \"\"\"\n        # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end\n        # but we want a unified interface in the library with the batch size on the first dimension\n        # so we move here the first dimension (batch) to the end\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        elif input_ids is not None:\n            input_ids = input_ids.transpose(0, 1).contiguous()\n            qlen, bsz = input_ids.shape[0], input_ids.shape[1]\n        elif inputs_embeds is not None:\n            inputs_embeds = inputs_embeds.transpose(0, 1).contiguous()\n            qlen, bsz = inputs_embeds.shape[0], inputs_embeds.shape[1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        token_type_ids = token_type_ids.transpose(0, 1).contiguous() if token_type_ids is not None else None\n        input_mask = input_mask.transpose(0, 1).contiguous() if input_mask is not None else None\n        attention_mask = attention_mask.transpose(0, 1).contiguous() if attention_mask is not None else None\n        perm_mask = perm_mask.permute(1, 2, 0).contiguous() if perm_mask is not None else None\n        target_mapping = target_mapping.permute(1, 2, 0).contiguous() if target_mapping is not None else None\n\n        mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0\n        klen = mlen + qlen\n\n        dtype_float = self.dtype\n        device = self.device\n\n        # Attention mask\n        # causal attention mask\n        if self.attn_type == \"uni\":\n            attn_mask = self.create_mask(qlen, mlen)\n            attn_mask = attn_mask[:, :, None, None]\n        elif self.attn_type == \"bi\":\n            attn_mask = None\n        else:\n            raise ValueError(\"Unsupported attention type: {}\".format(self.attn_type))\n\n        # data mask: input mask & perm mask\n        
assert input_mask is None or attention_mask is None, \"You can only use one of input_mask (uses 1 for padding) \"\n        \"or attention_mask (uses 0 for padding, added for compatbility with BERT). Please choose one.\"\n        if input_mask is None and attention_mask is not None:\n            input_mask = 1.0 - attention_mask\n        if input_mask is not None and perm_mask is not None:\n            data_mask = input_mask[None] + perm_mask\n        elif input_mask is not None and perm_mask is None:\n            data_mask = input_mask[None]\n        elif input_mask is None and perm_mask is not None:\n            data_mask = perm_mask\n        else:\n            data_mask = None\n\n        if data_mask is not None:\n            # all mems can be attended to\n            if mlen > 0:\n                mems_mask = torch.zeros([data_mask.shape[0], mlen, bsz]).to(data_mask)\n                data_mask = torch.cat([mems_mask, data_mask], dim=1)\n            if attn_mask is None:\n                attn_mask = data_mask[:, :, :, None]\n            else:\n                attn_mask += data_mask[:, :, :, None]\n\n        if attn_mask is not None:\n            attn_mask = (attn_mask > 0).to(dtype_float)\n\n        if attn_mask is not None:\n            non_tgt_mask = -torch.eye(qlen).to(attn_mask)\n            if mlen > 0:\n                non_tgt_mask = torch.cat([torch.zeros([qlen, mlen]).to(attn_mask), non_tgt_mask], dim=-1)\n            non_tgt_mask = ((attn_mask + non_tgt_mask[:, :, None, None]) > 0).to(attn_mask)\n        else:\n            non_tgt_mask = None\n\n        # Word embeddings and prepare h & g hidden states\n        if inputs_embeds is not None:\n            word_emb_k = inputs_embeds\n        else:\n            word_emb_k = self.word_embedding(input_ids)\n        output_h = self.dropout(word_emb_k)\n        if target_mapping is not None:\n            word_emb_q = self.mask_emb.expand(target_mapping.shape[0], bsz, -1)\n            # else:  # We removed the inp_q input which was same as target mapping\n            #     inp_q_ext = inp_q[:, :, None]\n            #     word_emb_q = inp_q_ext * self.mask_emb + (1 - inp_q_ext) * word_emb_k\n            output_g = self.dropout(word_emb_q)\n        else:\n            output_g = None\n\n        # Segment embedding\n        if token_type_ids is not None:\n            # Convert `token_type_ids` to one-hot `seg_mat`\n            if mlen > 0:\n                mem_pad = torch.zeros([mlen, bsz], dtype=torch.long, device=device)\n                cat_ids = torch.cat([mem_pad, token_type_ids], dim=0)\n            else:\n                cat_ids = token_type_ids\n\n            # `1` indicates not in the same segment [qlen x klen x bsz]\n            seg_mat = (token_type_ids[:, None] != cat_ids[None, :]).long()\n            seg_mat = F.one_hot(seg_mat, num_classes=2).to(dtype_float)\n        else:\n            seg_mat = None\n\n        # Positional encoding\n        pos_emb = self.relative_positional_encoding(qlen, klen, bsz=bsz)\n        pos_emb = self.dropout(pos_emb)\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] (a head_mask for each layer)\n        # and head_mask is converted to shape [num_hidden_layers x qlen x klen x bsz x n_head]\n        if head_mask is not None:\n            if head_mask.dim() == 1:\n                head_mask = 
head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0)\n                head_mask = head_mask.expand(self.n_layer, -1, -1, -1, -1)\n            elif head_mask.dim() == 2:\n                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)\n            head_mask = head_mask.to(\n                dtype=next(self.parameters()).dtype\n            )  # switch to fload if need + fp16 compatibility\n        else:\n            head_mask = [None] * self.n_layer\n\n        new_mems = ()\n        if mems is None:\n            mems = [None] * len(self.layer)\n\n        attentions = []\n        hidden_states = []\n        for i, layer_module in enumerate(self.layer):\n            if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n                # cache new mems\n                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)\n            if self.output_hidden_states:\n                hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n            outputs = layer_module(\n                output_h,\n                output_g,\n                attn_mask_h=non_tgt_mask,\n                attn_mask_g=attn_mask,\n                r=pos_emb,\n                seg_mat=seg_mat,\n                mems=mems[i],\n                target_mapping=target_mapping,\n                head_mask=head_mask[i],\n            )\n            output_h, output_g = outputs[:2]\n            if self.output_attentions:\n                attentions.append(outputs[2])\n\n        # Add last hidden state\n        if self.output_hidden_states:\n            hidden_states.append((output_h, output_g) if output_g is not None else output_h)\n\n        output = self.dropout(output_g if output_g is not None else output_h)\n\n        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)\n        outputs = (output.permute(1, 0, 2).contiguous(),)\n\n        if self.mem_len is not None and self.mem_len > 0 and use_cache is True:\n            outputs = outputs + (new_mems,)\n\n        if self.output_hidden_states:\n            if output_g is not None:\n                hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)\n            else:\n                hidden_states = tuple(hs.permute(1, 0, 2).contiguous() for hs in hidden_states)\n            outputs = outputs + (hidden_states,)\n        if self.output_attentions:\n            if target_mapping is not None:\n                # when target_mapping is provided, there are 2-tuple of attentions\n                attentions = tuple(\n                    tuple(att_stream.permute(2, 3, 0, 1).contiguous() for att_stream in t) for t in attentions\n                )\n            else:\n                attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)\n            outputs = outputs + (attentions,)\n\n        return outputs  # outputs, (new_mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a language modeling head on top\n    (linear layer with weights tied to the input embeddings). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetLMHeadModel(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.attn_type = config.attn_type\n        self.same_length = config.same_length\n\n        self.transformer = XLNetModel(config)\n        self.lm_loss = nn.Linear(config.d_model, config.vocab_size, bias=True)\n\n        self.init_weights()\n\n    def get_output_embeddings(self):\n        return self.lm_loss\n\n    def prepare_inputs_for_generation(self, input_ids, past, **kwargs):\n        # Add dummy token at the end (no attention on this one)\n\n        effective_batch_size = input_ids.shape[0]\n        dummy_token = torch.zeros((effective_batch_size, 1), dtype=torch.long, device=input_ids.device)\n        input_ids = torch.cat([input_ids, dummy_token], dim=1)\n\n        # Build permutation mask so that previous tokens don't see last token\n        sequence_length = input_ids.shape[1]\n        perm_mask = torch.zeros(\n            (effective_batch_size, sequence_length, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        perm_mask[:, :, -1] = 1.0\n\n        # We'll only predict the last token\n        target_mapping = torch.zeros(\n            (effective_batch_size, 1, sequence_length), dtype=torch.float, device=input_ids.device\n        )\n        target_mapping[0, 0, -1] = 1.0\n\n        inputs = {\n            \"input_ids\": input_ids,\n            \"perm_mask\": perm_mask,\n            \"target_mapping\": target_mapping,\n            \"use_cache\": kwargs[\"use_cache\"],\n        }\n\n        # if past is defined in model kwargs then use it for faster decoding\n        if past:\n            inputs[\"mems\"] = past\n\n        return inputs\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`, defaults to :obj:`None`):\n            Labels for masked language modeling.\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n            The labels should correspond to the masked input words that should be predicted and depends on `target_mapping`. 
Note in order to perform standard auto-regressive language modeling a `<mask>` token has to be added to the `input_ids` (see `prepare_inputs_for_generation` fn and examples below)\n            Indices are selected in ``[-100, 0, ..., config.vocab_size]``\n            All labels set to ``-100`` are ignored, the loss is only\n            computed for labels in ``[0, ..., config.vocab_size]``\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape `(1,)`, `optional`, returned when ``labels`` is provided)\n            Language modeling loss.\n        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, config.vocab_size)`):\n            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).\n            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetLMHeadModel\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')\n\n        # We show how to setup inputs to predict a next token using a bi-directional context.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)\n        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        # The same way can the XLNetLMHeadModel be used to be 
trained by standard auto-regressive language modeling.\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is very <mask>\", add_special_tokens=False)).unsqueeze(0)  # We will predict the masked token\n        labels = torch.tensor(tokenizer.encode(\"cute\", add_special_tokens=False)).unsqueeze(0)\n        assert labels.shape[0] == 1, 'only one word will be predicted'\n        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)\n        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token as is done in standard auto-regressive lm training\n        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token\n        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)\n\n        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)\n        loss, next_token_logits = outputs[:2]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        logits = self.lm_loss(transformer_outputs[0])\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            # Flatten the tokens\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a sequence classification/regression head on top (a linear layer on top of\n    the pooled output) e.g. for GLUE tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForSequenceClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`)\n            Labels for computing the sequence classification/regression loss.\n            Indices should be in ``[0, ..., config.num_labels - 1]``.\n            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),\n            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification (or regression if config.num_labels==1) loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification (or regression if config.num_labels==1) scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForSequenceClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n        loss, logits = outputs[:2]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n\n        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            if self.num_labels == 1:\n                #  We are doing regression\n                loss_fct = MSELoss()\n                loss = loss_fct(logits.view(-1), labels.view(-1))\n            else:\n                loss_fct = CrossEntropyLoss()\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a token classification head on top (a linear layer on top of\n    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForTokenClassification(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.classifier = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Return:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        logits (:obj:`torch.FloatTensor` of shape :obj:(batch_size, config.num_labels)`):\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForTokenClassification\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')\n        model = XLNetForTokenClassification.from_pretrained('xlnet-large-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\")).unsqueeze(0)  # Batch size 1\n        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1\n        outputs = model(input_ids, labels=labels)\n\n        scores = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.classifier(sequence_output)\n\n        outputs = (logits,) + outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            # Only keep active parts of the loss\n            if attention_mask is not None:\n                active_loss = attention_mask.view(-1) == 1\n                active_logits = logits.view(-1, self.num_labels)\n                active_labels = torch.where(\n                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)\n                )\n                loss = loss_fct(active_logits, active_labels)\n            else:\n                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a multiple choice classification head on top (a linear layer on top of\n    the pooled output and a softmax) e.g. for RACE/SWAG tasks. 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForMultipleChoice(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n\n        self.transformer = XLNetModel(config)\n        self.sequence_summary = SequenceSummary(config)\n        self.logits_proj = nn.Linear(config.d_model, 1)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, num_choices, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        token_type_ids=None,\n        input_mask=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        labels=None,\n    ):\n        r\"\"\"\n        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for computing the multiple choice classification loss.\n            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension\n            of the input tensors. (see `input_ids` above)\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor`` of shape ``(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Classification loss.\n        classification_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_choices)`):\n            `num_choices` is the second dimension of the input tensors. (see `input_ids` above).\n\n            Classification scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForMultipleChoice\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForMultipleChoice.from_pretrained('xlnet-base-cased')\n\n        choices = [\"Hello, my dog is cute\", \"Hello, my cat is amazing\"]\n        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices\n        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1\n\n        outputs = model(input_ids, labels=labels)\n        loss, classification_scores = outputs[:2]\n\n        \"\"\"\n        num_choices = input_ids.shape[1]\n\n        flat_input_ids = input_ids.view(-1, input_ids.size(-1))\n        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None\n        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None\n        flat_input_mask = input_mask.view(-1, input_mask.size(-1)) if input_mask is not None else None\n\n        transformer_outputs = self.transformer(\n            flat_input_ids,\n            token_type_ids=flat_token_type_ids,\n            input_mask=flat_input_mask,\n            attention_mask=flat_attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        output = transformer_outputs[0]\n\n        output = self.sequence_summary(output)\n        logits = self.logits_proj(output)\n        reshaped_logits = logits.view(-1, num_choices)\n        outputs = (reshaped_logits,) + transformer_outputs[\n            1:\n        ]  # Keep mems, hidden states, attentions if there are in it\n\n        if labels is not None:\n            loss_fct = CrossEntropyLoss()\n            loss = loss_fct(reshaped_logits, labels.view(-1))\n            outputs = (loss,) + outputs\n\n        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to compute `span start logits` and `span end logits`). 
\"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n\n        self.transformer = XLNetModel(config)\n        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):\n            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.\n        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-start scores (before SoftMax).\n        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):\n            Span-end scores (before SoftMax).\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnsweringSimple\n        import torch\n\n        tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnsweringSimple.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n\n        outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n\n        sequence_output = outputs[0]\n\n        logits = self.qa_outputs(sequence_output)\n        start_logits, end_logits = logits.split(1, dim=-1)\n        start_logits = start_logits.squeeze(-1)\n        end_logits = end_logits.squeeze(-1)\n\n        outputs = (start_logits, end_logits,) + outputs[2:]\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, split add a dimension\n            if len(start_positions.size()) > 1:\n                start_positions = start_positions.squeeze(-1)\n            if len(end_positions.size()) > 1:\n                end_positions = end_positions.squeeze(-1)\n            # sometimes the start/end positions are outside our model inputs, we ignore these terms\n            ignored_index = start_logits.size(1)\n            start_positions.clamp_(0, ignored_index)\n            end_positions.clamp_(0, ignored_index)\n\n            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n            outputs = (total_loss,) + outputs\n\n        return outputs  # (loss), start_logits, end_logits, (mems), (hidden_states), (attentions)\n\n\n@add_start_docstrings(\n    \"\"\"XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of\n    the hidden-states output to 
compute `span start logits` and `span end logits`). \"\"\",\n    XLNET_START_DOCSTRING,\n)\nclass XLNetForQuestionAnswering(XLNetPreTrainedModel):\n    def __init__(self, config):\n        super().__init__(config)\n        self.start_n_top = config.start_n_top\n        self.end_n_top = config.end_n_top\n\n        self.transformer = XLNetModel(config)\n        self.start_logits = PoolerStartLogits(config)\n        self.end_logits = PoolerEndLogits(config)\n        self.answer_class = PoolerAnswerClass(config)\n\n        self.init_weights()\n\n    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format(\"(batch_size, sequence_length)\"))\n    def forward(\n        self,\n        input_ids=None,\n        attention_mask=None,\n        mems=None,\n        perm_mask=None,\n        target_mapping=None,\n        token_type_ids=None,\n        input_mask=None,\n        head_mask=None,\n        inputs_embeds=None,\n        use_cache=True,\n        start_positions=None,\n        end_positions=None,\n        is_impossible=None,\n        cls_index=None,\n        p_mask=None,\n    ):\n        r\"\"\"\n        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the start of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the end of the labelled span for computing the token classification loss.\n            Positions are clamped to the length of the sequence (`sequence_length`).\n            Position outside of the sequence are not taken into account for computing the loss.\n        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels whether a question has an answer or no answer (SQuAD 2.0)\n        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):\n            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.\n        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):\n            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).\n            1.0 means token should be masked. 
0.0 means token is not masked.\n\n    Returns:\n        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers1.XLNetConfig`) and inputs:\n        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):\n            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.\n        start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top config.start_n_top start token possibilities (beam-search).\n        start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top config.start_n_top start token possibilities (beam-search).\n        end_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        end_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top * config.end_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Indices for the top ``config.start_n_top * config.end_n_top`` end token possibilities (beam-search).\n        cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):\n            Log probabilities for the ``is_impossible`` label of the answers.\n        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):\n            Contains pre-computed hidden-states (key and values in the attention blocks).\n            Can be used (see `past` input) to speed up sequential decoding. 
The token ids which have their past given to this model\n            should not be passed as input ids as they have already been computed.\n        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)\n            of shape :obj:`(batch_size, sequence_length, hidden_size)`.\n\n            Hidden-states of the model at the output of each layer plus the initial embedding outputs.\n        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):\n            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape\n            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.\n\n            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n\n    Examples::\n\n        from transformers1 import XLNetTokenizer, XLNetForQuestionAnswering\n        import torch\n\n        tokenizer =  XLNetTokenizer.from_pretrained('xlnet-base-cased')\n        model = XLNetForQuestionAnswering.from_pretrained('xlnet-base-cased')\n\n        input_ids = torch.tensor(tokenizer.encode(\"Hello, my dog is cute\", add_special_tokens=True)).unsqueeze(0)  # Batch size 1\n        start_positions = torch.tensor([1])\n        end_positions = torch.tensor([3])\n        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)\n        loss = outputs[0]\n\n        \"\"\"\n        transformer_outputs = self.transformer(\n            input_ids,\n            attention_mask=attention_mask,\n            mems=mems,\n            perm_mask=perm_mask,\n            target_mapping=target_mapping,\n            token_type_ids=token_type_ids,\n            input_mask=input_mask,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            use_cache=use_cache,\n        )\n        hidden_states = transformer_outputs[0]\n        start_logits = self.start_logits(hidden_states, p_mask=p_mask)\n\n        outputs = transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it\n\n        if start_positions is not None and end_positions is not None:\n            # If we are on multi-GPU, let's remove the dimension added by batch splitting\n            for x in (start_positions, end_positions, cls_index, is_impossible):\n                if x is not None and x.dim() > 1:\n                    x.squeeze_(-1)\n\n            # during training, compute the end logits based on the ground truth of the start position\n            end_logits = self.end_logits(hidden_states, start_positions=start_positions, p_mask=p_mask)\n\n            loss_fct = CrossEntropyLoss()\n            start_loss = loss_fct(start_logits, start_positions)\n            end_loss = loss_fct(end_logits, end_positions)\n            total_loss = (start_loss + end_loss) / 2\n\n            if cls_index is not None and is_impossible is not None:\n                # Predict answerability from the representation of CLS and START\n                cls_logits = self.answer_class(hidden_states, start_positions=start_positions, cls_index=cls_index)\n                loss_fct_cls = nn.BCEWithLogitsLoss()\n                cls_loss = loss_fct_cls(cls_logits, is_impossible)\n\n                # note(zhiliny): by default multiply the loss by 0.5 so that the scale is comparable to start_loss and end_loss\n    
            total_loss += cls_loss * 0.5\n\n            outputs = (total_loss,) + outputs\n\n        else:\n            # during inference, compute the end logits based on beam search\n            bsz, slen, hsz = hidden_states.size()\n            start_log_probs = F.softmax(start_logits, dim=-1)  # shape (bsz, slen)\n\n            start_top_log_probs, start_top_index = torch.topk(\n                start_log_probs, self.start_n_top, dim=-1\n            )  # shape (bsz, start_n_top)\n            start_top_index_exp = start_top_index.unsqueeze(-1).expand(-1, -1, hsz)  # shape (bsz, start_n_top, hsz)\n            start_states = torch.gather(hidden_states, -2, start_top_index_exp)  # shape (bsz, start_n_top, hsz)\n            start_states = start_states.unsqueeze(1).expand(-1, slen, -1, -1)  # shape (bsz, slen, start_n_top, hsz)\n\n            hidden_states_expanded = hidden_states.unsqueeze(2).expand_as(\n                start_states\n            )  # shape (bsz, slen, start_n_top, hsz)\n            p_mask = p_mask.unsqueeze(-1) if p_mask is not None else None\n            end_logits = self.end_logits(hidden_states_expanded, start_states=start_states, p_mask=p_mask)\n            end_log_probs = F.softmax(end_logits, dim=1)  # shape (bsz, slen, start_n_top)\n\n            end_top_log_probs, end_top_index = torch.topk(\n                end_log_probs, self.end_n_top, dim=1\n            )  # shape (bsz, end_n_top, start_n_top)\n            end_top_log_probs = end_top_log_probs.view(-1, self.start_n_top * self.end_n_top)\n            end_top_index = end_top_index.view(-1, self.start_n_top * self.end_n_top)\n\n            start_states = torch.einsum(\n                \"blh,bl->bh\", hidden_states, start_log_probs\n            )  # get the representation of START as weighted sum of hidden states\n            cls_logits = self.answer_class(\n                hidden_states, start_states=start_states, cls_index=cls_index\n            )  # Shape (batch size,): one single `cls_logits` for each sample\n\n            outputs = (start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits) + outputs\n\n        # return start_top_log_probs, start_top_index, end_top_log_probs, end_top_index, cls_logits\n        # or (if labels are provided) (total_loss,)\n        return outputs\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/optimization.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"PyTorch optimization for BERT model.\"\"\"\n\nimport logging\nimport math\n\nimport torch\nfrom torch.optim import Optimizer\nfrom torch.optim.lr_scheduler import LambdaLR\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_constant_schedule(optimizer, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate.\n    \"\"\"\n    return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch)\n\n\ndef get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a constant learning rate preceded by a warmup\n    period during which the learning rate increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1.0, num_warmup_steps))\n        return 1.0\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch)\n\n\ndef get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases linearly after\n    linearly increasing during a warmup period.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        return max(\n            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))\n        )\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function between 0 and `pi * cycles` after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\ndef get_cosine_with_hard_restarts_schedule_with_warmup(\n    optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1\n):\n    \"\"\" Create a schedule with a learning rate that decreases following the\n    values of the cosine function with several hard restarts, after a warmup\n    period during which it increases linearly between 0 and 1.\n    \"\"\"\n\n    def lr_lambda(current_step):\n        if current_step < num_warmup_steps:\n            return float(current_step) / float(max(1, num_warmup_steps))\n        progress = 
float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))\n        if progress >= 1.0:\n            return 0.0\n        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((float(num_cycles) * progress) % 1.0))))\n\n    return LambdaLR(optimizer, lr_lambda, last_epoch)\n\n\nclass AdamW(Optimizer):\n    \"\"\" Implements Adam algorithm with weight decay fix.\n\n    Parameters:\n        lr (float): learning rate. Default 1e-3.\n        betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999)\n        eps (float): Adams epsilon. Default: 1e-6\n        weight_decay (float): Weight decay. Default: 0.0\n        correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.\n    \"\"\"\n\n    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True):\n        if lr < 0.0:\n            raise ValueError(\"Invalid learning rate: {} - should be >= 0.0\".format(lr))\n        if not 0.0 <= betas[0] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[0]))\n        if not 0.0 <= betas[1] < 1.0:\n            raise ValueError(\"Invalid beta parameter: {} - should be in [0.0, 1.0[\".format(betas[1]))\n        if not 0.0 <= eps:\n            raise ValueError(\"Invalid epsilon value: {} - should be >= 0.0\".format(eps))\n        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)\n        super().__init__(params, defaults)\n\n    def step(self, closure=None):\n        \"\"\"Performs a single optimization step.\n\n        Arguments:\n            closure (callable, optional): A closure that reevaluates the model\n                and returns the loss.\n        \"\"\"\n        loss = None\n        if closure is not None:\n            loss = closure()\n\n        for group in self.param_groups:\n            for p in group[\"params\"]:\n                if p.grad is None:\n                    continue\n                grad = p.grad.data\n                if grad.is_sparse:\n                    raise RuntimeError(\"Adam does not support sparse gradients, please consider SparseAdam instead\")\n\n                state = self.state[p]\n\n                # State initialization\n                if len(state) == 0:\n                    state[\"step\"] = 0\n                    # Exponential moving average of gradient values\n                    state[\"exp_avg\"] = torch.zeros_like(p.data)\n                    # Exponential moving average of squared gradient values\n                    state[\"exp_avg_sq\"] = torch.zeros_like(p.data)\n\n                exp_avg, exp_avg_sq = state[\"exp_avg\"], state[\"exp_avg_sq\"]\n                beta1, beta2 = group[\"betas\"]\n\n                state[\"step\"] += 1\n\n                # Decay the first and second moment running average coefficient\n                # In-place operations to update the averages at the same time\n                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)\n                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)\n                denom = exp_avg_sq.sqrt().add_(group[\"eps\"])\n\n                step_size = group[\"lr\"]\n                if group[\"correct_bias\"]:  # No bias correction for Bert\n                    bias_correction1 = 1.0 - beta1 ** state[\"step\"]\n                    bias_correction2 = 1.0 - beta2 ** state[\"step\"]\n                    step_size = 
step_size * math.sqrt(bias_correction2) / bias_correction1\n\n                p.data.addcdiv_(exp_avg, denom, value=-step_size)\n\n                # Just adding the square of the weights to the loss function is *not*\n                # the correct way of using L2 regularization/weight decay with Adam,\n                # since that will interact with the m and v parameters in strange ways.\n                #\n                # Instead we want to decay the weights in a manner that doesn't interact\n                # with the m/v parameters. This is equivalent to adding the square\n                # of the weights to the loss with plain (non-momentum) SGD.\n                # Add weight decay at the end (fixed version)\n                if group[\"weight_decay\"] > 0.0:\n                    p.data.add_(p.data, alpha=-group[\"lr\"] * group[\"weight_decay\"])\n\n        return loss\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/optimization_tf.py",
    "content": "# Copyright 2019 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"Functions and classes related to optimization (weight updates).\"\"\"\n\n\nimport re\n\nimport tensorflow as tf\n\n\nclass WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):\n    \"\"\"Applies a warmup schedule on a given learning rate decay schedule.\"\"\"\n\n    def __init__(\n        self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None,\n    ):\n        super().__init__()\n        self.initial_learning_rate = initial_learning_rate\n        self.warmup_steps = warmup_steps\n        self.power = power\n        self.decay_schedule_fn = decay_schedule_fn\n        self.name = name\n\n    def __call__(self, step):\n        with tf.name_scope(self.name or \"WarmUp\") as name:\n            # Implements polynomial warmup. i.e., if global_step < warmup_steps, the\n            # learning rate will be `global_step/num_warmup_steps * init_lr`.\n            global_step_float = tf.cast(step, tf.float32)\n            warmup_steps_float = tf.cast(self.warmup_steps, tf.float32)\n            warmup_percent_done = global_step_float / warmup_steps_float\n            warmup_learning_rate = self.initial_learning_rate * tf.math.pow(warmup_percent_done, self.power)\n            return tf.cond(\n                global_step_float < warmup_steps_float,\n                lambda: warmup_learning_rate,\n                lambda: self.decay_schedule_fn(step),\n                name=name,\n            )\n\n    def get_config(self):\n        return {\n            \"initial_learning_rate\": self.initial_learning_rate,\n            \"decay_schedule_fn\": self.decay_schedule_fn,\n            \"warmup_steps\": self.warmup_steps,\n            \"power\": self.power,\n            \"name\": self.name,\n        }\n\n\ndef create_optimizer(init_lr, num_train_steps, num_warmup_steps, end_lr=0.0, optimizer_type=\"adamw\"):\n    \"\"\"Creates an optimizer with learning rate schedule.\"\"\"\n    # Implements linear decay of the learning rate.\n    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n        initial_learning_rate=init_lr, decay_steps=num_train_steps, end_learning_rate=end_lr,\n    )\n    if num_warmup_steps:\n        lr_schedule = WarmUp(\n            initial_learning_rate=init_lr, decay_schedule_fn=lr_schedule, warmup_steps=num_warmup_steps,\n        )\n\n    optimizer = AdamWeightDecay(\n        learning_rate=lr_schedule,\n        weight_decay_rate=0.01,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-6,\n        exclude_from_weight_decay=[\"LayerNorm\", \"layer_norm\", \"bias\"],\n    )\n\n    return optimizer\n\n\nclass AdamWeightDecay(tf.keras.optimizers.Adam):\n    \"\"\"Adam enables L2 weight decay and clip_by_global_norm on gradients.\n  Just adding the square of the weights to the loss function is *not* the\n  correct way of using L2 
regularization/weight decay with Adam, since that will\n  interact with the m and v parameters in strange ways.\n  Instead we want ot decay the weights in a manner that doesn't interact with\n  the m/v parameters. This is equivalent to adding the square of the weights to\n  the loss with plain (non-momentum) SGD.\n  \"\"\"\n\n    def __init__(\n        self,\n        learning_rate=0.001,\n        beta_1=0.9,\n        beta_2=0.999,\n        epsilon=1e-7,\n        amsgrad=False,\n        weight_decay_rate=0.0,\n        include_in_weight_decay=None,\n        exclude_from_weight_decay=None,\n        name=\"AdamWeightDecay\",\n        **kwargs\n    ):\n        super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)\n        self.weight_decay_rate = weight_decay_rate\n        self._include_in_weight_decay = include_in_weight_decay\n        self._exclude_from_weight_decay = exclude_from_weight_decay\n\n    @classmethod\n    def from_config(cls, config):\n        \"\"\"Creates an optimizer from its config with WarmUp custom object.\"\"\"\n        custom_objects = {\"WarmUp\": WarmUp}\n        return super(AdamWeightDecay, cls).from_config(config, custom_objects=custom_objects)\n\n    def _prepare_local(self, var_device, var_dtype, apply_state):\n        super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, apply_state)\n        apply_state[(var_device, var_dtype)][\"weight_decay_rate\"] = tf.constant(\n            self.weight_decay_rate, name=\"adam_weight_decay_rate\"\n        )\n\n    def _decay_weights_op(self, var, learning_rate, apply_state):\n        do_decay = self._do_use_weight_decay(var.name)\n        if do_decay:\n            return var.assign_sub(\n                learning_rate * var * apply_state[(var.device, var.dtype.base_dtype)][\"weight_decay_rate\"],\n                use_locking=self._use_locking,\n            )\n        return tf.no_op()\n\n    def apply_gradients(self, grads_and_vars, name=None):\n        grads, tvars = list(zip(*grads_and_vars))\n        return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars), name=name,)\n\n    def _get_lr(self, var_device, var_dtype, apply_state):\n        \"\"\"Retrieves the learning rate with the given state.\"\"\"\n        if apply_state is None:\n            return self._decayed_lr_t[var_dtype], {}\n\n        apply_state = apply_state or {}\n        coefficients = apply_state.get((var_device, var_dtype))\n        if coefficients is None:\n            coefficients = self._fallback_apply_state(var_device, var_dtype)\n            apply_state[(var_device, var_dtype)] = coefficients\n\n        return coefficients[\"lr_t\"], dict(apply_state=apply_state)\n\n    def _resource_apply_dense(self, grad, var, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_dense(grad, var, **kwargs)\n\n    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):\n        lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)\n        decay = self._decay_weights_op(var, lr_t, apply_state)\n        with tf.control_dependencies([decay]):\n            return super(AdamWeightDecay, self)._resource_apply_sparse(grad, var, indices, **kwargs)\n\n    def get_config(self):\n        config = super().get_config()\n        config.update({\"weight_decay_rate\": 
self.weight_decay_rate})\n        return config\n\n    def _do_use_weight_decay(self, param_name):\n        \"\"\"Whether to use L2 weight decay for `param_name`.\"\"\"\n        if self.weight_decay_rate == 0:\n            return False\n\n        if self._include_in_weight_decay:\n            for r in self._include_in_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return True\n\n        if self._exclude_from_weight_decay:\n            for r in self._exclude_from_weight_decay:\n                if re.search(r, param_name) is not None:\n                    return False\n        return True\n\n\n# Extracted from https://github.com/OpenNMT/OpenNMT-tf/blob/master/opennmt/optimizers/utils.py\nclass GradientAccumulator(object):\n    \"\"\"Gradient accumulation utility.\n  When used with a distribution strategy, the accumulator should be called in a\n  replica context. Gradients will be accumulated locally on each replica and\n  without synchronization. Users should then call ``.gradients``, scale the\n  gradients if required, and pass the result to ``apply_gradients``.\n  \"\"\"\n\n    # We use the ON_READ synchronization policy so that no synchronization is\n    # performed on assignment. To get the value, we call .value() which returns the\n    # value on the current replica without synchronization.\n\n    def __init__(self):\n        \"\"\"Initializes the accumulator.\"\"\"\n        self._gradients = []\n        self._accum_steps = None\n\n    @property\n    def step(self):\n        \"\"\"Number of accumulated steps.\"\"\"\n        if self._accum_steps is None:\n            self._accum_steps = tf.Variable(\n                tf.constant(0, dtype=tf.int64),\n                trainable=False,\n                synchronization=tf.VariableSynchronization.ON_READ,\n                aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n            )\n\n        return self._accum_steps.value()\n\n    @property\n    def gradients(self):\n        \"\"\"The accumulated gradients on the current replica.\"\"\"\n        if not self._gradients:\n            raise ValueError(\"The accumulator should be called first to initialize the gradients\")\n        return list(gradient.value() if gradient is not None else gradient for gradient in self._gradients)\n\n    def __call__(self, gradients):\n        \"\"\"Accumulates :obj:`gradients` on the current replica.\"\"\"\n        if not self._gradients:\n            _ = self.step  # Create the step variable.\n            self._gradients.extend(\n                [\n                    tf.Variable(\n                        tf.zeros_like(gradient),\n                        trainable=False,\n                        synchronization=tf.VariableSynchronization.ON_READ,\n                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,\n                    )\n                    if gradient is not None\n                    else gradient\n                    for gradient in gradients\n                ]\n            )\n        if len(gradients) != len(self._gradients):\n            raise ValueError(\"Expected %s gradients, but got %d\" % (len(self._gradients), len(gradients)))\n\n        for accum_gradient, gradient in zip(self._gradients, gradients):\n            if accum_gradient is not None and gradient is not None:\n                accum_gradient.assign_add(gradient)\n\n        self._accum_steps.assign_add(1)\n\n    def reset(self):\n        \"\"\"Resets the accumulated gradients on the current replica.\"\"\"\n        
if not self._gradients:\n            return\n        self._accum_steps.assign(0)\n        for gradient in self._gradients:\n            if gradient is not None:\n                gradient.assign(tf.zeros_like(gradient))\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/pipelines.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n\nimport csv\nimport json\nimport logging\nimport os\nimport pickle\nimport sys\nfrom abc import ABC, abstractmethod\nfrom contextlib import contextmanager\nfrom itertools import chain\nfrom os.path import abspath, exists\nfrom typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union\n\nimport numpy as np\n\nfrom .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .data import SquadExample, squad_convert_examples_to_features\nfrom .file_utils import is_tf_available, is_torch_available\nfrom .modelcard import ModelCard\nfrom .tokenization_auto import AutoTokenizer\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nif is_tf_available():\n    import tensorflow as tf\n    from .modeling_tf_auto import (\n        TFAutoModel,\n        TFAutoModelForSequenceClassification,\n        TFAutoModelForQuestionAnswering,\n        TFAutoModelForTokenClassification,\n        TFAutoModelWithLMHead,\n    )\n\nif is_torch_available():\n    import torch\n    from .modeling_auto import (\n        AutoModel,\n        AutoModelForSequenceClassification,\n        AutoModelForQuestionAnswering,\n        AutoModelForTokenClassification,\n        AutoModelWithLMHead,\n    )\n\nif TYPE_CHECKING:\n    from .modeling_utils import PreTrainedModel\n    from .modeling_tf_utils import TFPreTrainedModel\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef get_framework(model=None):\n    \"\"\" Select framework (TensorFlow/PyTorch) to use.\n        If both frameworks are installed and no specific model is provided, defaults to using PyTorch.\n    \"\"\"\n    if is_tf_available() and is_torch_available() and model is not None and not isinstance(model, str):\n        # Both framework are available but the user supplied a model class instance.\n        # Try to guess which framework to use from the model classname\n        framework = \"tf\" if model.__class__.__name__.startswith(\"TF\") else \"pt\"\n    elif not is_tf_available() and not is_torch_available():\n        raise RuntimeError(\n            \"At least one of TensorFlow 2.0 or PyTorch should be installed. 
\"\n            \"To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ \"\n            \"To install PyTorch, read the instructions at https://pytorch.org/.\"\n        )\n    else:\n        # framework = 'tf' if is_tf_available() else 'pt'\n        framework = \"pt\" if is_torch_available() else \"tf\"\n    return framework\n\n\nclass ArgumentHandler(ABC):\n    \"\"\"\n    Base interface for handling varargs for each Pipeline\n    \"\"\"\n\n    @abstractmethod\n    def __call__(self, *args, **kwargs):\n        raise NotImplementedError()\n\n\nclass DefaultArgumentHandler(ArgumentHandler):\n    \"\"\"\n    Default varargs argument parser handling parameters for each Pipeline\n    \"\"\"\n\n    @staticmethod\n    def handle_kwargs(kwargs: Dict) -> List:\n        if len(kwargs) == 1:\n            output = list(kwargs.values())\n        else:\n            output = list(chain(kwargs.values()))\n\n        return DefaultArgumentHandler.handle_args(output)\n\n    @staticmethod\n    def handle_args(args: Sequence[Any]) -> List[str]:\n\n        # Only one argument, let's do case by case\n        if len(args) == 1:\n            if isinstance(args[0], str):\n                return [args[0]]\n            elif not isinstance(args[0], list):\n                return list(args)\n            else:\n                return args[0]\n\n        # Multiple arguments (x1, x2, ...)\n        elif len(args) > 1:\n            if all([isinstance(arg, str) for arg in args]):\n                return list(args)\n\n            # If not instance of list, then it should instance of iterable\n            elif isinstance(args, Iterable):\n                return list(chain.from_iterable(chain(args)))\n            else:\n                raise ValueError(\n                    \"Invalid input type {}. 
Pipeline supports Union[str, Iterable[str]]\".format(type(args))\n                )\n        else:\n            return []\n\n    def __call__(self, *args, **kwargs):\n        if len(kwargs) > 0 and len(args) > 0:\n            raise ValueError(\"Pipeline cannot handle mixed args and kwargs\")\n\n        if len(kwargs) > 0:\n            return DefaultArgumentHandler.handle_kwargs(kwargs)\n        else:\n            return DefaultArgumentHandler.handle_args(args)\n\n\nclass PipelineDataFormat:\n    \"\"\"\n    Base class for all the pipeline supported data format both for reading and writing.\n    Supported data formats currently includes:\n     - JSON\n     - CSV\n     - stdin/stdout (pipe)\n\n    PipelineDataFormat also includes some utilities to work with multi-columns like mapping from datasets columns\n    to pipelines keyword arguments through the `dataset_kwarg_1=dataset_column_1` format.\n    \"\"\"\n\n    SUPPORTED_FORMATS = [\"json\", \"csv\", \"pipe\"]\n\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        self.output_path = output_path\n        self.input_path = input_path\n        self.column = column.split(\",\") if column is not None else [\"\"]\n        self.is_multi_columns = len(self.column) > 1\n\n        if self.is_multi_columns:\n            self.column = [tuple(c.split(\"=\")) if \"=\" in c else (c, c) for c in self.column]\n\n        if output_path is not None and not overwrite:\n            if exists(abspath(self.output_path)):\n                raise OSError(\"{} already exists on disk\".format(self.output_path))\n\n        if input_path is not None:\n            if not exists(abspath(self.input_path)):\n                raise OSError(\"{} doesnt exist on disk\".format(self.input_path))\n\n    @abstractmethod\n    def __iter__(self):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def save(self, data: dict):\n        \"\"\"\n        Save the provided data object with the representation for the current `DataFormat`.\n        :param data: data to store\n        :return:\n        \"\"\"\n        raise NotImplementedError()\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        \"\"\"\n        Save the provided data object as a pickle-formatted binary data on the disk.\n        :param data: data to store\n        :return: (str) Path where the data has been saved\n        \"\"\"\n        path, _ = os.path.splitext(self.output_path)\n        binary_path = os.path.extsep.join((path, \"pickle\"))\n\n        with open(binary_path, \"wb+\") as f_output:\n            pickle.dump(data, f_output)\n\n        return binary_path\n\n    @staticmethod\n    def from_str(\n        format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        if format == \"json\":\n            return JsonPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"csv\":\n            return CsvPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        elif format == \"pipe\":\n            return PipedPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)\n        else:\n            raise KeyError(\"Unknown reader {} (Available reader are json/csv/pipe)\".format(format))\n\n\nclass CsvPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], 
overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n    def __iter__(self):\n        with open(self.input_path, \"r\") as f:\n            reader = csv.DictReader(f)\n            for row in reader:\n                if self.is_multi_columns:\n                    yield {k: row[c] for k, c in self.column}\n                else:\n                    yield row[self.column[0]]\n\n    def save(self, data: List[dict]):\n        with open(self.output_path, \"w\") as f:\n            if len(data) > 0:\n                writer = csv.DictWriter(f, list(data[0].keys()))\n                writer.writeheader()\n                writer.writerows(data)\n\n\nclass JsonPipelineDataFormat(PipelineDataFormat):\n    def __init__(\n        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,\n    ):\n        super().__init__(output_path, input_path, column, overwrite=overwrite)\n\n        with open(input_path, \"r\") as f:\n            self._entries = json.load(f)\n\n    def __iter__(self):\n        for entry in self._entries:\n            if self.is_multi_columns:\n                yield {k: entry[c] for k, c in self.column}\n            else:\n                yield entry[self.column[0]]\n\n    def save(self, data: dict):\n        with open(self.output_path, \"w\") as f:\n            json.dump(data, f)\n\n\nclass PipedPipelineDataFormat(PipelineDataFormat):\n    \"\"\"\n    Read data from piped input to the python process.\n    For multi columns data, columns should separated by \\t\n\n    If columns are provided, then the output will be a dictionary with {column_x: value_x}\n    \"\"\"\n\n    def __iter__(self):\n        for line in sys.stdin:\n            # Split for multi-columns\n            if \"\\t\" in line:\n\n                line = line.split(\"\\t\")\n                if self.column:\n                    # Dictionary to map arguments\n                    yield {kwargs: l for (kwargs, _), l in zip(self.column, line)}\n                else:\n                    yield tuple(line)\n\n            # No dictionary to map arguments\n            else:\n                yield line\n\n    def save(self, data: dict):\n        print(data)\n\n    def save_binary(self, data: Union[dict, List[dict]]) -> str:\n        if self.output_path is None:\n            raise KeyError(\n                \"When using piped input on pipeline outputting large object requires an output file path. \"\n                \"Please provide such output path through --output argument.\"\n            )\n\n        return super().save_binary(data)\n\n\nclass _ScikitCompat(ABC):\n    \"\"\"\n    Interface layer for the Scikit and Keras compatibility.\n    \"\"\"\n\n    @abstractmethod\n    def transform(self, X):\n        raise NotImplementedError()\n\n    @abstractmethod\n    def predict(self, X):\n        raise NotImplementedError()\n\n\nclass Pipeline(_ScikitCompat):\n    \"\"\"\n    The Pipeline class is the class from which all pipelines inherit. Refer to this class for methods shared across\n    different pipelines.\n\n    Base class implementing pipelined operations.\n    Pipeline workflow is defined as a sequence of the following operations:\n        Input -> Tokenization -> Model Inference -> Post-Processing (Task dependent) -> Output\n\n    Pipeline supports running on CPU or GPU through the device argument. 
Users can specify\n    device argument as an integer, -1 meaning \"CPU\", >= 0 referring to the CUDA device ordinal.\n\n    Some pipelines, like FeatureExtractionPipeline ('feature-extraction'), output large\n    tensor objects as nested-lists. In order to avoid dumping such large structures as textual data, we\n    provide the binary_output constructor argument. If set to True, the output will be stored in the\n    pickle format.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n        binary_output (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Flag indicating if the output of the pipeline should happen in a binary format (i.e. 
pickle) or as raw text.\n\n    Return:\n        :obj:`List` or :obj:`Dict`:\n        Pipeline returns list or dictionary depending on:\n\n         - Whether the user supplied multiple samples\n         - Whether the pipeline exposes multiple fields in the output object\n    \"\"\"\n\n    default_input_names = None\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        task: str = \"\",\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n    ):\n\n        if framework is None:\n            framework = get_framework()\n\n        self.model = model\n        self.tokenizer = tokenizer\n        self.modelcard = modelcard\n        self.framework = framework\n        self.device = device if framework == \"tf\" else torch.device(\"cpu\" if device < 0 else \"cuda:{}\".format(device))\n        self.binary_output = binary_output\n        self._args_parser = args_parser or DefaultArgumentHandler()\n\n        # Special handling\n        if self.framework == \"pt\" and self.device.type == \"cuda\":\n            self.model = self.model.to(self.device)\n\n        # Update config with task specific parameters\n        task_specific_params = self.model.config.task_specific_params\n        if task_specific_params is not None and task in task_specific_params:\n            self.model.config.update(task_specific_params.get(task))\n\n    def save_pretrained(self, save_directory):\n        \"\"\"\n        Save the pipeline's model and tokenizer to the specified save_directory\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Provided path ({}) should be a directory\".format(save_directory))\n            return\n\n        self.model.save_pretrained(save_directory)\n        self.tokenizer.save_pretrained(save_directory)\n        if self.modelcard is not None:\n            self.modelcard.save_pretrained(save_directory)\n\n    def transform(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    def predict(self, X):\n        \"\"\"\n        Scikit / Keras interface to transformers1' pipelines. 
This method will forward to __call__().\n        \"\"\"\n        return self(X=X)\n\n    @contextmanager\n    def device_placement(self):\n        \"\"\"\n        Context Manager allowing tensor allocation on the user-specified device in framework agnostic way.\n        example:\n            # Explicitly ask for tensor allocation on CUDA device :0\n            nlp = pipeline(..., device=0)\n            with nlp.device_placement():\n                # Every framework specific tensor allocation will be done on the request device\n                output = nlp(...)\n        Returns:\n            Context manager\n        \"\"\"\n        if self.framework == \"tf\":\n            with tf.device(\"/CPU:0\" if self.device == -1 else \"/device:GPU:{}\".format(self.device)):\n                yield\n        else:\n            if self.device.type == \"cuda\":\n                torch.cuda.set_device(self.device)\n\n            yield\n\n    def ensure_tensor_on_device(self, **inputs):\n        \"\"\"\n        Ensure PyTorch tensors are on the specified device.\n        :param inputs:\n        :return:\n        \"\"\"\n        return {name: tensor.to(self.device) for name, tensor in inputs.items()}\n\n    def _parse_and_tokenize(self, *args, pad_to_max_length=True, add_special_tokens=True, **kwargs):\n        \"\"\"\n        Parse arguments and tokenize\n        \"\"\"\n        # Parse arguments\n        inputs = self._args_parser(*args, **kwargs)\n        inputs = self.tokenizer.batch_encode_plus(\n            inputs,\n            add_special_tokens=add_special_tokens,\n            return_tensors=self.framework,\n            pad_to_max_length=pad_to_max_length,\n        )\n\n        return inputs\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        return self._forward(inputs)\n\n    def _forward(self, inputs, return_tensors=False):\n        \"\"\"\n        Internal framework specific forward dispatching.\n        Args:\n            inputs: dict holding all the keyworded arguments for required by the model forward method.\n            return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array.\n        Returns:\n            Numpy array\n        \"\"\"\n        # Encode for forward\n        with self.device_placement():\n            if self.framework == \"tf\":\n                # TODO trace model\n                predictions = self.model(inputs.data, training=False)[0]\n            else:\n                with torch.no_grad():\n                    inputs = self.ensure_tensor_on_device(**inputs)\n                    predictions = self.model(**inputs)[0].cpu()\n\n        if return_tensors:\n            return predictions\n        else:\n            return predictions.numpy()\n\n\nclass FeatureExtractionPipeline(Pipeline):\n    \"\"\"\n    Feature extraction pipeline using Model head. This pipeline extracts the hidden states from the base transformer,\n    which can be used as features in downstream tasks.\n\n    This feature extraction pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"feature-extraction\", for extracting features of a sequence.\n\n    All models may be used for this pipeline. 
See a list of all models, including community-contributed models on\n    `huggingface.co/models <https://huggingface.co/models>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n    def __call__(self, *args, **kwargs):\n        return super().__call__(*args, **kwargs).tolist()\n\n\nclass TextGenerationPipeline(Pipeline):\n    \"\"\"\n    Language generation pipeline using any ModelWithLMHead head. This pipeline predicts the words that will follow a specified text prompt.\n\n    This language generation pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"text-generation\", for generating text from a specified prompt.\n\n    The models that this pipeline can use are models that have been trained with an autoregressive language modeling objective,\n    which includes the uni-directional models in the library (e.g. 
gpt2).\n    See the list of available community models on\n    `huggingface.co/models <https://huggingface.co/models?search=&filter=lm-head>`__.\n    \"\"\"\n\n    # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia\n    # in https://github.com/rusiaaman/XLNet-gen#methodology\n    # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e\n\n    PADDING_TEXT = \"\"\"In 1991, the remains of Russian Tsar Nicholas II and his family\n    (except for Alexei and Maria) are discovered.\n    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the\n    remainder of the story. 1883 Western Siberia,\n    a young Grigori Rasputin is asked by his father and a group of men to perform magic.\n    Rasputin has a vision and denounces one of the men as a horse thief. Although his\n    father initially slaps him for making such an accusation, Rasputin watches as the\n    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of\n    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,\n    with people, even a bishop, begging for his blessing. <eod> </s> <eos>\"\"\"\n\n    ALLOWED_MODELS = [\n        \"XLNetLMHeadModel\",\n        \"TransfoXLLMHeadModel\",\n        \"ReformerModelWithLMHead\",\n        \"GPT2LMHeadModel\",\n        \"OpenAIGPTLMHeadModel\",\n        \"CTRLLMHeadModel\",\n        \"TFXLNetLMHeadModel\",\n        \"TFTransfoXLLMHeadModel\",\n        \"TFGPT2LMHeadModel\",\n        \"TFOpenAIGPTLMHeadModel\",\n        \"TFCTRLLMHeadModel\",\n    ]\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        if self.model.__class__.__name__ not in self.ALLOWED_MODELS:\n            raise NotImplementedError(\n                \"Generation is currently not supported for {}. Please select a model from {} for generation.\".format(\n                    self.model.__class__.__name__, self.ALLOWED_MODELS\n                )\n            )\n\n        text_inputs = self._args_parser(*args)\n\n        results = []\n        for prompt_text in text_inputs:\n            # Manage correct placement of the tensors\n            with self.device_placement():\n                if self.model.__class__.__name__ in [\"XLNetLMHeadModel\", \"TransfoXLLMHeadModel\"]:\n                    inputs = self._parse_and_tokenize(\n                        self.PADDING_TEXT + prompt_text, pad_to_max_length=False, add_special_tokens=False\n                    )\n                else:\n                    inputs = self._parse_and_tokenize(prompt_text, pad_to_max_length=False, add_special_tokens=False)\n\n                # set input_ids to None to allow empty prompt\n                if inputs[\"input_ids\"].shape[-1] == 0:\n                    inputs[\"input_ids\"] = None\n                    inputs[\"attention_mask\"] = None\n\n                if self.framework == \"pt\" and inputs[\"input_ids\"] is not None:\n                    inputs = self.ensure_tensor_on_device(**inputs)\n\n                input_ids = inputs[\"input_ids\"]\n\n                # Ensure that batch size = 1 (batch generation not allowed for now)\n                assert (\n                    input_ids is None or input_ids.shape[0] == 1\n                ), \"Batch generation is currently not supported. 
See https://github.com/huggingface/transformers/issues/3021 for more information.\"\n\n                output_sequences = self.model.generate(input_ids=input_ids, **generate_kwargs)  # BS x SL\n\n            result = []\n            for generated_sequence in output_sequences:\n                generated_sequence = generated_sequence.numpy().tolist()\n                record = {}\n                if return_tensors:\n                    record[\"generated_token_ids\"] = generated_sequence\n                if return_text:\n                    # Decode text\n                    text = self.tokenizer.decode(\n                        generated_sequence,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n\n                    # Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used\n                    if input_ids is None:\n                        prompt_length = 0\n                    else:\n                        prompt_length = len(\n                            self.tokenizer.decode(\n                                input_ids[0],\n                                skip_special_tokens=True,\n                                clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                            )\n                        )\n\n                    record[\"generated_text\"] = prompt_text + text[prompt_length:]\n\n                result.append(record)\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n\n        return results\n\n\nclass TextClassificationPipeline(Pipeline):\n    \"\"\"\n    Text classification pipeline using ModelForSequenceClassification head. See the\n    `sequence classification usage <../usage.html#sequence-classification>`__ examples for more information.\n\n    This text classification pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"sentiment-analysis\", for classifying sequences according to positive or negative sentiments.\n\n    The models that this pipeline can use are models that have been fine-tuned on a sequence classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=text-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. 
If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        outputs = super().__call__(*args, **kwargs)\n        scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)\n        return [{\"label\": self.model.config.id2label[item.argmax()], \"score\": item.max().item()} for item in scores]\n\n\nclass FillMaskPipeline(Pipeline):\n    \"\"\"\n    Masked language modeling prediction pipeline using ModelWithLMHead head. See the\n    `masked language modeling usage <../usage.html#masked-language-modeling>`__ examples for more information.\n\n    This mask filling pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"fill-mask\", for predicting masked tokens in a sequence.\n\n    The models that this pipeline can use are models that have been trained with a masked language modeling objective,\n    which includes the bi-directional models in the library.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=lm-head>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        topk=5,\n        task: str = \"\",\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=True,\n            task=task,\n        )\n\n        self.topk = topk\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._parse_and_tokenize(*args, **kwargs)\n        outputs = self._forward(inputs, return_tensors=True)\n\n        results = []\n        batch_size = outputs.shape[0] if self.framework == \"tf\" else outputs.size(0)\n\n        for i in range(batch_size):\n            input_ids = inputs[\"input_ids\"][i]\n            result = []\n\n            if self.framework == \"tf\":\n                masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy().item()\n                logits = outputs[i, masked_index, :]\n                probs = tf.nn.softmax(logits)\n                topk = tf.math.top_k(probs, k=self.topk)\n                values, predictions = topk.values.numpy(), topk.indices.numpy()\n            else:\n                masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()\n                logits = outputs[i, masked_index, :]\n                probs = logits.softmax(dim=0)\n                values, predictions = probs.topk(self.topk)\n\n            for v, p in zip(values.tolist(), predictions.tolist()):\n                tokens = input_ids.numpy()\n                tokens[masked_index] = p\n                # Filter padding out:\n                tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]\n                result.append({\"sequence\": self.tokenizer.decode(tokens), \"score\": v, \"token\": p})\n\n            # Append\n            results += [result]\n\n        if len(results) == 1:\n            return results[0]\n        return results\n\n\nclass NerPipeline(Pipeline):\n    \"\"\"\n    Named Entity Recognition pipeline using ModelForTokenClassification head. See the\n    `named entity recognition usage <../usage.html#named-entity-recognition>`__ examples for more information.\n\n    This token recognition pipeline can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"ner\", for predicting the classes of tokens in a sequence: person, organisation, location or miscellaneous.\n\n    The models that this pipeline can use are models that have been fine-tuned on a token classification task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=token-classification>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"sequences\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        args_parser: ArgumentHandler = None,\n        device: int = -1,\n        binary_output: bool = False,\n        ignore_labels=[\"O\"],\n        task: str = \"\",\n        grouped_entities: bool = False,\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=args_parser,\n            device=device,\n            binary_output=binary_output,\n            task=task,\n        )\n\n        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)\n        self.ignore_labels = ignore_labels\n        self.grouped_entities = grouped_entities\n\n    def __call__(self, *args, **kwargs):\n        inputs = self._args_parser(*args, **kwargs)\n        answers = []\n        for sentence in inputs:\n\n            # Manage correct placement of the tensors\n            with self.device_placement():\n\n                tokens = self.tokenizer.encode_plus(\n                    sentence,\n                    return_attention_mask=False,\n                    return_tensors=self.framework,\n                    max_length=self.tokenizer.max_len,\n                )\n\n                # Forward\n                if self.framework == \"tf\":\n                    entities = self.model(tokens.data)[0][0].numpy()\n                    input_ids = tokens[\"input_ids\"].numpy()[0]\n                else:\n                    with torch.no_grad():\n                        tokens = self.ensure_tensor_on_device(**tokens)\n                        entities = self.model(**tokens)[0][0].cpu().numpy()\n                        input_ids = tokens[\"input_ids\"].cpu().numpy()[0]\n\n            score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)\n            labels_idx = score.argmax(axis=-1)\n\n           
 entities = []\n            entity_groups = []\n            entity_group_disagg = []\n            # Filter to labels not in `self.ignore_labels`\n            filtered_labels_idx = [\n                (idx, label_idx)\n                for idx, label_idx in enumerate(labels_idx)\n                if self.model.config.id2label[label_idx] not in self.ignore_labels\n            ]\n\n            for idx, label_idx in filtered_labels_idx:\n\n                entity = {\n                    \"word\": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),\n                    \"score\": score[idx][label_idx].item(),\n                    \"entity\": self.model.config.id2label[label_idx],\n                    \"index\": idx,\n                }\n                last_idx, _ = filtered_labels_idx[-1]\n                if self.grouped_entities:\n                    if not entity_group_disagg:\n                        entity_group_disagg += [entity]\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                        continue\n\n                    # If the current entity is similar and adjacent to the previous entity, append it to the disaggregated entity group\n                    if (\n                        entity[\"entity\"] == entity_group_disagg[-1][\"entity\"]\n                        and entity[\"index\"] == entity_group_disagg[-1][\"index\"] + 1\n                    ):\n                        entity_group_disagg += [entity]\n                        # Group the entities at the last entity\n                        if idx == last_idx:\n                            entity_groups += [self.group_entities(entity_group_disagg)]\n                    # If the current entity is different from the previous entity, aggregate the disaggregated entity group\n                    else:\n                        entity_groups += [self.group_entities(entity_group_disagg)]\n                        entity_group_disagg = [entity]\n\n                entities += [entity]\n\n            # Append\n            if self.grouped_entities:\n                answers += [entity_groups]\n            else:\n                answers += [entities]\n\n        if len(answers) == 1:\n            return answers[0]\n        return answers\n\n    def group_entities(self, entities):\n        \"\"\"\n        Returns grouped entities\n        \"\"\"\n        # Get the last entity in the entity group\n        entity = entities[-1][\"entity\"]\n        scores = np.mean([entity[\"score\"] for entity in entities])\n        tokens = [entity[\"word\"] for entity in entities]\n\n        entity_group = {\n            \"entity_group\": entity,\n            \"score\": np.mean(scores),\n            \"word\": self.tokenizer.convert_tokens_to_string(tokens),\n        }\n        return entity_group\n\n\nTokenClassificationPipeline = NerPipeline\n\n\nclass QuestionAnsweringArgumentHandler(ArgumentHandler):\n    \"\"\"\n    QuestionAnsweringPipeline requires the user to provide multiple arguments (i.e. 
question & context) to be mapped\n    to internal SquadExample / SquadFeature structures.\n\n    QuestionAnsweringArgumentHandler manages all the possible to create SquadExample from the command-line supplied\n    arguments.\n    \"\"\"\n\n    def __call__(self, *args, **kwargs):\n        # Position args, handling is sensibly the same as X and data, so forwarding to avoid duplicating\n        if args is not None and len(args) > 0:\n            if len(args) == 1:\n                kwargs[\"X\"] = args[0]\n            else:\n                kwargs[\"X\"] = list(args)\n\n        # Generic compatibility with sklearn and Keras\n        # Batched data\n        if \"X\" in kwargs or \"data\" in kwargs:\n            inputs = kwargs[\"X\"] if \"X\" in kwargs else kwargs[\"data\"]\n\n            if isinstance(inputs, dict):\n                inputs = [inputs]\n            else:\n                # Copy to avoid overriding arguments\n                inputs = [i for i in inputs]\n\n            for i, item in enumerate(inputs):\n                if isinstance(item, dict):\n                    if any(k not in item for k in [\"question\", \"context\"]):\n                        raise KeyError(\"You need to provide a dictionary with keys {question:..., context:...}\")\n\n                    inputs[i] = QuestionAnsweringPipeline.create_sample(**item)\n\n                elif not isinstance(item, SquadExample):\n                    raise ValueError(\n                        \"{} argument needs to be of type (list[SquadExample | dict], SquadExample, dict)\".format(\n                            \"X\" if \"X\" in kwargs else \"data\"\n                        )\n                    )\n\n            # Tabular input\n        elif \"question\" in kwargs and \"context\" in kwargs:\n            if isinstance(kwargs[\"question\"], str):\n                kwargs[\"question\"] = [kwargs[\"question\"]]\n\n            if isinstance(kwargs[\"context\"], str):\n                kwargs[\"context\"] = [kwargs[\"context\"]]\n\n            inputs = [\n                QuestionAnsweringPipeline.create_sample(q, c) for q, c in zip(kwargs[\"question\"], kwargs[\"context\"])\n            ]\n        else:\n            raise ValueError(\"Unknown arguments {}\".format(kwargs))\n\n        if not isinstance(inputs, list):\n            inputs = [inputs]\n\n        return inputs\n\n\nclass QuestionAnsweringPipeline(Pipeline):\n    \"\"\"\n    Question Answering pipeline using ModelForQuestionAnswering head. See the\n    `question answering usage <../usage.html#question-answering>`__ examples for more information.\n\n    This question answering can currently be loaded from the :func:`~transformers1.pipeline` method using\n    the following task identifier(s):\n\n    - \"question-answering\", for answering questions given a context.\n\n    The models that this pipeline can use are models that have been fine-tuned on a question answering task.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=question-answering>`__.\n\n    Arguments:\n        model (:obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`):\n            The model that will be used by the pipeline to make predictions. 
This needs to be a model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n        tokenizer (:obj:`~transformers1.PreTrainedTokenizer`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from\n            :class:`~transformers1.PreTrainedTokenizer`.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    default_input_names = \"question,context\"\n\n    def __init__(\n        self,\n        model: Union[\"PreTrainedModel\", \"TFPreTrainedModel\"],\n        tokenizer: PreTrainedTokenizer,\n        modelcard: Optional[ModelCard] = None,\n        framework: Optional[str] = None,\n        device: int = -1,\n        task: str = \"\",\n        **kwargs\n    ):\n        super().__init__(\n            model=model,\n            tokenizer=tokenizer,\n            modelcard=modelcard,\n            framework=framework,\n            args_parser=QuestionAnsweringArgumentHandler(),\n            device=device,\n            task=task,\n            **kwargs,\n        )\n\n    @staticmethod\n    def create_sample(\n        question: Union[str, List[str]], context: Union[str, List[str]]\n    ) -> Union[SquadExample, List[SquadExample]]:\n        \"\"\"\n        QuestionAnsweringPipeline leverages the SquadExample/SquadFeatures internally.\n        This helper method encapsulate all the logic for converting question(s) and context(s) to SquadExample(s).\n        We currently support extractive question answering.\n        Arguments:\n             question: (str, List[str]) The question to be ask for the associated context\n             context: (str, List[str]) The context in which we will look for the answer.\n\n        Returns:\n            SquadExample initialized with the corresponding question and context.\n        \"\"\"\n        if isinstance(question, list):\n            return [SquadExample(None, q, c, None, None, None) for q, c in zip(question, context)]\n        else:\n            return SquadExample(None, question, context, None, None, None)\n\n    def __call__(self, *args, **kwargs):\n        \"\"\"\n        Args:\n            We support multiple use-cases, the following are exclusive:\n            X: sequence of SquadExample\n            data: sequence of SquadExample\n            question: (str, List[str]), batch of question(s) to map along with context\n            context: (str, List[str]), batch of context(s) associated with the provided question keyword argument\n        
Returns:\n            dict: {'answer': str, 'score': float, 'start': int, 'end': int}\n            answer: the textual answer in the initial context\n            score: the probability score associated with the answer\n            start: the character index in the original string corresponding to the beginning of the answer's span\n            end: the character index in the original string corresponding to the ending of the answer's span\n        \"\"\"\n        # Set default values\n        kwargs.setdefault(\"topk\", 1)\n        kwargs.setdefault(\"doc_stride\", 128)\n        kwargs.setdefault(\"max_answer_len\", 15)\n        kwargs.setdefault(\"max_seq_len\", 384)\n        kwargs.setdefault(\"max_question_len\", 64)\n        kwargs.setdefault(\"handle_impossible_answer\", False)\n\n        if kwargs[\"topk\"] < 1:\n            raise ValueError(\"topk parameter should be >= 1 (got {})\".format(kwargs[\"topk\"]))\n\n        if kwargs[\"max_answer_len\"] < 1:\n            raise ValueError(\"max_answer_len parameter should be >= 1 (got {})\".format(kwargs[\"max_answer_len\"]))\n\n        # Convert inputs to features\n        examples = self._args_parser(*args, **kwargs)\n        features_list = [\n            squad_convert_examples_to_features(\n                [example],\n                self.tokenizer,\n                kwargs[\"max_seq_len\"],\n                kwargs[\"doc_stride\"],\n                kwargs[\"max_question_len\"],\n                False,\n                tqdm_enabled=False,\n            )\n            for example in examples\n        ]\n        all_answers = []\n        for features, example in zip(features_list, examples):\n            model_input_names = self.tokenizer.model_input_names + [\"input_ids\"]\n            fw_args = {k: [feature.__dict__[k] for feature in features] for k in model_input_names}\n\n            # Manage tensor allocation on correct device\n            with self.device_placement():\n                if self.framework == \"tf\":\n                    fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}\n                    start, end = self.model(fw_args)\n                    start, end = start.numpy(), end.numpy()\n                else:\n                    with torch.no_grad():\n                        # Retrieve the score for the context tokens only (removing question tokens)\n                        fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}\n                        start, end = self.model(**fw_args)\n                        start, end = start.cpu().numpy(), end.cpu().numpy()\n\n            min_null_score = 1000000  # large and positive\n            answers = []\n            for (feature, start_, end_) in zip(features, start, end):\n                # Normalize logits and spans to retrieve the answer\n                start_ = np.exp(start_) / np.sum(np.exp(start_))\n                end_ = np.exp(end_) / np.sum(np.exp(end_))\n\n                # Mask padding and question\n                start_, end_ = (\n                    start_ * np.abs(np.array(feature.p_mask) - 1),\n                    end_ * np.abs(np.array(feature.p_mask) - 1),\n                )\n\n                if kwargs[\"handle_impossible_answer\"]:\n                    min_null_score = min(min_null_score, (start_[0] * end_[0]).item())\n\n                start_[0] = end_[0] = 0\n\n                starts, ends, scores = self.decode(start_, end_, kwargs[\"topk\"], kwargs[\"max_answer_len\"])\n                char_to_word = 
np.array(example.char_to_word_offset)\n\n                # Convert the answer (tokens) back to the original text\n                answers += [\n                    {\n                        \"score\": score.item(),\n                        \"start\": np.where(char_to_word == feature.token_to_orig_map[s])[0][0].item(),\n                        \"end\": np.where(char_to_word == feature.token_to_orig_map[e])[0][-1].item(),\n                        \"answer\": \" \".join(\n                            example.doc_tokens[feature.token_to_orig_map[s] : feature.token_to_orig_map[e] + 1]\n                        ),\n                    }\n                    for s, e, score in zip(starts, ends, scores)\n                ]\n\n            if kwargs[\"handle_impossible_answer\"]:\n                answers.append({\"score\": min_null_score, \"start\": 0, \"end\": 0, \"answer\": \"\"})\n\n            answers = sorted(answers, key=lambda x: x[\"score\"], reverse=True)[: kwargs[\"topk\"]]\n            all_answers += answers\n\n        if len(all_answers) == 1:\n            return all_answers[0]\n        return all_answers\n\n    def decode(self, start: np.ndarray, end: np.ndarray, topk: int, max_answer_len: int) -> Tuple:\n        \"\"\"\n        Takes the output of any QuestionAnswering head and generates probabilities for each span to be\n        the actual answer.\n        In addition, it filters out some unwanted/impossible cases like answer len being greater than\n        max_answer_len or answer end position being before the starting position.\n        The method supports outputting the k-best answers through the topk argument.\n\n        Args:\n            start: numpy array, holding individual start probabilities for each token\n            end: numpy array, holding individual end probabilities for each token\n            topk: int, indicates how many possible answer span(s) to extract from the model's output\n            max_answer_len: int, maximum size of the answer to extract from the model's output\n        \"\"\"\n        # Ensure we have batch axis\n        if start.ndim == 1:\n            start = start[None]\n\n        if end.ndim == 1:\n            end = end[None]\n\n        # Compute the score of each tuple(start, end) to be the real answer\n        outer = np.matmul(np.expand_dims(start, -1), np.expand_dims(end, 1))\n\n        # Remove candidates with end < start and end - start > max_answer_len\n        candidates = np.tril(np.triu(outer), max_answer_len - 1)\n\n        #  Inspired by Chen & al. 
(https://github.com/facebookresearch/DrQA)\n        scores_flat = candidates.flatten()\n        if topk == 1:\n            idx_sort = [np.argmax(scores_flat)]\n        elif len(scores_flat) < topk:\n            idx_sort = np.argsort(-scores_flat)\n        else:\n            idx = np.argpartition(-scores_flat, topk)[0:topk]\n            idx_sort = idx[np.argsort(-scores_flat[idx])]\n\n        start, end = np.unravel_index(idx_sort, candidates.shape)[1:]\n        return start, end, candidates[0, start, end]\n\n    def span_to_answer(self, text: str, start: int, end: int):\n        \"\"\"\n        When decoding from token probalities, this method maps token indexes to actual word in\n        the initial context.\n\n        Args:\n            text: str, the actual context to extract the answer from\n            start: int, starting answer token index\n            end: int, ending answer token index\n\n        Returns:\n            dict: {'answer': str, 'start': int, 'end': int}\n        \"\"\"\n        words = []\n        token_idx = char_start_idx = char_end_idx = chars_idx = 0\n\n        for i, word in enumerate(text.split(\" \")):\n            token = self.tokenizer.tokenize(word)\n\n            # Append words if they are in the span\n            if start <= token_idx <= end:\n                if token_idx == start:\n                    char_start_idx = chars_idx\n\n                if token_idx == end:\n                    char_end_idx = chars_idx + len(word)\n\n                words += [word]\n\n            # Stop if we went over the end of the answer\n            if token_idx > end:\n                break\n\n            # Append the subtokenization length to the running index\n            token_idx += len(token)\n            chars_idx += len(word) + 1\n\n        # Join text with spaces\n        return {\n            \"answer\": \" \".join(words),\n            \"start\": max(0, char_start_idx),\n            \"end\": min(len(text), char_end_idx),\n        }\n\n\nclass SummarizationPipeline(Pipeline):\n    \"\"\"\n    Summarize news articles and other documents\n\n    Usage::\n\n        # use bart in pytorch\n        summarizer = pipeline(\"summarization\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n        # use t5 in tf\n        summarizer = pipeline(\"summarization\", model=\"t5-base\", tokenizer=\"t5-base\", framework=\"tf\")\n        summarizer(\"Sam Shleifer writes the best docstring examples in the whole world.\", min_length=5, max_length=20)\n\n    The models that this pipeline can use are models that have been fine-tuned on a summarization task,\n    which is currently, '`bart-large-cnn`', '`t5-small`', '`t5-base`', '`t5-large`', '`t5-3b`', '`t5-11b`'.\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=summarization>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *documents, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *documents: (list of strings) articles to be summarized\n            return_text: (bool, default=True) whether to add a decoded \"summary_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"summary_token_ids\" to each result\n\n            clean_up_tokenization_spaces: (`optional`) bool whether to include extra spaces in the output\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'summary_text' and/or 'summary_token_ids' for each document_to_summarize\n\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n        assert len(documents) > 0, \"Please provide a document to summarize\"\n\n        if self.framework == \"tf\" and \"BartForConditionalGeneration\" in self.model.__class__.__name__:\n            raise NotImplementedError(\n                \"Tensorflow is not yet supported for Bart. Please consider using T5, e.g. 
`t5-base`\"\n            )\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(documents[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n\n            documents = ([prefix + document for document in documents[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(documents[0], str):\n            documents = (prefix + documents[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `documents[0]`: {} has the wrong format. It should be either of type `str` or type `list`\".format(\n                    documents[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*documents, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            min_length = generate_kwargs.get(\"min_length\", self.model.config.min_length)\n            if input_length < min_length // 2:\n                logger.warning(\n                    \"Your min_length is set to {}, but your input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)\".format(\n                        min_length, input_length\n                    )\n                )\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length < max_length:\n                logger.warning(\n                    \"Your max_length is set to {}, but your input_length is only {}. You might consider decreasing max_length manually, e.g. 
summarizer('...', max_length=50)\".format(\n                        max_length, input_length\n                    )\n                )\n\n            summaries = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n\n            results = []\n            for summary in summaries:\n                record = {}\n                if return_tensors:\n                    record[\"summary_token_ids\"] = summary\n                if return_text:\n                    record[\"summary_text\"] = self.tokenizer.decode(\n                        summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\nclass TranslationPipeline(Pipeline):\n    \"\"\"\n    Translates from one language to another.\n\n    Usage::\n        en_fr_translator = pipeline(\"translation_en_to_fr\")\n        en_fr_translator(\"How old are you?\")\n\n    The models that this pipeline can use are models that have been fine-tuned on a translation task,\n    currently: \"t5-small\", \"t5-base\", \"t5-large\", \"t5-3b\", \"t5-11b\"\n    See the up-to-date list of available models on\n    `huggingface.co/models <https://huggingface.co/models?filter=translation>`__.\n\n    Arguments:\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. This can be :obj:`None`, a string\n            checkpoint identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n            If :obj:`None`, the default of the pipeline will be loaded.\n        modelcard (:obj:`str` or :class:`~transformers1.ModelCard`, `optional`, defaults to :obj:`None`):\n            Model card attributed to the model for this pipeline.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n        args_parser (:class:`~transformers1.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):\n            Reference to the object in charge of parsing supplied pipeline parameters.\n        device (:obj:`int`, `optional`, defaults to :obj:`-1`):\n            Device ordinal for CPU/GPU supports. 
Setting this to -1 will leverage CPU, >=0 will run the model\n            on the associated CUDA device id.\n    \"\"\"\n\n    def __call__(\n        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs\n    ):\n        r\"\"\"\n        Args:\n            *args: (list of strings) texts to be translated\n            return_text: (bool, default=True) whether to add a decoded \"translation_text\" to each result\n            return_tensors: (bool, default=False) whether to return the raw \"translation_token_ids\" to each result\n\n            **generate_kwargs: extra kwargs passed to `self.model.generate`_\n\n        Returns:\n            list of dicts with 'translation_text' and/or 'translation_token_ids' for each text_to_translate\n        .. _`self.model.generate`:\n            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate\n        \"\"\"\n        assert return_tensors or return_text, \"You must specify return_tensors=True or return_text=True\"\n\n        prefix = self.model.config.prefix if self.model.config.prefix is not None else \"\"\n\n        if isinstance(args[0], list):\n            assert (\n                self.tokenizer.pad_token_id is not None\n            ), \"Please make sure that the tokenizer has a pad_token_id when using a batch input\"\n            args = ([prefix + text for text in args[0]],)\n            pad_to_max_length = True\n\n        elif isinstance(args[0], str):\n            args = (prefix + args[0],)\n            pad_to_max_length = False\n        else:\n            raise ValueError(\n                \" `args[0]`: {} has the wrong format. It should be either of type `str` or type `list`\".format(\n                    args[0]\n                )\n            )\n\n        with self.device_placement():\n            inputs = self._parse_and_tokenize(*args, pad_to_max_length=pad_to_max_length)\n\n            if self.framework == \"pt\":\n                inputs = self.ensure_tensor_on_device(**inputs)\n                input_length = inputs[\"input_ids\"].shape[-1]\n\n            elif self.framework == \"tf\":\n                input_length = tf.shape(inputs[\"input_ids\"])[-1].numpy()\n\n            max_length = generate_kwargs.get(\"max_length\", self.model.config.max_length)\n            if input_length > 0.9 * max_length:\n                logger.warning(\n                    \"Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. 
translator('...', max_length=400)\".format(\n                        input_length, max_length\n                    )\n                )\n\n            translations = self.model.generate(\n                inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"], **generate_kwargs,\n            )\n            results = []\n            for translation in translations:\n                record = {}\n                if return_tensors:\n                    record[\"translation_token_ids\"] = translation\n                if return_text:\n                    record[\"translation_text\"] = self.tokenizer.decode(\n                        translation,\n                        skip_special_tokens=True,\n                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,\n                    )\n                results.append(record)\n            return results\n\n\n# Register all the supported tasks here\nSUPPORTED_TASKS = {\n    \"feature-extraction\": {\n        \"impl\": FeatureExtractionPipeline,\n        \"tf\": TFAutoModel if is_tf_available() else None,\n        \"pt\": AutoModel if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased\", \"tf\": \"distilbert-base-cased\"},\n            \"config\": None,\n            \"tokenizer\": \"distilbert-base-cased\",\n        },\n    },\n    \"sentiment-analysis\": {\n        \"impl\": TextClassificationPipeline,\n        \"tf\": TFAutoModelForSequenceClassification if is_tf_available() else None,\n        \"pt\": AutoModelForSequenceClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n                \"tf\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            },\n            \"config\": \"distilbert-base-uncased-finetuned-sst-2-english\",\n            \"tokenizer\": \"distilbert-base-uncased\",\n        },\n    },\n    \"ner\": {\n        \"impl\": NerPipeline,\n        \"tf\": TFAutoModelForTokenClassification if is_tf_available() else None,\n        \"pt\": AutoModelForTokenClassification if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\n                \"pt\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n                \"tf\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            },\n            \"config\": \"dbmdz/bert-large-cased-finetuned-conll03-english\",\n            \"tokenizer\": \"bert-large-cased\",\n        },\n    },\n    \"question-answering\": {\n        \"impl\": QuestionAnsweringPipeline,\n        \"tf\": TFAutoModelForQuestionAnswering if is_tf_available() else None,\n        \"pt\": AutoModelForQuestionAnswering if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilbert-base-cased-distilled-squad\", \"tf\": \"distilbert-base-cased-distilled-squad\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilbert-base-cased\", {\"use_fast\": False}),\n        },\n    },\n    \"fill-mask\": {\n        \"impl\": FillMaskPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"distilroberta-base\", \"tf\": \"distilroberta-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"distilroberta-base\", {\"use_fast\": False}),\n        },\n    
},\n    \"summarization\": {\n        \"impl\": SummarizationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"facebook/bart-large-cnn\", \"tf\": \"t5-small\"}, \"config\": None, \"tokenizer\": None},\n    },\n    \"translation_en_to_fr\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_de\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"translation_en_to_ro\": {\n        \"impl\": TranslationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\n            \"model\": {\"pt\": \"t5-base\", \"tf\": \"t5-base\"},\n            \"config\": None,\n            \"tokenizer\": (\"t5-base\", {\"use_fast\": False}),\n        },\n    },\n    \"text-generation\": {\n        \"impl\": TextGenerationPipeline,\n        \"tf\": TFAutoModelWithLMHead if is_tf_available() else None,\n        \"pt\": AutoModelWithLMHead if is_torch_available() else None,\n        \"default\": {\"model\": {\"pt\": \"gpt2\", \"tf\": \"gpt2\"}, \"config\": None, \"tokenizer\": \"gpt2\"},\n    },\n}\n\n\ndef pipeline(\n    task: str,\n    model: Optional = None,\n    config: Optional[Union[str, PretrainedConfig]] = None,\n    tokenizer: Optional[Union[str, PreTrainedTokenizer]] = None,\n    framework: Optional[str] = None,\n    **kwargs\n) -> Pipeline:\n    \"\"\"\n    Utility factory method to build a pipeline.\n\n    Pipeline are made of:\n\n        - A Tokenizer instance in charge of mapping raw textual input to token\n        - A Model instance\n        - Some (optional) post processing for enhancing model's output\n\n\n    Args:\n        task (:obj:`str`):\n            The task defining which pipeline will be returned. Currently accepted tasks are:\n\n            - \"feature-extraction\": will return a :class:`~transformers1.FeatureExtractionPipeline`\n            - \"sentiment-analysis\": will return a :class:`~transformers1.TextClassificationPipeline`\n            - \"ner\": will return a :class:`~transformers1.NerPipeline`\n            - \"question-answering\": will return a :class:`~transformers1.QuestionAnsweringPipeline`\n            - \"fill-mask\": will return a :class:`~transformers1.FillMaskPipeline`\n            - \"summarization\": will return a :class:`~transformers1.SummarizationPipeline`\n            - \"translation_xx_to_yy\": will return a :class:`~transformers1.TranslationPipeline`\n        model (:obj:`str` or :obj:`~transformers1.PreTrainedModel` or :obj:`~transformers1.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):\n            The model that will be used by the pipeline to make predictions. 
This can be :obj:`None`,\n            a model identifier or an actual pre-trained model inheriting from\n            :class:`~transformers1.PreTrainedModel` for PyTorch and :class:`~transformers1.TFPreTrainedModel` for\n            TensorFlow.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        config (:obj:`str` or :obj:`~transformers1.PretrainedConfig`, `optional`, defaults to :obj:`None`):\n            The configuration that will be used by the pipeline to instantiate the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained model configuration inheriting from\n            :class:`~transformers1.PretrainedConfig`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        tokenizer (:obj:`str` or :obj:`~transformers1.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):\n            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,\n            a model identifier or an actual pre-trained tokenizer inheriting from\n            :class:`~transformers1.PreTrainedTokenizer`.\n\n            If :obj:`None`, the default for this pipeline will be loaded.\n        framework (:obj:`str`, `optional`, defaults to :obj:`None`):\n            The framework to use, either \"pt\" for PyTorch or \"tf\" for TensorFlow. The specified framework must be\n            installed.\n\n            If no framework is specified, will default to the one currently installed. If no framework is specified\n            and both frameworks are installed, will default to PyTorch.\n\n    Returns:\n        :class:`~transformers.Pipeline`: Class inheriting from :class:`~transformers1.Pipeline`, according to\n        the task.\n\n    Examples::\n\n        from transformers1 import pipeline, AutoModelForTokenClassification, AutoTokenizer\n\n        # Sentiment analysis pipeline\n        pipeline('sentiment-analysis')\n\n        # Question answering pipeline, specifying the checkpoint identifier\n        pipeline('question-answering', model='distilbert-base-cased-distilled-squad', tokenizer='bert-base-cased')\n\n        # Named entity recognition pipeline, passing in a specific model and tokenizer\n        model = AutoModelForTokenClassification.from_pretrained(\"dbmdz/bert-large-cased-finetuned-conll03-english\")\n        tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n        pipeline('ner', model=model, tokenizer=tokenizer)\n    \"\"\"\n    # Retrieve the task\n    if task not in SUPPORTED_TASKS:\n        raise KeyError(\"Unknown task {}, available tasks are {}\".format(task, list(SUPPORTED_TASKS.keys())))\n\n    framework = framework or get_framework(model)\n\n    targeted_task = SUPPORTED_TASKS[task]\n    task_class, model_class = targeted_task[\"impl\"], targeted_task[framework]\n\n    # Use default model/config/tokenizer for the task if no model is provided\n    if model is None:\n        models, config, tokenizer = [targeted_task[\"default\"][k] for k in [\"model\", \"config\", \"tokenizer\"]]\n        model = models[framework]\n\n    # Try to infer tokenizer from model or config name (if provided as str)\n    if tokenizer is None:\n        if isinstance(model, str) and model in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = model\n        elif isinstance(config, str) and config in ALL_PRETRAINED_CONFIG_ARCHIVE_MAP:\n            tokenizer = config\n        else:\n            # Impossible to guest what is the right tokenizer here\n            
raise Exception(\n                \"Impossible to guess which tokenizer to use. \"\n                \"Please provide a PreTrainedTokenizer class or a path/identifier to a pretrained tokenizer.\"\n            )\n\n    modelcard = None\n    # Try to infer modelcard from model or config name (if provided as str)\n    if isinstance(model, str):\n        modelcard = model\n    elif isinstance(config, str):\n        modelcard = config\n\n    # Instantiate tokenizer if needed\n    if isinstance(tokenizer, (str, tuple)):\n        if isinstance(tokenizer, tuple):\n            # For tuple we have (tokenizer name, {kwargs})\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], **tokenizer[1])\n        else:\n            tokenizer = AutoTokenizer.from_pretrained(tokenizer)\n\n    # Instantiate config if needed\n    if isinstance(config, str):\n        config = AutoConfig.from_pretrained(config)\n\n    # Instantiate modelcard if needed\n    if isinstance(modelcard, str):\n        modelcard = ModelCard.from_pretrained(modelcard)\n\n    # Instantiate model if needed\n    if isinstance(model, str):\n        # Handle transparent TF/PT model conversion\n        model_kwargs = {}\n        if framework == \"pt\" and model.endswith(\".h5\"):\n            model_kwargs[\"from_tf\"] = True\n            logger.warning(\n                \"Model might be a TensorFlow model (ending with `.h5`) but TensorFlow is not available. \"\n                \"Trying to load the model with PyTorch.\"\n            )\n        elif framework == \"tf\" and model.endswith(\".bin\"):\n            model_kwargs[\"from_pt\"] = True\n            logger.warning(\n                \"Model might be a PyTorch model (ending with `.bin`) but PyTorch is not available. \"\n                \"Trying to load the model with TensorFlow.\"\n            )\n        model = model_class.from_pretrained(model, config=config, **model_kwargs)\n\n    return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_albert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for ALBERT model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"albert-base-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v1-spiece.model\",\n        \"albert-large-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v1-spiece.model\",\n        \"albert-xlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v1-spiece.model\",\n        \"albert-xxlarge-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v1-spiece.model\",\n        \"albert-base-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model\",\n        \"albert-large-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model\",\n        \"albert-xlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model\",\n        \"albert-xxlarge-v2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"albert-base-v1\": 512,\n    \"albert-large-v1\": 512,\n    \"albert-xlarge-v1\": 512,\n    \"albert-xxlarge-v1\": 512,\n    \"albert-base-v2\": 512,\n    \"albert-large-v2\": 512,\n    \"albert-xlarge-v2\": 512,\n    \"albert-xxlarge-v2\": 512,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n\nclass AlbertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an ALBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The beginning of sequence token that was used during pre-training. 
Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"[CLS]\",\n        eos_token=\"[SEP]\",\n        unk_token=\"<unk>\",\n        sep_token=\"[SEP]\",\n        pad_token=\"<pad>\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. 
\"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An ALBERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return cls + token_ids_0 + sep\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An ALBERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_auto.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Auto Tokenizer class. \"\"\"\n\n\nimport logging\nfrom collections import OrderedDict\n\nfrom .configuration_auto import (\n    AlbertConfig,\n    AutoConfig,\n    BartConfig,\n    BertConfig,\n    CamembertConfig,\n    CTRLConfig,\n    DistilBertConfig,\n    ElectraConfig,\n    FlaubertConfig,\n    GPT2Config,\n    LongformerConfig,\n    OpenAIGPTConfig,\n    ReformerConfig,\n    RobertaConfig,\n    T5Config,\n    TransfoXLConfig,\n    XLMConfig,\n    XLMRobertaConfig,\n    XLNetConfig,\n)\nfrom .configuration_marian import MarianConfig\nfrom .configuration_utils import PretrainedConfig\nfrom .tokenization_albert import AlbertTokenizer\nfrom .tokenization_bart import BartTokenizer\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\nfrom .tokenization_bert_japanese import BertJapaneseTokenizer\nfrom .tokenization_camembert import CamembertTokenizer\nfrom .tokenization_ctrl import CTRLTokenizer\nfrom .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast\nfrom .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast\nfrom .tokenization_flaubert import FlaubertTokenizer\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_longformer import LongformerTokenizer\nfrom .tokenization_marian import MarianTokenizer\nfrom .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast\nfrom .tokenization_reformer import ReformerTokenizer\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\nfrom .tokenization_t5 import T5Tokenizer\nfrom .tokenization_transfo_xl import TransfoXLTokenizer, TransfoXLTokenizerFast\nfrom .tokenization_xlm import XLMTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\nfrom .tokenization_xlnet import XLNetTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\nTOKENIZER_MAPPING = OrderedDict(\n    [\n        (T5Config, (T5Tokenizer, None)),\n        (DistilBertConfig, (DistilBertTokenizer, DistilBertTokenizerFast)),\n        (AlbertConfig, (AlbertTokenizer, None)),\n        (CamembertConfig, (CamembertTokenizer, None)),\n        (XLMRobertaConfig, (XLMRobertaTokenizer, None)),\n        (MarianConfig, (MarianTokenizer, None)),\n        (BartConfig, (BartTokenizer, None)),\n        (LongformerConfig, (LongformerTokenizer, None)),\n        (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),\n        (ReformerConfig, (ReformerTokenizer, None)),\n        (ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),\n        (BertConfig, (BertTokenizer, BertTokenizerFast)),\n        (OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),\n        (GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),\n        (TransfoXLConfig, (TransfoXLTokenizer, TransfoXLTokenizerFast)),\n        (XLNetConfig, (XLNetTokenizer, None)),\n        (FlaubertConfig, (FlaubertTokenizer, None)),\n        (XLMConfig, (XLMTokenizer, 
None)),\n        (CTRLConfig, (CTRLTokenizer, None)),\n    ]\n)\n\n\nclass AutoTokenizer:\n    r\"\"\":class:`~transformers1.AutoTokenizer` is a generic tokenizer class\n        that will be instantiated as one of the tokenizer classes of the library\n        when created with the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)`\n        class method.\n\n        The `from_pretrained()` method takes care of returning the correct tokenizer class instance\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        This class cannot be instantiated using `__init__()` (throw an error).\n    \"\"\"\n\n    def __init__(self):\n        raise EnvironmentError(\n            \"AutoTokenizer is designed to be instantiated \"\n            \"using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method.\"\n        )\n\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):\n        r\"\"\" Instantiate one of the tokenizer classes of the library\n        from a pre-trained model vocabulary.\n\n        The tokenizer class to instantiate is selected\n        based on the `model_type` property of the config object, or when it's missing,\n        falling back to using pattern matching on the `pretrained_model_name_or_path` string:\n            - `t5`: T5Tokenizer (T5 model)\n            - `distilbert`: DistilBertTokenizer (DistilBert model)\n            - `albert`: AlbertTokenizer (ALBERT model)\n            - `camembert`: CamembertTokenizer (CamemBERT model)\n            - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)\n            - `longformer`: LongformerTokenizer (AllenAI Longformer model)\n            - `roberta`: RobertaTokenizer (RoBERTa model)\n            - `bert-base-japanese`: BertJapaneseTokenizer (Bert model)\n            - `bert`: BertTokenizer (Bert model)\n            - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)\n            - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)\n            - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)\n            - `xlnet`: XLNetTokenizer (XLNet model)\n            - `xlm`: XLMTokenizer (XLM model)\n            - `ctrl`: CTRLTokenizer (Salesforce CTRL model)\n            - `electra`: ElectraTokenizer (Google ELECTRA model)\n\n        Params:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n      
          - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exists.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            use_fast: (`optional`) boolean, default False:\n                Indicate if transformers1 should try to load the fast version of the tokenizer (True) or use the Python one (False).\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. 
tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')\n\n        \"\"\"\n        config = kwargs.pop(\"config\", None)\n        if not isinstance(config, PretrainedConfig):\n            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)\n\n        if \"bert-base-japanese\" in pretrained_model_name_or_path:\n            return BertJapaneseTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        use_fast = kwargs.pop(\"use_fast\", False)\n        for config_class, (tokenizer_class_py, tokenizer_class_fast) in TOKENIZER_MAPPING.items():\n            if isinstance(config, config_class):\n                if tokenizer_class_fast and use_fast:\n                    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n                else:\n                    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n\n        raise ValueError(\n            \"Unrecognized configuration class {} to build an AutoTokenizer.\\n\"\n            \"Model type should be one of {}.\".format(\n                config.__class__, \", \".join(c.__name__ for c in TOKENIZER_MAPPING.keys())\n            )\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_bart.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer\nfrom .tokenization_xlm_roberta import XLMRobertaTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_bart_models = [\n    \"facebook/bart-large\",\n    \"facebook/bart-large-mnli\",\n    \"facebook/bart-large-cnn\",\n    \"facebook/bart-large-xsum\",\n]\n\n\nclass BartTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = {m: 1024 for m in _all_bart_models}\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_bart_models},\n        \"merges_file\": {m: merges_url for m in _all_bart_models},\n    }\n\n\n_all_mbart_models = [\"facebook/mbart-large-en-ro\"]\nSPM_URL = \"https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model\"\n\n\nclass MBartTokenizer(XLMRobertaTokenizer):\n    vocab_files_names = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n    max_model_input_sizes = {m: 1024 for m in _all_mbart_models}\n    pretrained_vocab_files_map = {\"vocab_file\": {m: SPM_URL for m in _all_mbart_models}}\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_bert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import List, Optional\n\nfrom tokenizers import BertWordPieceTokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"bert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"bert-large-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"bert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"bert-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"bert-base-multilingual-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt\",\n        \"bert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n        \"bert-base-chinese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt\",\n        \"bert-base-german-cased\": \"https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt\",\n        \"bert-large-cased-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt\",\n        \"bert-large-uncased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-large-cased-whole-word-masking-finetuned-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt\",\n        \"bert-base-cased-finetuned-mrpc\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt\",\n        \"bert-base-german-dbmdz-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt\",\n        \"bert-base-german-dbmdz-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-cased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/vocab.txt\",\n        \"TurkuNLP/bert-base-finnish-uncased-v1\": \"https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/vocab.txt\",\n        
\"wietsedv/bert-base-dutch-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"bert-base-uncased\": 512,\n    \"bert-large-uncased\": 512,\n    \"bert-base-cased\": 512,\n    \"bert-large-cased\": 512,\n    \"bert-base-multilingual-uncased\": 512,\n    \"bert-base-multilingual-cased\": 512,\n    \"bert-base-chinese\": 512,\n    \"bert-base-german-cased\": 512,\n    \"bert-large-uncased-whole-word-masking\": 512,\n    \"bert-large-cased-whole-word-masking\": 512,\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": 512,\n    \"bert-base-cased-finetuned-mrpc\": 512,\n    \"bert-base-german-dbmdz-cased\": 512,\n    \"bert-base-german-dbmdz-uncased\": 512,\n    \"TurkuNLP/bert-base-finnish-cased-v1\": 512,\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": 512,\n    \"wietsedv/bert-base-dutch-cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"bert-base-uncased\": {\"do_lower_case\": True},\n    \"bert-large-uncased\": {\"do_lower_case\": True},\n    \"bert-base-cased\": {\"do_lower_case\": False},\n    \"bert-large-cased\": {\"do_lower_case\": False},\n    \"bert-base-multilingual-uncased\": {\"do_lower_case\": True},\n    \"bert-base-multilingual-cased\": {\"do_lower_case\": False},\n    \"bert-base-chinese\": {\"do_lower_case\": False},\n    \"bert-base-german-cased\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking\": {\"do_lower_case\": False},\n    \"bert-large-uncased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": True},\n    \"bert-large-cased-whole-word-masking-finetuned-squad\": {\"do_lower_case\": False},\n    \"bert-base-cased-finetuned-mrpc\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-cased\": {\"do_lower_case\": False},\n    \"bert-base-german-dbmdz-uncased\": {\"do_lower_case\": True},\n    \"TurkuNLP/bert-base-finnish-cased-v1\": {\"do_lower_case\": False},\n    \"TurkuNLP/bert-base-finnish-uncased-v1\": {\"do_lower_case\": True},\n    \"wietsedv/bert-base-dutch-cased\": {\"do_lower_case\": False},\n}\n\n\ndef load_vocab(vocab_file):\n    \"\"\"Loads a vocabulary file into a dictionary.\"\"\"\n    vocab = collections.OrderedDict()\n    with open(vocab_file, \"r\", encoding=\"utf-8\") as reader:\n        tokens = reader.readlines()\n    for index, token in enumerate(tokens):\n        token = token.rstrip(\"\\n\")\n        vocab[token] = index\n    return vocab\n\n\ndef whitespace_tokenize(text):\n    \"\"\"Runs basic whitespace cleaning and splitting on a piece of text.\"\"\"\n    text = text.strip()\n    if not text:\n        return []\n    tokens = text.split()\n    return tokens\n\n\nclass BertTokenizer(PreTrainedTokenizer):\n    r\"\"\"\n    Constructs a BERT tokenizer. Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to do basic tokenization before WordPiece.\n        never_split (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            List of tokens which will never be split during tokenization. Only has an effect when\n            :obj:`do_basic_tokenize=True`\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        do_basic_tokenize=True,\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        tokenize_chinese_chars=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n        self.do_basic_tokenize = do_basic_tokenize\n        if do_basic_tokenize:\n            self.basic_tokenizer = BasicTokenizer(\n                do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=tokenize_chinese_chars\n            )\n        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n\n    @property\n    def vocab_size(self):\n        return len(self.vocab)\n\n    def get_vocab(self):\n        return dict(self.vocab, **self.added_tokens_encoder)\n\n    def _tokenize(self, text):\n        split_tokens = []\n        if self.do_basic_tokenize:\n            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):\n                for sub_token in self.wordpiece_tokenizer.tokenize(token):\n                    split_tokens.append(sub_token)\n        else:\n            split_tokens = self.wordpiece_tokenizer.tokenize(text)\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.vocab.get(token, self.vocab.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.ids_to_tokens.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\" ##\", \"\").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A BERT sequence has the following format:\n\n        - single sequence: ``[CLS] X [SEP]``\n        - pair of sequences: ``[CLS] A [SEP] B [SEP]``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        index = 0\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as writer:\n            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: vocabulary indices are not consecutive.\"\n                        \" Please check that the vocabulary is not corrupted!\".format(vocab_file)\n                    )\n         
           index = token_index\n                writer.write(token + \"\\n\")\n                index += 1\n        return (vocab_file,)\n\n\nclass BasicTokenizer(object):\n    \"\"\"Runs basic tokenization (punctuation splitting, lower casing, etc.).\"\"\"\n\n    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True):\n        \"\"\" Constructs a BasicTokenizer.\n\n        Args:\n            **do_lower_case**: Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **tokenize_chinese_chars**: (`optional`) boolean (default True)\n                Whether to tokenize Chinese characters.\n                This should likely be deactivated for Japanese:\n                see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328\n        \"\"\"\n        if never_split is None:\n            never_split = []\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split\n        self.tokenize_chinese_chars = tokenize_chinese_chars\n\n    def tokenize(self, text, never_split=None):\n        \"\"\" Basic Tokenization of a piece of text.\n            Split on \"white spaces\" only, for sub-word tokenization, see WordPieceTokenizer.\n\n        Args:\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n        \"\"\"\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        text = self._clean_text(text)\n        # This was added on November 1st, 2018 for the multilingual and Chinese\n        # models. 
This is also applied to the English models now, but it doesn't\n        # matter since the English models were not trained on any Chinese data\n        # and generally don't have any Chinese data in them (there are Chinese\n        # characters in the vocabulary because Wikipedia does have some Chinese\n        # words in the English Wikipedia.).\n        if self.tokenize_chinese_chars:\n            text = self._tokenize_chinese_chars(text)\n        orig_tokens = whitespace_tokenize(text)\n        split_tokens = []\n        for token in orig_tokens:\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n                token = self._run_strip_accents(token)\n            split_tokens.extend(self._run_split_on_punc(token, never_split))\n\n        output_tokens = whitespace_tokenize(\" \".join(split_tokens))\n        return output_tokens\n\n    def _run_strip_accents(self, text):\n        \"\"\"Strips accents from a piece of text.\"\"\"\n        text = unicodedata.normalize(\"NFD\", text)\n        output = []\n        for char in text:\n            cat = unicodedata.category(char)\n            if cat == \"Mn\":\n                continue\n            output.append(char)\n        return \"\".join(output)\n\n    def _run_split_on_punc(self, text, never_split=None):\n        \"\"\"Splits punctuation on a piece of text.\"\"\"\n        if never_split is not None and text in never_split:\n            return [text]\n        chars = list(text)\n        i = 0\n        start_new_word = True\n        output = []\n        while i < len(chars):\n            char = chars[i]\n            if _is_punctuation(char):\n                output.append([char])\n                start_new_word = True\n            else:\n                if start_new_word:\n                    output.append([])\n                start_new_word = False\n                output[-1].append(char)\n            i += 1\n\n        return [\"\".join(x) for x in output]\n\n    def _tokenize_chinese_chars(self, text):\n        \"\"\"Adds whitespace around any CJK character.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if self._is_chinese_char(cp):\n                output.append(\" \")\n                output.append(char)\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n    def _is_chinese_char(self, cp):\n        \"\"\"Checks whether CP is the codepoint of a CJK character.\"\"\"\n        # This defines a \"chinese character\" as anything in the CJK Unicode block:\n        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)\n        #\n        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,\n        # despite its name. The modern Korean Hangul alphabet is a different block,\n        # as is Japanese Hiragana and Katakana. 
Those alphabets are used to write\n        # space-separated words, so they are not treated specially and handled\n        # like the all of the other languages.\n        if (\n            (cp >= 0x4E00 and cp <= 0x9FFF)\n            or (cp >= 0x3400 and cp <= 0x4DBF)  #\n            or (cp >= 0x20000 and cp <= 0x2A6DF)  #\n            or (cp >= 0x2A700 and cp <= 0x2B73F)  #\n            or (cp >= 0x2B740 and cp <= 0x2B81F)  #\n            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #\n            or (cp >= 0xF900 and cp <= 0xFAFF)\n            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #\n        ):  #\n            return True\n\n        return False\n\n    def _clean_text(self, text):\n        \"\"\"Performs invalid character removal and whitespace cleanup on text.\"\"\"\n        output = []\n        for char in text:\n            cp = ord(char)\n            if cp == 0 or cp == 0xFFFD or _is_control(char):\n                continue\n            if _is_whitespace(char):\n                output.append(\" \")\n            else:\n                output.append(char)\n        return \"\".join(output)\n\n\nclass WordpieceTokenizer(object):\n    \"\"\"Runs WordPiece tokenization.\"\"\"\n\n    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.max_input_chars_per_word = max_input_chars_per_word\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into its word pieces.\n\n        This uses a greedy longest-match-first algorithm to perform tokenization\n        using the given vocabulary.\n\n        For example:\n          input = \"unaffable\"\n          output = [\"un\", \"##aff\", \"##able\"]\n\n        Args:\n          text: A single token or whitespace separated tokens. 
This should have\n            already been passed through `BasicTokenizer`.\n\n        Returns:\n          A list of wordpiece tokens.\n        \"\"\"\n\n        output_tokens = []\n        for token in whitespace_tokenize(text):\n            chars = list(token)\n            if len(chars) > self.max_input_chars_per_word:\n                output_tokens.append(self.unk_token)\n                continue\n\n            is_bad = False\n            start = 0\n            sub_tokens = []\n            while start < len(chars):\n                end = len(chars)\n                cur_substr = None\n                while start < end:\n                    substr = \"\".join(chars[start:end])\n                    if start > 0:\n                        substr = \"##\" + substr\n                    if substr in self.vocab:\n                        cur_substr = substr\n                        break\n                    end -= 1\n                if cur_substr is None:\n                    is_bad = True\n                    break\n                sub_tokens.append(cur_substr)\n                start = end\n\n            if is_bad:\n                output_tokens.append(self.unk_token)\n            else:\n                output_tokens.extend(sub_tokens)\n        return output_tokens\n\n\ndef _is_whitespace(char):\n    \"\"\"Checks whether `chars` is a whitespace character.\"\"\"\n    # \\t, \\n, and \\r are technically contorl characters but we treat them\n    # as whitespace since they are generally considered as such.\n    if char == \" \" or char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return True\n    cat = unicodedata.category(char)\n    if cat == \"Zs\":\n        return True\n    return False\n\n\ndef _is_control(char):\n    \"\"\"Checks whether `chars` is a control character.\"\"\"\n    # These are technically control characters but we count them as whitespace\n    # characters.\n    if char == \"\\t\" or char == \"\\n\" or char == \"\\r\":\n        return False\n    cat = unicodedata.category(char)\n    if cat.startswith(\"C\"):\n        return True\n    return False\n\n\ndef _is_punctuation(char):\n    \"\"\"Checks whether `chars` is a punctuation character.\"\"\"\n    cp = ord(char)\n    # We treat all non-letter/number ASCII as punctuation.\n    # Characters such as \"^\", \"$\", and \"`\" are not in the Unicode\n    # Punctuation class but we treat them as punctuation anyways, for\n    # consistency.\n    if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):\n        return True\n    cat = unicodedata.category(char)\n    if cat.startswith(\"P\"):\n        return True\n    return False\n\n\nclass BertTokenizerFast(PreTrainedTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" BERT tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Bert tokenization is Based on WordPiece.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            File containing the vocabulary.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"[UNK]\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"[SEP]\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"[PAD]\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"[CLS]\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"[MASK]\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n        clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to clean the text before tokenization by removing any control characters and\n            replacing all whitespaces by the classic one.\n        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to tokenize Chinese characters.\n            This should likely be deactivated for Japanese:\n            see: https://github.com/huggingface/transformers/issues/328\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=True,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        mask_token=\"[MASK]\",\n        clean_text=True,\n        tokenize_chinese_chars=True,\n        strip_accents=True,\n        wordpieces_prefix=\"##\",\n        **kwargs\n    ):\n        super().__init__(\n            BertWordPieceTokenizer(\n                vocab_file=vocab_file,\n                unk_token=unk_token,\n                sep_token=sep_token,\n                cls_token=cls_token,\n                clean_text=clean_text,\n                handle_chinese_chars=tokenize_chinese_chars,\n                strip_accents=strip_accents,\n                lowercase=do_lower_case,\n                wordpieces_prefix=wordpieces_prefix,\n            ),\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        self.do_lower_case = do_lower_case\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.cls_token_id] + token_ids_0 + 
[self.sep_token_id]\n\n        if token_ids_1:\n            output += token_ids_1 + [self.sep_token_id]\n\n        return output\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        A BERT sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_bert_japanese.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes.\"\"\"\n\n\nimport collections\nimport logging\nimport os\nimport unicodedata\nfrom typing import Optional\n\nfrom .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer, load_vocab\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"cl-tohoku/bert-base-japanese\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/vocab.txt\",\n        \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": \"https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"cl-tohoku/bert-base-japanese\": 512,\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": 512,\n    \"cl-tohoku/bert-base-japanese-char\": 512,\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"cl-tohoku/bert-base-japanese\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"wordpiece\",\n    },\n    \"cl-tohoku/bert-base-japanese-char\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n    \"cl-tohoku/bert-base-japanese-char-whole-word-masking\": {\n        \"do_lower_case\": False,\n        \"word_tokenizer_type\": \"mecab\",\n        \"subword_tokenizer_type\": \"character\",\n    },\n}\n\n\nclass BertJapaneseTokenizer(BertTokenizer):\n    \"\"\"BERT tokenizer for Japanese text\"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        do_word_tokenize=True,\n        do_subword_tokenize=True,\n        word_tokenizer_type=\"basic\",\n        subword_tokenizer_type=\"wordpiece\",\n        never_split=None,\n        unk_token=\"[UNK]\",\n        sep_token=\"[SEP]\",\n        pad_token=\"[PAD]\",\n        cls_token=\"[CLS]\",\n        
mask_token=\"[MASK]\",\n        mecab_kwargs=None,\n        **kwargs\n    ):\n        \"\"\"Constructs a MecabBertTokenizer.\n\n        Args:\n            **vocab_file**: Path to a one-wordpiece-per-line vocabulary file.\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n                Only has an effect when do_basic_tokenize=True.\n            **do_word_tokenize**: (`optional`) boolean (default True)\n                Whether to do word tokenization.\n            **do_subword_tokenize**: (`optional`) boolean (default True)\n                Whether to do subword tokenization.\n            **word_tokenizer_type**: (`optional`) string (default \"basic\")\n                Type of word tokenizer.\n            **subword_tokenizer_type**: (`optional`) string (default \"wordpiece\")\n                Type of subword tokenizer.\n            **mecab_kwargs**: (`optional`) dict passed to `MecabTokenizer` constructor (default None)\n        \"\"\"\n        super(BertTokenizer, self).__init__(\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n        # ^^ We call the grandparent's init, not the parent's.\n\n        if not os.path.isfile(vocab_file):\n            raise ValueError(\n                \"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained \"\n                \"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`\".format(vocab_file)\n            )\n        self.vocab = load_vocab(vocab_file)\n        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])\n\n        self.do_word_tokenize = do_word_tokenize\n        if do_word_tokenize:\n            if word_tokenizer_type == \"basic\":\n                self.word_tokenizer = BasicTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=False\n                )\n            elif word_tokenizer_type == \"mecab\":\n                self.word_tokenizer = MecabTokenizer(\n                    do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})\n                )\n            else:\n                raise ValueError(\"Invalid word_tokenizer_type '{}' is specified.\".format(word_tokenizer_type))\n\n        self.do_subword_tokenize = do_subword_tokenize\n        if do_subword_tokenize:\n            if subword_tokenizer_type == \"wordpiece\":\n                self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            elif subword_tokenizer_type == \"character\":\n                self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token)\n            else:\n                raise ValueError(\"Invalid subword_tokenizer_type '{}' is specified.\".format(subword_tokenizer_type))\n\n    def _tokenize(self, text):\n        if self.do_word_tokenize:\n            tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)\n        else:\n            tokens = [text]\n\n        if self.do_subword_tokenize:\n            split_tokens = [sub_token for token in tokens for sub_token in self.subword_tokenizer.tokenize(token)]\n        else:\n            split_tokens = tokens\n\n        return split_tokens\n\n\nclass MecabTokenizer:\n    \"\"\"Runs basic tokenization with MeCab 
morphological parser.\"\"\"\n\n    def __init__(self, do_lower_case=False, never_split=None, normalize_text=True, mecab_option: Optional[str] = None):\n        \"\"\"Constructs a MecabTokenizer.\n\n        Args:\n            **do_lower_case**: (`optional`) boolean (default True)\n                Whether to lower case the input.\n            **never_split**: (`optional`) list of str\n                Kept for backward compatibility purposes.\n                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)\n                List of token not to split.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n            **mecab_option**: (`optional`) string passed to `MeCab.Tagger` constructor (default \"\")\n        \"\"\"\n        self.do_lower_case = do_lower_case\n        self.never_split = never_split if never_split is not None else []\n        self.normalize_text = normalize_text\n\n        import MeCab\n\n        self.mecab = MeCab.Tagger(mecab_option) if mecab_option is not None else MeCab.Tagger()\n\n    def tokenize(self, text, never_split=None, **kwargs):\n        \"\"\"Tokenizes a piece of text.\"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        never_split = self.never_split + (never_split if never_split is not None else [])\n        tokens = []\n\n        mecab_output = self.mecab.parse(text)\n\n        cursor = 0\n        for line in mecab_output.split(\"\\n\"):\n            if line == \"EOS\":\n                break\n\n            token, _ = line.split(\"\\t\")\n            token_start = text.index(token, cursor)\n            token_end = token_start + len(token)\n            if self.do_lower_case and token not in never_split:\n                token = token.lower()\n\n            tokens.append(token)\n            cursor = token_end\n\n        return tokens\n\n\nclass CharacterTokenizer(object):\n    \"\"\"Runs Character tokenziation.\"\"\"\n\n    def __init__(self, vocab, unk_token, normalize_text=True):\n        \"\"\"Constructs a CharacterTokenizer.\n\n        Args:\n            **vocab**:\n                Vocabulary object.\n            **unk_token**: str\n                A special symbol for out-of-vocabulary token.\n            **normalize_text**: (`optional`) boolean (default True)\n                Whether to apply unicode normalization to text before tokenization.\n        \"\"\"\n        self.vocab = vocab\n        self.unk_token = unk_token\n        self.normalize_text = normalize_text\n\n    def tokenize(self, text):\n        \"\"\"Tokenizes a piece of text into characters.\n\n        For example:\n            input = \"apple\"\n            output = [\"a\", \"p\", \"p\", \"l\", \"e\"]\n        Args:\n            text: A single token or whitespace separated tokens.\n                This should have already been passed through `BasicTokenizer`.\n        Returns:\n            A list of characters.\n        \"\"\"\n        if self.normalize_text:\n            text = unicodedata.normalize(\"NFKC\", text)\n\n        output_tokens = []\n        for i, char in enumerate(text):\n            if char not in self.vocab:\n                output_tokens.append(self.unk_token)\n                continue\n\n            output_tokens.append(char)\n\n        return output_tokens\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_camembert.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for Camembert model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nimport sentencepiece as spm\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"camembert-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"camembert-base\": None,\n}\n\nSHARED_MODEL_IDENTIFIERS = [\n    # Load with\n    # `tokenizer = AutoTokenizer.from_pretrained(\"username/pretrained_model\")`\n    \"Musixmatch/umberto-commoncrawl-cased-v1\",\n    \"Musixmatch/umberto-wikipedia-uncased-v1\",\n]\n\n\nclass CamembertTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). 
It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<s>NOTUSED\", \"</s>NOTUSED\"],\n        **kwargs\n    ):\n        super().__init__(\n            max_len=512,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n        # HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual\n        # sentencepiece vocabulary (this is the case for <s> and </s>\n        self.fairseq_tokens_to_ids = {\"<s>NOTUSED\": 0, \"<pad>\": 1, \"</s>NOTUSED\": 2, \"<unk>\": 3}\n        self.fairseq_offset = len(self.fairseq_tokens_to_ids)\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + len(self.fairseq_tokens_to_ids)\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A CamemBERT sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        CamemBERT, like RoBERTa, does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.fairseq_tokens_to_ids) + len(self.sp_model)\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        elif self.sp_model.PieceToId(token) == 0:\n            # Convert sentence piece unk token to fairseq unk token index\n            return self.unk_token_id\n        return self.fairseq_offset + self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_ctrl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Salesforce and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Salesforce CTRL.\"\"\"\n\n\nimport json\nimport logging\nimport os\n\nimport regex as re\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json\"},\n    \"merges_file\": {\"ctrl\": \"https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"ctrl\": 256,\n}\n\nCONTROL_CODES = {\n    \"Pregnancy\": 168629,\n    \"Christianity\": 7675,\n    \"Explain\": 106423,\n    \"Fitness\": 63440,\n    \"Saving\": 63163,\n    \"Ask\": 27171,\n    \"Ass\": 95985,\n    \"Joke\": 163509,\n    \"Questions\": 45622,\n    \"Thoughts\": 49605,\n    \"Retail\": 52342,\n    \"Feminism\": 164338,\n    \"Writing\": 11992,\n    \"Atheism\": 192263,\n    \"Netflix\": 48616,\n    \"Computing\": 39639,\n    \"Opinion\": 43213,\n    \"Alone\": 44967,\n    \"Funny\": 58917,\n    \"Gaming\": 40358,\n    \"Human\": 4088,\n    \"India\": 1331,\n    \"Joker\": 77138,\n    \"Diet\": 36206,\n    \"Legal\": 11859,\n    \"Norman\": 4939,\n    \"Tip\": 72689,\n    \"Weight\": 52343,\n    \"Movies\": 46273,\n    \"Running\": 23425,\n    \"Science\": 2090,\n    \"Horror\": 37793,\n    \"Confession\": 60572,\n    \"Finance\": 12250,\n    \"Politics\": 16360,\n    \"Scary\": 191985,\n    \"Support\": 12654,\n    \"Technologies\": 32516,\n    \"Teenage\": 66160,\n    \"Event\": 32769,\n    \"Learned\": 67460,\n    \"Notion\": 182770,\n    \"Wikipedia\": 37583,\n    \"Books\": 6665,\n    \"Extract\": 76050,\n    \"Confessions\": 102701,\n    \"Conspiracy\": 75932,\n    \"Links\": 63674,\n    \"Narcissus\": 150425,\n    \"Relationship\": 54766,\n    \"Relationships\": 134796,\n    \"Reviews\": 41671,\n    \"News\": 4256,\n    \"Translation\": 26820,\n    \"multilingual\": 128406,\n}\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n\n    pairs = set(pairs)\n    return pairs\n\n\nclass CTRLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs a CTRL tokenizer. Peculiarities:\n\n    - Byte-Pair-Encoding\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    control_codes = CONTROL_CODES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        word = tuple(list(word[:-1]) + [word[-1] + \"</w>\"])\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \"@@ \".join(word)\n        word = word[:-4]\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string.\n        \"\"\"\n        split_tokens = []\n\n        words = re.findall(r\"\\S+\\n?\", text)\n\n        for token in words:\n            split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \" \".join(tokens).replace(\"@@ \", \"\").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    # def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):\n    #     filtered_tokens = ' '.join(self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens))\n    #     tokens_generated_so_far = re.sub('(@@ )', '', string=filtered_tokens)\n    #     tokens_generated_so_far = re.sub('(@@ ?$)', '', string=tokens_generated_so_far)\n    #     return ''.join(tokens_generated_so_far)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_distilbert.py",
    "content": "# coding=utf-8\n# Copyright 2018 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for DistilBERT.\"\"\"\n\n\nimport logging\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"distilbert-base-uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\",\n        \"distilbert-base-uncased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt\",\n        \"distilbert-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt\",\n        \"distilbert-base-cased-distilled-squad\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt\",\n        \"distilbert-base-german-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-vocab.txt\",\n        \"distilbert-base-multilingual-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"distilbert-base-uncased\": 512,\n    \"distilbert-base-uncased-distilled-squad\": 512,\n    \"distilbert-base-cased\": 512,\n    \"distilbert-base-cased-distilled-squad\": 512,\n    \"distilbert-base-german-cased\": 512,\n    \"distilbert-base-multilingual-cased\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"distilbert-base-uncased\": {\"do_lower_case\": True},\n    \"distilbert-base-uncased-distilled-squad\": {\"do_lower_case\": True},\n    \"distilbert-base-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-cased-distilled-squad\": {\"do_lower_case\": False},\n    \"distilbert-base-german-cased\": {\"do_lower_case\": False},\n    \"distilbert-base-multilingual-cased\": {\"do_lower_case\": False},\n}\n\n\nclass DistilBertTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs a  DistilBertTokenizer.\n\n    :class:`~transformers1.DistilBertTokenizer is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n\n\nclass DistilBertTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a  \"Fast\" DistilBertTokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.DistilBertTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: 
punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    model_input_names = [\"attention_mask\"]\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_electra.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Google AI Team, Stanford University and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom .tokenization_bert import BertTokenizer, BertTokenizerFast\n\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"vocab.txt\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/electra-small-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/vocab.txt\",\n        \"google/electra-base-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/vocab.txt\",\n        \"google/electra-large-generator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/vocab.txt\",\n        \"google/electra-small-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/vocab.txt\",\n        \"google/electra-base-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/vocab.txt\",\n        \"google/electra-large-discriminator\": \"https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/vocab.txt\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/electra-small-generator\": 512,\n    \"google/electra-base-generator\": 512,\n    \"google/electra-large-generator\": 512,\n    \"google/electra-small-discriminator\": 512,\n    \"google/electra-base-discriminator\": 512,\n    \"google/electra-large-discriminator\": 512,\n}\n\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"google/electra-small-generator\": {\"do_lower_case\": True},\n    \"google/electra-base-generator\": {\"do_lower_case\": True},\n    \"google/electra-large-generator\": {\"do_lower_case\": True},\n    \"google/electra-small-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-base-discriminator\": {\"do_lower_case\": True},\n    \"google/electra-large-discriminator\": {\"do_lower_case\": True},\n}\n\n\nclass ElectraTokenizer(BertTokenizer):\n    r\"\"\"\n    Constructs an Electra tokenizer.\n    :class:`~transformers1.ElectraTokenizer` is identical to :class:`~transformers1.BertTokenizer` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    Refer to superclass :class:`~transformers1.BertTokenizer` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n\n\nclass ElectraTokenizerFast(BertTokenizerFast):\n    r\"\"\"\n    Constructs a \"Fast\" Electra Fast tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    :class:`~transformers1.ElectraTokenizerFast` is identical to :class:`~transformers1.BertTokenizerFast` and runs end-to-end\n    tokenization: punctuation splitting + wordpiece.\n\n    
Refer to superclass :class:`~transformers1.BertTokenizerFast` for usage examples and documentation concerning\n    parameters.\n    \"\"\"\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_flaubert.py",
    "content": "# coding=utf-8\n# Copyright 2019-present CNRS, Facebook Inc. and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for Flaubert, based on XLM.\"\"\"\n\n\nimport logging\nimport unicodedata\n\nimport six\n\nfrom .tokenization_xlm import XLMTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/vocab.json\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/vocab.json\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/vocab.json\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/vocab.json\",\n    },\n    \"merges_file\": {\n        \"flaubert/flaubert_small_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_small_cased/merges.txt\",\n        \"flaubert/flaubert_base_uncased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_uncased/merges.txt\",\n        \"flaubert/flaubert_base_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_base_cased/merges.txt\",\n        \"flaubert/flaubert_large_cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/flaubert/flaubert_large_cased/merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"flaubert/flaubert_small_cased\": 512,\n    \"flaubert/flaubert_base_uncased\": 512,\n    \"flaubert/flaubert_base_cased\": 512,\n    \"flaubert/flaubert_large_cased\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"flaubert/flaubert_small_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_base_uncased\": {\"do_lowercase\": True},\n    \"flaubert/flaubert_base_cased\": {\"do_lowercase\": False},\n    \"flaubert/flaubert_large_cased\": {\"do_lowercase\": False},\n}\n\n\ndef convert_to_unicode(text):\n    \"\"\"\n    Converts `text` to Unicode (if it's not already), assuming UTF-8 input.\n    \"\"\"\n    # six_ensure_text is copied from https://github.com/benjaminp/six\n    def six_ensure_text(s, encoding=\"utf-8\", errors=\"strict\"):\n        if isinstance(s, six.binary_type):\n            return s.decode(encoding, errors)\n        elif isinstance(s, six.text_type):\n            return s\n        else:\n            raise TypeError(\"not expecting type '%s'\" % type(s))\n\n    return six_ensure_text(text, encoding=\"utf-8\", errors=\"ignore\")\n\n\nclass FlaubertTokenizer(XLMTokenizer):\n    \"\"\"\n    BPE tokenizer for Flaubert\n\n    - Moses preprocessing & tokenization\n    - Normalize all inputs text\n    - argument ``special_tokens`` and function 
``set_special_tokens``, can be used to add additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `do_lowercase` controle lower casing (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.XLMTokenizer`. Please check the superclass for usage examples\n    and documentation regarding arguments.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, do_lowercase=False, **kwargs):\n        super().__init__(**kwargs)\n        self.do_lowercase = do_lowercase\n        self.do_lowercase_and_remove_accent = False\n\n    def preprocess_text(self, text):\n        text = text.replace(\"``\", '\"').replace(\"''\", '\"')\n        text = convert_to_unicode(text)\n        text = unicodedata.normalize(\"NFC\", text)\n\n        if self.do_lowercase:\n            text = text.lower()\n\n        return text\n\n    def _tokenize(self, text, bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code using Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n\n        Args:\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        lang = \"fr\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n\n        if bypass_tokenizer:\n            text = text.split()\n        else:\n            text = self.preprocess_text(text)\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.moses_tokenize(text, lang=lang)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_gpt2.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nfrom functools import lru_cache\n\nimport regex as re\nfrom tokenizers import ByteLevelBPETokenizer\n\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json\",\n    },\n    \"merges_file\": {\n        \"gpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\",\n        \"gpt2-medium\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt\",\n        \"gpt2-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt\",\n        \"gpt2-xl\": \"https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt\",\n        \"distilgpt2\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"gpt2\": 1024,\n    \"gpt2-medium\": 1024,\n    \"gpt2-large\": 1024,\n    \"gpt2-xl\": 1024,\n    \"distilgpt2\": 1024,\n}\n\n\n@lru_cache()\ndef bytes_to_unicode():\n    \"\"\"\n    Returns list of utf-8 byte and a mapping to unicode strings.\n    We specifically avoids mapping to whitespace/control characters the bpe code barfs on.\n\n    The reversible bpe codes work on unicode strings.\n    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.\n    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.\n    This is a signficant percentage of your normal, say, 32K bpe vocab.\n    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.\n    \"\"\"\n    bs = (\n        list(range(ord(\"!\"), ord(\"~\") + 1)) + list(range(ord(\"¡\"), ord(\"¬\") + 1)) + list(range(ord(\"®\"), ord(\"ÿ\") + 1))\n    )\n    cs = bs[:]\n    n = 0\n    for b in range(2 ** 8):\n        if b not in bs:\n            bs.append(b)\n            cs.append(2 ** 8 + n)\n            n += 1\n    cs = [chr(n) for n in cs]\n    return dict(zip(bs, cs))\n\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    Word is represented as tuple of symbols (symbols being variable-length strings).\n    
\"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\nclass GPT2Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n    GPT-2 BPE tokenizer. Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        **kwargs\n    ):\n        super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        self.errors = errors  # how to handle errors in decoding\n        self.byte_encoder = bytes_to_unicode()\n        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            bpe_merges = merges_handle.read().split(\"\\n\")[1:-1]\n        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]\n        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))\n        self.cache = {}\n\n        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions\n        self.pat = re.compile(r\"\"\"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+\"\"\")\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        if token in self.cache:\n            return self.cache[token]\n        word = tuple(token)\n        pairs = get_pairs(word)\n\n        if not 
pairs:\n            return token\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. \"\"\"\n        bpe_tokens = []\n        for token in re.findall(self.pat, text):\n            token = \"\".join(\n                self.byte_encoder[b] for b in token.encode(\"utf-8\")\n            )  # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)\n            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(\" \"))\n        return bpe_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        text = \"\".join(tokens)\n        text = bytearray([self.byte_decoder[c] for c in text]).decode(\"utf-8\", errors=self.errors)\n        return text\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        if \"add_prefix_space\" in kwargs and kwargs[\"add_prefix_space\"]:\n            return \" \" + text\n        return text\n\n\nclass GPT2TokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<|endoftext|>\",\n        bos_token=\"<|endoftext|>\",\n        eos_token=\"<|endoftext|>\",\n        add_prefix_space=False,\n        trim_offsets=True,\n        **kwargs\n    ):\n        super().__init__(\n            ByteLevelBPETokenizer(\n                vocab_file=vocab_file,\n                merges_file=merges_file,\n                add_prefix_space=add_prefix_space,\n                trim_offsets=trim_offsets,\n            ),\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_longformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport logging\n\nfrom .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\n\n# vocab and merges same as roberta\nvocab_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\"\nmerges_url = \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\"\n_all_longformer_models = [\n    \"allenai/longformer-base-4096\",\n    \"allenai/longformer-large-4096\",\n    \"allenai/longformer-large-4096-finetuned-triviaqa\",\n    \"allenai/longformer-base-4096-extra.pos.embd.only\",\n    \"allenai/longformer-large-4096-extra.pos.embd.only\",\n]\n\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"allenai/longformer-base-4096\": 4096,\n    \"allenai/longformer-large-4096\": 4096,\n    \"allenai/longformer-large-4096-finetuned-triviaqa\": 4096,\n    \"allenai/longformer-base-4096-extra.pos.embd.only\": 4096,\n    \"allenai/longformer-large-4096-extra.pos.embd.only\": 4096,\n}\n\n\nclass LongformerTokenizer(RobertaTokenizer):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n\n\nclass LongformerTokenizerFast(RobertaTokenizerFast):\n    # merges and vocab same as Roberta\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    pretrained_vocab_files_map = {\n        \"vocab_file\": {m: vocab_url for m in _all_longformer_models},\n        \"merges_file\": {m: merges_url for m in _all_longformer_models},\n    }\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_marian.py",
    "content": "import json\nimport re\nimport warnings\nfrom pathlib import Path\nfrom shutil import copyfile\nfrom typing import Dict, List, Optional, Tuple, Union\n\nimport sentencepiece\n\nfrom .file_utils import S3_BUCKET_PREFIX\nfrom .tokenization_utils import BatchEncoding, PreTrainedTokenizer\n\n\nvocab_files_names = {\n    \"source_spm\": \"source.spm\",\n    \"target_spm\": \"target.spm\",\n    \"vocab\": \"vocab.json\",\n    \"tokenizer_config_file\": \"tokenizer_config.json\",\n}\nMODEL_NAMES = (\"opus-mt-en-de\",)  # TODO(SS): delete this, the only required constant is vocab_files_names\nPRETRAINED_VOCAB_FILES_MAP = {\n    k: {m: f\"{S3_BUCKET_PREFIX}/Helsinki-NLP/{m}/{fname}\" for m in MODEL_NAMES}\n    for k, fname in vocab_files_names.items()\n}\n# Example URL https://s3.amazonaws.com/models.huggingface.co/bert/Helsinki-NLP/opus-mt-en-de/vocab.json\n\n\nclass MarianTokenizer(PreTrainedTokenizer):\n    \"\"\"Sentencepiece tokenizer for marian. Source and target languages have different SPM models.\n    The logic is use the relevant source_spm or target_spm to encode txt as pieces, then look up each piece in a vocab dictionary.\n\n    Examples::\n\n        from transformers1 import MarianTokenizer\n        tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')\n        src_texts = [ \"I am a small frog.\", \"Tom asked his teacher for advice.\"]\n        tgt_texts = [\"Ich bin ein kleiner Frosch.\", \"Tom bat seinen Lehrer um Rat.\"]  # optional\n        batch_enc: BatchEncoding = tok.prepare_translation_batch(src_texts, tgt_texts=tgt_texts)\n        # keys  [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask].\n        # model(**batch) should work\n    \"\"\"\n\n    vocab_files_names = vocab_files_names\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = {m: 512 for m in MODEL_NAMES}\n    model_input_names = [\"attention_mask\"]  # actually attention_mask, decoder_attention_mask\n    language_code_re = re.compile(\">>.+<<\")  # type: re.Pattern\n\n    def __init__(\n        self,\n        vocab=None,\n        source_spm=None,\n        target_spm=None,\n        source_lang=None,\n        target_lang=None,\n        unk_token=\"<unk>\",\n        eos_token=\"</s>\",\n        pad_token=\"<pad>\",\n        max_len=512,\n        **kwargs,\n    ):\n\n        super().__init__(\n            # bos_token=bos_token,  unused. 
Start decoding with config.decoder_start_token_id\n            max_len=max_len,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            **kwargs,\n        )\n        self.encoder = load_json(vocab)\n        if self.unk_token not in self.encoder:\n            raise KeyError(\"<unk> token must be in vocab\")\n        assert self.pad_token in self.encoder\n        self.decoder = {v: k for k, v in self.encoder.items()}\n\n        self.source_lang = source_lang\n        self.target_lang = target_lang\n        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(\">>\") and k.endswith(\"<<\")]\n        self.spm_files = [source_spm, target_spm]\n\n        # load SentencePiece model for pre-processing\n        self.spm_source = load_spm(source_spm)\n        self.spm_target = load_spm(target_spm)\n        self.current_spm = self.spm_source\n\n        # Multilingual target side: default to using first supported language code.\n\n        self._setup_normalizer()\n\n    def _setup_normalizer(self):\n        try:\n            from mosestokenizer import MosesPunctuationNormalizer\n\n            self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)\n        except ImportError:\n            warnings.warn(\"Recommended: pip install mosestokenizer\")\n            self.punc_normalizer = lambda x: x\n\n    def normalize(self, x: str) -> str:\n        \"\"\"Cover moses empty string edge case. They return empty list for '' input!\"\"\"\n        return self.punc_normalizer(x) if x else \"\"\n\n    def _convert_token_to_id(self, token):\n        return self.encoder.get(token, self.encoder[self.unk_token])\n\n    def remove_language_code(self, text: str):\n        \"\"\"Remove language codes like <<fr>> before sentencepiece\"\"\"\n        match = self.language_code_re.match(text)\n        code: list = [match.group(0)] if match else []\n        return code, self.language_code_re.sub(\"\", text)\n\n    def _tokenize(self, text: str) -> List[str]:\n        code, text = self.remove_language_code(text)\n        pieces = self.current_spm.EncodeAsPieces(text)\n        return code + pieces\n\n    def _convert_id_to_token(self, index: int) -> str:\n        \"\"\"Converts an index (integer) in a token (str) using the encoder.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\"Uses target language sentencepiece model\"\"\"\n        return self.spm_target.DecodePieces(tokens)\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:\n        \"\"\"Build model inputs from a sequence by appending eos_token_id.\"\"\"\n        if token_ids_1 is None:\n            return token_ids_0 + [self.eos_token_id]\n        # We don't expect to process pairs, but leave the pair logic for API consistency\n        return token_ids_0 + token_ids_1 + [self.eos_token_id]\n\n    def prepare_translation_batch(\n        self,\n        src_texts: List[str],\n        tgt_texts: Optional[List[str]] = None,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = True,\n        return_tensors: str = \"pt\",\n    ) -> BatchEncoding:\n        \"\"\"Prepare model inputs for translation. 
For best performance, translate one sentence at a time.\n        Arguments:\n            src_texts: list of src language texts\n            tgt_texts: list of tgt language texts\n            max_length: (None) defer to config (1024 for mbart-large-en-ro)\n            pad_to_max_length: (bool)\n            return_tensors: (str) default \"pt\" returns pytorch tensors, pass None to return lists.\n\n        Returns:\n            BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids,  decoder_attention_mask]\n            all shaped bs, seq_len. (BatchEncoding is a dict of string -> tensor or lists).\n            If no tgt_text is specified, the only keys will be input_ids and attention_mask.\n        \"\"\"\n        if \"\" in src_texts:\n            raise ValueError(f\"found empty string in src_texts: {src_texts}\")\n        self.current_spm = self.spm_source\n        src_texts = [self.normalize(t) for t in src_texts]  # this does not appear to do much\n        model_inputs: BatchEncoding = self.batch_encode_plus(\n            src_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        if tgt_texts is None:\n            return model_inputs\n\n        self.current_spm = self.spm_target\n        decoder_inputs: BatchEncoding = self.batch_encode_plus(\n            tgt_texts,\n            add_special_tokens=True,\n            return_tensors=return_tensors,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n        )\n        for k, v in decoder_inputs.items():\n            model_inputs[f\"decoder_{k}\"] = v\n        self.current_spm = self.spm_source\n        return model_inputs\n\n    @property\n    def vocab_size(self) -> int:\n        return len(self.encoder)\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        \"\"\"save vocab file to json and copy spm files from their original path.\"\"\"\n        save_dir = Path(save_directory)\n        assert save_dir.is_dir(), f\"{save_directory} should be a directory\"\n        save_json(self.encoder, save_dir / self.vocab_files_names[\"vocab\"])\n\n        for f in self.spm_files:\n            dest_path = save_dir / Path(f).name\n            if not dest_path.exists():\n                copyfile(f, save_dir / Path(f).name)\n        return tuple(save_dir / f for f in self.vocab_files_names)\n\n    def get_vocab(self) -> Dict:\n        vocab = self.encoder.copy()\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self) -> Dict:\n        state = self.__dict__.copy()\n        state.update({k: None for k in [\"spm_source\", \"spm_target\", \"current_spm\", \"punc_normalizer\"]})\n        return state\n\n    def __setstate__(self, d: Dict) -> None:\n        self.__dict__ = d\n        self.spm_source, self.spm_target = (load_spm(f) for f in self.spm_files)\n        self.current_spm = self.spm_source\n        self._setup_normalizer()\n\n    def num_special_tokens_to_add(self, **unused):\n        \"\"\"Just EOS\"\"\"\n        return 1\n\n    def _special_token_mask(self, seq):\n        all_special_ids = set(self.all_special_ids)  # call it once instead of inside list comp\n        all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special\n        return [1 if x in all_special_ids else 0 for x in seq]\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: 
Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"Get list where entries are [1] if a token is [eos] or [pad] else 0.\"\"\"\n        if already_has_special_tokens:\n            return self._special_token_mask(token_ids_0)\n        elif token_ids_1 is None:\n            return self._special_token_mask(token_ids_0) + [1]\n        else:\n            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]\n\n\ndef load_spm(path: str) -> sentencepiece.SentencePieceProcessor:\n    spm = sentencepiece.SentencePieceProcessor()\n    spm.Load(path)\n    return spm\n\n\ndef save_json(data, path: str) -> None:\n    with open(path, \"w\") as f:\n        json.dump(data, f, indent=2)\n\n\ndef load_json(path: str) -> Union[Dict, List]:\n    with open(path, \"r\") as f:\n        return json.load(f)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_openai.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for OpenAI GPT.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\n\nfrom tokenizers import CharBPETokenizer\n\nfrom .tokenization_bert import BasicTokenizer\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json\"},\n    \"merges_file\": {\"openai-gpt\": \"https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt\"},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"openai-gpt\": 512,\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef text_standardize(text):\n    \"\"\"\n    fixes some issues the spacy tokenizer had on books corpus\n    also does some whitespace standardization\n    \"\"\"\n    text = text.replace(\"—\", \"-\")\n    text = text.replace(\"–\", \"-\")\n    text = text.replace(\"―\", \"-\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"´\", \"'\")\n    text = re.sub(r\"\"\"(-+|~+|!+|\"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)\"\"\", r\" \\1 \", text)\n    text = re.sub(r\"\\s*\\n\\s*\", \" \\n \", text)\n    text = re.sub(r\"[^\\S\\n]+\", \" \", text)\n    return text.strip()\n\n\nclass OpenAIGPTTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer. Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        super().__init__(unk_token=unk_token, **kwargs)\n\n        try:\n            import ftfy\n            from spacy.lang.en import English\n\n            _nlp = English()\n            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)\n            self.fix_text = ftfy.fix_text\n        except ImportError:\n            logger.warning(\"ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.\")\n            self.nlp = BasicTokenizer(do_lower_case=True)\n            self.fix_text = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[1:-1]\n        merges = [tuple(merge.split()) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text):\n        \"\"\" Tokenize a string. 
\"\"\"\n        split_tokens = []\n        if self.fix_text is None:\n            # Using BERT's BasicTokenizer\n            text = self.nlp.tokenize(text)\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n        else:\n            # Using SpaCy & ftfy (original tokenization process of OpenAI GPT)\n            text = self.nlp(text_standardize(self.fix_text(text)))\n            for token in text:\n                split_tokens.extend([t for t in self.bpe(token.text.lower()).split(\" \")])\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            writer.write(\"#version: 0.2\\n\")\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n\n\nclass OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" BPE tokenizer for OpenAI GPT (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - lower case all inputs\n    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(self, vocab_file, merges_file, unk_token=\"<unk>\", **kwargs):\n        kwargs.setdefault(\"unk_token\", unk_token)\n        super().__init__(\n            CharBPETokenizer(vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token, lowercase=True),\n            **kwargs,\n        )\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_reformer.py",
    "content": "# coding=utf-8\n# Copyright 2020 The Trax Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model Reformer.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"google/reformer-crime-and-punishment\": \"https://cdn.huggingface.co/google/reformer-crime-and-punishment/spiece.model\"\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"google/reformer-crime-and-punishment\": 524288,\n}\n\n\nclass ReformerTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an Reformer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        additional_special_tokens=[],\n        **kwargs\n    ):\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size()\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use ReformerTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for RoBERTa.\"\"\"\n\n\nimport logging\nfrom typing import List, Optional\n\nfrom tokenizers import AddedToken\nfrom tokenizers.processors import RobertaProcessing\n\nfrom .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-vocab.json\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json\",\n    },\n    \"merges_file\": {\n        \"roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n        \"roberta-large-mnli\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt\",\n        \"distilroberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-merges.txt\",\n        \"roberta-base-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt\",\n        \"roberta-large-openai-detector\": \"https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt\",\n    },\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"roberta-base\": 512,\n    \"roberta-large\": 512,\n    \"roberta-large-mnli\": 512,\n    \"distilroberta-base\": 512,\n    \"roberta-base-openai-detector\": 512,\n    \"roberta-large-openai-detector\": 512,\n}\n\n\nclass RobertaTokenizer(GPT2Tokenizer):\n    \"\"\"\n    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. 
Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            errors=errors,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A RoBERTa sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formatted with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    def prepare_for_tokenization(self, text, add_special_tokens=False, **kwargs):\n        if \"add_prefix_space\" in kwargs:\n            add_prefix_space = kwargs[\"add_prefix_space\"]\n        else:\n            add_prefix_space = add_special_tokens\n        if add_prefix_space and not text[0].isspace():\n            text = \" \" + text\n        return text\n\n\nclass RobertaTokenizerFast(GPT2TokenizerFast):\n    \"\"\"\n    Constructs a \"Fast\" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    Peculiarities:\n\n    - Byte-level Byte-Pair-Encoding\n    - Requires a space to start the input string => the encoding methods should be called with the\n      ``add_prefix_space`` flag set to ``True``.\n      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve\n      the absence of a space at the beginning of a string:\n\n    ::\n\n        tokenizer.decode(tokenizer.encode(\"Hello\")) = \" Hello\"\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        merges_file (:obj:`str`):\n            Path to the merges file.\n        errors (:obj:`str`, `optional`, defaults to \"replace\"):\n            Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode\n            <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.\n        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The beginning of sequence token.\n        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):\n            The end of sequence token.\n        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):\n            Whether to add a leading space to the first word.\n            This allows to treat the leading word just as any other word.\n            (GPT2 tokenizer detect beginning of words by the preceeding space)\n        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):\n            Whether the post processing step should trim offsets to avoid including whitespaces.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        errors=\"replace\",\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        add_prefix_space=True,\n        trim_offsets=True,\n        **kwargs\n    ):\n        kwargs.setdefault(\"pad_token\", pad_token)\n        kwargs.setdefault(\"sep_token\", sep_token)\n        kwargs.setdefault(\"cls_token\", cls_token)\n        kwargs.setdefault(\"mask_token\", mask_token)\n\n        super().__init__(\n            vocab_file=vocab_file,\n            merges_file=merges_file,\n            unk_token=unk_token,\n            bos_token=bos_token,\n            eos_token=eos_token,\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n            **kwargs,\n        )\n\n        self.backend_tokenizer._tokenizer.post_processor = RobertaProcessing(\n            sep=(sep_token, self.sep_token_id),\n            cls=(cls_token, self.cls_token_id),\n            add_prefix_space=add_prefix_space,\n            trim_offsets=trim_offsets,\n        )\n\n        self.backend_tokenizer.add_special_tokens([kwargs[\"mask_token\"]])\n\n    @PreTrainedTokenizer.mask_token.setter\n    def mask_token(self, value):\n        if not isinstance(value, AddedToken):\n            value = AddedToken(value, lstrip=True)\n\n        self._mask_token = str(value)\n        self._maybe_update_backend([value])\n\n    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):\n        output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]\n        if token_ids_1 is None:\n            return output\n\n        return output + [self.eos_token_id] + token_ids_1 + [self.eos_token_id]\n\n    def 
create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        RoBERTa does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_t5.py",
    "content": "# coding=utf-8\n# Copyright 2018 T5 Authors and HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization class for model T5.\"\"\"\n\n\nimport logging\nimport os\nimport re\nfrom shutil import copyfile\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nSPIECE_UNDERLINE = \"▁\"\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to file names for serializing Tokenizer instances\n####################################################\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\n####################################################\n# Mapping from the keyword arguments names of Tokenizer `__init__`\n# to pretrained vocabulary URL for all the model shortcut names.\n####################################################\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"t5-small\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-3b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n        \"t5-11b\": \"https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model\",\n    }\n}\n\n####################################################\n# Mapping from model shortcut names to max length of inputs\n####################################################\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"t5-small\": 512,\n    \"t5-base\": 512,\n    \"t5-large\": 512,\n    \"t5-3b\": 512,\n    \"t5-11b\": 512,\n}\n\n\nclass T5Tokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .\n\n        This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n        should refer to the superclass for more information regarding methods.\n\n        Args:\n            vocab_file (:obj:`string`):\n                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that\n                contains the vocabulary necessary to instantiate a tokenizer.\n            eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n                The end of sequence token.\n\n                .. note::\n\n                    When building a sequence using special tokens, this is not the token that is used for the end\n                    of sequence. The token used is the :obj:`sep_token`.\n            unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n                The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n                token instead.\n            pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n                The token used for padding, for example when batching sequences of different lengths.\n            extra_ids (:obj:`List[str]`, `optional`, defaults to :obj:`100`):\n                Add a number of extra ids added to the end of the vocabulary for use as sentinels.\n                These tokens are accessible as \"<extra_id_{%d}>\" where \"{%d}\" is a number between 0 and extra_ids-1.\n                Extra tokens are indexed from the end of the vocabulary up to beginnning (\"<extra_id_0>\" is the last token in the vocabulary like in T5 preprocessing\n                see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)\n            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):\n                Additional special tokens used by the tokenizer.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        extra_ids=100,\n        additional_special_tokens=None,\n        **kwargs\n    ):\n        # Add extra_ids to the special token list\n        if extra_ids > 0:\n            if additional_special_tokens is None:\n                additional_special_tokens = []\n            additional_special_tokens.extend([\"<extra_id_{}>\".format(i) for i in range(extra_ids)])\n\n        super().__init__(\n            eos_token=eos_token,\n            unk_token=unk_token,\n            pad_token=pad_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer:\"\n                \"https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.vocab_file = vocab_file\n        self._extra_ids = extra_ids\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return self.sp_model.get_piece_size() + self._extra_ids\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use T5Tokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Take as input a string and return a list of strings (tokens) 
for words/sub-words\n        \"\"\"\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        return pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if token.startswith(\"<extra_id_\"):\n            match = re.match(r\"<extra_id_(\\d+)>\", token)\n            num = int(match.group(1))\n            return self.vocab_size - num - 1\n        return self.sp_model.piece_to_id(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index < self.sp_model.get_piece_size():\n            token = self.sp_model.IdToPiece(index)\n        else:\n            token = \"<extra_id_{}>\".format(self.vocab_size - 1 - index)\n        return token\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = self.sp_model.decode_pieces(tokens)\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\" Save the sentencepiece vocabulary (copy original file) and special tokens file\n            to a directory.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_transfo_xl.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for Transformer XL model.\n    Adapted from https://github.com/kimiyoung/transformer-xl.\n\"\"\"\n\n\nimport glob\nimport logging\nimport os\nimport pickle\nimport re\nfrom collections import Counter, OrderedDict\nfrom typing import Optional\n\nimport numpy as np\nfrom tokenizers import Tokenizer\nfrom tokenizers.implementations import BaseTokenizer\nfrom tokenizers.models import WordLevel\nfrom tokenizers.normalizers import Lowercase, Sequence, Strip, unicode_normalizer_from_str\nfrom tokenizers.pre_tokenizers import CharDelimiterSplit, WhitespaceSplit\nfrom tokenizers.processors import BertProcessing\n\nfrom .file_utils import cached_path, is_torch_available\nfrom .tokenization_utils import PreTrainedTokenizer, PreTrainedTokenizerFast\n\n\nif is_torch_available():\n    import torch\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"pretrained_vocab_file\": \"vocab.bin\", \"vocab_file\": \"vocab.txt\"}\nVOCAB_FILES_NAMES_FAST = {\"pretrained_vocab_file\": \"vocab.json\", \"vocab_file\": \"vocab.json\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin\",\n    }\n}\n\nPRETRAINED_VOCAB_FILES_MAP_FAST = {\n    \"pretrained_vocab_file\": {\n        \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.json\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"transfo-xl-wt103\": None,\n}\n\nPRETRAINED_CORPUS_ARCHIVE_MAP = {\n    \"transfo-xl-wt103\": \"https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin\",\n}\nCORPUS_NAME = \"corpus.bin\"\n\n\nclass TransfoXLTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token, eos_token=eos_token, additional_special_tokens=additional_special_tokens, **kwargs\n        )\n\n        if never_split is None:\n            never_split = self.all_special_tokens\n        if special is None:\n            special = []\n        self.counter = Counter()\n        self.special = special\n        self.min_freq = min_freq\n        self.max_size = max_size\n        self.lower_case = lower_case\n        self.delimiter = delimiter\n        self.vocab_file = vocab_file\n        self.never_split = never_split\n        self.punctuation_symbols = '!\"#$%&()*+,-./\\:;<=>?@[\\\\]^_`{|}~'  # noqa: W605\n        self.punction_without_space_before_pattern = re.compile(r\"[^\\s][{}]\".format(self.punctuation_symbols))\n        self.punctuation_with_space_around_pattern = self._compile_space_around_punctuation_pattern()\n\n        try:\n            if pretrained_vocab_file is not None:\n                # Hack because, honestly this tokenizer was not made to be used\n                # in a library like ours, at all.\n                vocab_dict = torch.load(pretrained_vocab_file)\n                for key, value in vocab_dict.items():\n                    if key not in self.__dict__:\n                        self.__dict__[key] = value\n\n            if vocab_file is not None:\n                self.build_vocab()\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizerFast,\"\n                \"please note they are not compatible.\".format(pretrained_vocab_file)\n            )\n\n        if vocab_file is not None:\n            self.build_vocab()\n\n    def _compile_space_around_punctuation_pattern(self):\n        look_ahead_for_special_token = \"(?=[{}])\".format(self.punctuation_symbols)\n        look_ahead_to_match_all_except_space = \"(?=[^\\s])\"  # noqa: W605\n        return re.compile(r\"\" + look_ahead_for_special_token + look_ahead_to_match_all_except_space)\n\n    def count_file(self, path, verbose=False, add_eos=False):\n        if verbose:\n            logger.info(\"counting file {} ...\".format(path))\n        assert os.path.exists(path)\n\n        sents = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos)\n                self.counter.update(symbols)\n                sents.append(symbols)\n\n        return sents\n\n    def count_sents(self, sents, verbose=False):\n        \"\"\"\n            sents : a list of sentences, each a list of tokenized symbols\n        \"\"\"\n        if verbose:\n            logger.info(\"counting {} sents ...\".format(len(sents)))\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            self.counter.update(symbols)\n\n    def _build_from_file(self, vocab_file):\n        self.idx2sym = []\n        self.sym2idx = OrderedDict()\n\n        with open(vocab_file, \"r\", encoding=\"utf-8\") as f:\n            for line in f:\n                symb = line.strip().split()[0]\n                self.add_symbol(symb)\n        if \"<UNK>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<UNK>\"]\n        elif \"<unk>\" in self.sym2idx:\n            self.unk_idx = self.sym2idx[\"<unk>\"]\n        else:\n            raise ValueError(\"No <unkown> token in vocabulary\")\n\n    def save_vocabulary(self, vocab_path):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            vocab_path (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n\n        logger.warning(\n            \"Please note you will not be able to load the save vocabulary in\"\n            \" Rust-based TransfoXLTokenizerFast as they don't share the same structure.\"\n        )\n\n        if os.path.isdir(vocab_path):\n            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES[\"pretrained_vocab_file\"])\n        else:\n            vocab_file = vocab_path\n        torch.save(self.__dict__, vocab_file)\n        return (vocab_file,)\n\n    def build_vocab(self):\n        if self.vocab_file:\n            logger.info(\"building vocab from {}\".format(self.vocab_file))\n            self._build_from_file(self.vocab_file)\n            logger.info(\"final vocab size {}\".format(len(self)))\n        else:\n            logger.info(\"building vocab with min_freq={}, max_size={}\".format(self.min_freq, self.max_size))\n            self.idx2sym = []\n            self.sym2idx = OrderedDict()\n\n            for sym in self.special:\n                
self.add_special(sym)\n\n            for sym, cnt in self.counter.most_common(self.max_size):\n                if cnt < self.min_freq:\n                    break\n                self.add_symbol(sym)\n\n            logger.info(\"final vocab size {} from {} unique tokens\".format(len(self), len(self.counter)))\n\n    def encode_file(self, path, ordered=False, verbose=False, add_eos=True, add_double_eos=False):\n        if verbose:\n            logger.info(\"encoding file {} ...\".format(path))\n        assert os.path.exists(path)\n        encoded = []\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            for idx, line in enumerate(f):\n                if verbose and idx > 0 and idx % 500000 == 0:\n                    logger.info(\"    line {}\".format(idx))\n                symbols = self.tokenize(line, add_eos=add_eos, add_double_eos=add_double_eos)\n                encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def encode_sents(self, sents, ordered=False, verbose=False):\n        if verbose:\n            logger.info(\"encoding {} sents ...\".format(len(sents)))\n        encoded = []\n        for idx, symbols in enumerate(sents):\n            if verbose and idx > 0 and idx % 500000 == 0:\n                logger.info(\"    line {}\".format(idx))\n            encoded.append(self.convert_to_tensor(symbols))\n\n        if ordered:\n            encoded = torch.cat(encoded)\n\n        return encoded\n\n    def add_special(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n            setattr(self, \"{}_idx\".format(sym.strip(\"<>\")), self.sym2idx[sym])\n\n    def add_symbol(self, sym):\n        if sym not in self.sym2idx:\n            self.idx2sym.append(sym)\n            self.sym2idx[sym] = len(self.idx2sym) - 1\n\n    def _convert_id_to_token(self, idx):\n        \"\"\"Converts an id in a token (BPE) using the vocab.\"\"\"\n        assert 0 <= idx < len(self), \"Index {} out of vocabulary range\".format(idx)\n        return self.idx2sym[idx]\n\n    def _convert_token_to_id(self, sym):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        if sym in self.sym2idx:\n            return self.sym2idx[sym]\n        else:\n            # logger.info('encounter unk {}'.format(sym))\n            # assert '<eos>' not in sym\n            if hasattr(self, \"unk_idx\"):\n                return self.sym2idx.get(sym, self.unk_idx)\n            # Backward compatibility with pre-trained models\n            elif \"<unk>\" in self.sym2idx:\n                return self.sym2idx[\"<unk>\"]\n            elif \"<UNK>\" in self.sym2idx:\n                return self.sym2idx[\"<UNK>\"]\n            else:\n                raise ValueError(\"Token not in vocabulary and no <unk> token in vocabulary for replacement\")\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. 
\"\"\"\n        out_string = \" \".join(tokens).strip()\n        return out_string\n\n    def convert_to_tensor(self, symbols):\n        return torch.LongTensor(self.convert_tokens_to_ids(symbols))\n\n    @property\n    def vocab_size(self):\n        return len(self.idx2sym)\n\n    def get_vocab(self):\n        return dict(self.sym2idx, **self.added_tokens_encoder)\n\n    def _tokenize(self, line, add_eos=False, add_double_eos=False):\n        line = line.strip()\n        # convert to lower case\n        if self.lower_case:\n            line = line.lower()\n\n        # empty delimiter '' will evaluate False\n        if self.delimiter == \"\":\n            symbols = line\n        else:\n            symbols = line.split(self.delimiter)\n\n        if add_double_eos:  # lm1b\n            return [\"<S>\"] + symbols + [\"<S>\"]\n        elif add_eos:\n            return symbols + [\"<eos>\"]\n        else:\n            return symbols\n\n    def prepare_for_tokenization(self, text, **kwargs):\n        # add spaces before punctuation symbols as should be done in transfo-xl\n\n        if \"add_space_before_punct_symbol\" in kwargs and kwargs[\"add_space_before_punct_symbol\"]:\n            text = self.punctuation_with_space_around_pattern.sub(r\" \", text)\n        elif self.punction_without_space_before_pattern.search(text):\n            # searches until the first occurence of a punctuation symbol without surrounding spaces\n            logger.warning(\n                \"You might want to consider setting `add_space_before_punct_symbol=True` as an argument to the `tokenizer.encode()` to avoid tokenizing words with punctuation symbols to the `<unk>` token\"\n            )\n\n        return text\n\n\nclass _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):\n    def __init__(\n        self,\n        vocab_file,\n        delimiter,\n        lowercase,\n        unk_token,\n        eos_token,\n        add_eos=False,\n        add_double_eos=False,\n        normalization: Optional[str] = None,\n    ):\n\n        try:\n            tokenizer = WordLevel(vocab_file, unk_token=unk_token)\n            tokenizer = Tokenizer(tokenizer)\n        except Exception:\n            raise ValueError(\n                \"Unable to parse file {}. Unknown format. 
\"\n                \"If you tried to load a model saved through TransfoXLTokenizer,\"\n                \"please note they are not compatible.\".format(vocab_file)\n            )\n\n        # Create the correct normalization path\n        normalizer = []\n\n        # Include unicode normalization\n        if normalization:\n            normalizer += [unicode_normalizer_from_str(normalization)]\n\n        # Include case normalization\n        if lowercase:\n            normalizer += [Lowercase()]\n\n        # Strip normalizer at the end\n        normalizer += [Strip(left=True, right=True)]\n\n        if len(normalizer) > 0:\n            tokenizer.normalizer = Sequence(normalizer) if len(normalizer) > 1 else normalizer[0]\n\n        # Setup the splitter\n        tokenizer.pre_tokenizer = CharDelimiterSplit(delimiter) if delimiter else WhitespaceSplit()\n\n        if add_double_eos:\n            tokenizer.post_processor = BertProcessing(\n                (eos_token, tokenizer.token_to_id(eos_token)), (eos_token, tokenizer.token_to_id(eos_token))\n            )\n\n        parameters = {\n            \"model\": \"TransfoXLModel\",\n            \"add_eos\": add_eos,\n            \"add_double_eos\": add_double_eos,\n            \"unk_token\": unk_token,\n            \"eos_token\": eos_token,\n            \"delimiter\": delimiter,\n            \"lowercase\": lowercase,\n        }\n\n        super().__init__(tokenizer, parameters)\n\n\nclass TransfoXLTokenizerFast(PreTrainedTokenizerFast):\n    \"\"\"\n    Construct a \"Fast\" Transformer-XL tokenizer (backed by HuggingFace's `tokenizers` library).\n\n    The Transformer-XL tokenizer is a word-level tokenizer (no sub-word tokenization).\n\n    Adapted from Vocab class in https://github.com/kimiyoung/transformer-xl\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizerFast` which contains most of the methods. 
Users\n    should refer to the superclass for more information regarding methods.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES_FAST\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP_FAST\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = []\n\n    def __init__(\n        self,\n        special=None,\n        min_freq=0,\n        max_size=None,\n        lower_case=False,\n        delimiter=None,\n        vocab_file=None,\n        pretrained_vocab_file=None,\n        never_split=None,\n        unk_token=\"<unk>\",\n        eos_token=\"<eos>\",\n        additional_special_tokens=[\"<formula>\"],\n        add_eos=False,\n        add_double_eos=False,\n        normalization=None,\n        **kwargs\n    ):\n\n        super().__init__(\n            _TransfoXLDelimiterLookupTokenizer(\n                vocab_file=vocab_file or pretrained_vocab_file,\n                delimiter=delimiter,\n                lowercase=lower_case,\n                unk_token=unk_token,\n                eos_token=eos_token,\n                add_eos=add_eos,\n                add_double_eos=add_double_eos,\n                normalization=normalization,\n            ),\n            unk_token=unk_token,\n            eos_token=eos_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n    def save_pretrained(self, save_directory):\n        logger.warning(\n            \"Please note you will not be able to load the vocabulary in\"\n            \" Python-based TransfoXLTokenizer as they don't share the same structure.\"\n        )\n\n        return super().save_pretrained(save_directory)\n\n\nclass LMOrderedIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None):\n        \"\"\"\n            data -- LongTensor -- the LongTensor is strictly ordered\n        \"\"\"\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n\n        # Work out how cleanly we can divide the dataset into bsz parts.\n        self.n_step = data.size(0) // bsz\n\n        # Trim off any extra elements that wouldn't cleanly fit (remainders).\n        data = data.narrow(0, 0, self.n_step * bsz)\n\n        # Evenly divide the data across the bsz batches.\n        self.data = data.view(bsz, -1).t().contiguous().to(device)\n\n        # Number of mini-batches\n        self.n_batch = (self.n_step + self.bptt - 1) // self.bptt\n\n    def get_batch(self, i, bptt=None):\n        if bptt is None:\n            bptt = self.bptt\n        seq_len = min(bptt, self.data.size(0) - 1 - i)\n\n        end_idx = i + seq_len\n        beg_idx = max(0, i - self.ext_len)\n\n        data = self.data[beg_idx:end_idx]\n        target = self.data[i + 1 : i + 1 + seq_len]\n\n        data_out = data.transpose(0, 1).contiguous().to(self.device)\n        target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n        return data_out, target_out, seq_len\n\n    def get_fixlen_iter(self, start=0):\n        for i in range(start, self.data.size(0) - 1, self.bptt):\n            yield self.get_batch(i)\n\n    def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):\n        max_len = self.bptt + max_deviation * std\n        i = start\n        while True:\n            bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.0\n            bptt = min(max_len, max(min_len, int(np.random.normal(bptt, std))))\n            data, 
target, seq_len = self.get_batch(i, bptt)\n            i += seq_len\n            yield data, target, seq_len\n            if i >= self.data.size(0) - 2:\n                break\n\n    def __iter__(self):\n        return self.get_fixlen_iter()\n\n\nclass LMShuffledIterator(object):\n    def __init__(self, data, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n        \"\"\"\n            data -- list[LongTensor] -- there is no order among the LongTensors\n        \"\"\"\n        self.data = data\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self):\n        # index iterator\n        epoch_indices = np.random.permutation(len(self.data)) if self.shuffle else np.array(range(len(self.data)))\n\n        # sentence iterator\n        for idx in epoch_indices:\n            yield self.data[idx]\n\n    def stream_iterator(self, sent_stream):\n        # streams for each data in the batch\n        streams = [None] * self.bsz\n\n        data = torch.LongTensor(self.bptt, self.bsz)\n        target = torch.LongTensor(self.bptt, self.bsz)\n\n        n_retain = 0\n\n        while True:\n            # data   : [n_retain+bptt x bsz]\n            # target : [bptt x bsz]\n            data[n_retain:].fill_(-1)\n            target.fill_(-1)\n\n            valid_batch = True\n\n            for i in range(self.bsz):\n                n_filled = 0\n                try:\n                    while n_filled < self.bptt:\n                        if streams[i] is None or len(streams[i]) <= 1:\n                            streams[i] = next(sent_stream)\n                        # number of new tokens to fill in\n                        n_new = min(len(streams[i]) - 1, self.bptt - n_filled)\n                        # first n_retain tokens are retained from last batch\n                        data[n_retain + n_filled : n_retain + n_filled + n_new, i] = streams[i][:n_new]\n                        target[n_filled : n_filled + n_new, i] = streams[i][1 : n_new + 1]\n                        streams[i] = streams[i][n_new:]\n                        n_filled += n_new\n                except StopIteration:\n                    valid_batch = False\n                    break\n\n            if not valid_batch:\n                return\n\n            data_out = data.transpose(0, 1).contiguous().to(self.device)\n            target_out = target.transpose(0, 1).contiguous().to(self.device)\n\n            yield data_out, target_out, self.bptt\n\n            n_retain = min(data.size(0), self.ext_len)\n            if n_retain > 0:\n                data[:n_retain] = data[-n_retain:]\n            data.resize_(n_retain + self.bptt, data.size(1))\n\n    def __iter__(self):\n        # sent_stream is an iterator\n        sent_stream = self.get_sent_stream()\n\n        for batch in self.stream_iterator(sent_stream):\n            yield batch\n\n\nclass LMMultiFileIterator(LMShuffledIterator):\n    def __init__(self, paths, vocab, bsz, bptt, device=\"cpu\", ext_len=None, shuffle=False):\n\n        self.paths = paths\n        self.vocab = vocab\n\n        self.bsz = bsz\n        self.bptt = bptt\n        self.ext_len = ext_len if ext_len is not None else 0\n\n        self.device = device\n        self.shuffle = shuffle\n\n    def get_sent_stream(self, path):\n        sents = self.vocab.encode_file(path, add_double_eos=True)\n        if self.shuffle:\n            np.random.shuffle(sents)\n      
  sent_stream = iter(sents)\n\n        return sent_stream\n\n    def __iter__(self):\n        if self.shuffle:\n            np.random.shuffle(self.paths)\n\n        for path in self.paths:\n            # sent_stream is an iterator\n            sent_stream = self.get_sent_stream(path)\n            for batch in self.stream_iterator(sent_stream):\n                yield batch\n\n\nclass TransfoXLCorpus(object):\n    @classmethod\n    def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):\n        \"\"\"\n        Instantiate a pre-processed corpus.\n        \"\"\"\n        vocab = TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)\n        if pretrained_model_name_or_path in PRETRAINED_CORPUS_ARCHIVE_MAP:\n            corpus_file = PRETRAINED_CORPUS_ARCHIVE_MAP[pretrained_model_name_or_path]\n        else:\n            corpus_file = os.path.join(pretrained_model_name_or_path, CORPUS_NAME)\n        # redirect to the cache, if necessary\n        try:\n            resolved_corpus_file = cached_path(corpus_file, cache_dir=cache_dir)\n        except EnvironmentError:\n            logger.error(\n                \"Corpus '{}' was not found in corpus list ({}). \"\n                \"We assumed '{}' was a path or url but couldn't find files {} \"\n                \"at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(PRETRAINED_CORPUS_ARCHIVE_MAP.keys()),\n                    pretrained_model_name_or_path,\n                    corpus_file,\n                )\n            )\n            return None\n        if resolved_corpus_file == corpus_file:\n            logger.info(\"loading corpus file {}\".format(corpus_file))\n        else:\n            logger.info(\"loading corpus file {} from cache at {}\".format(corpus_file, resolved_corpus_file))\n\n        # Instantiate tokenizer.\n        corpus = cls(*inputs, **kwargs)\n        corpus_dict = torch.load(resolved_corpus_file)\n        for key, value in corpus_dict.items():\n            corpus.__dict__[key] = value\n        corpus.vocab = vocab\n        if corpus.train is not None:\n            corpus.train = torch.tensor(corpus.train, dtype=torch.long)\n        if corpus.valid is not None:\n            corpus.valid = torch.tensor(corpus.valid, dtype=torch.long)\n        if corpus.test is not None:\n            corpus.test = torch.tensor(corpus.test, dtype=torch.long)\n        return corpus\n\n    def __init__(self, *args, **kwargs):\n        self.vocab = TransfoXLTokenizer(*args, **kwargs)\n        self.dataset = None\n        self.train = None\n        self.valid = None\n        self.test = None\n\n    def build_corpus(self, path, dataset):\n        self.dataset = dataset\n\n        if self.dataset in [\"ptb\", \"wt2\", \"enwik8\", \"text8\"]:\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n            self.vocab.count_file(os.path.join(path, \"valid.txt\"))\n            self.vocab.count_file(os.path.join(path, \"test.txt\"))\n        elif self.dataset == \"wt103\":\n            self.vocab.count_file(os.path.join(path, \"train.txt\"))\n        elif self.dataset == \"lm1b\":\n            train_path_pattern = os.path.join(\n                path,\n                \"1-billion-word-language-modeling-benchmark-r13output\",\n                \"training-monolingual.tokenized.shuffled\",\n                \"news.en-*\",\n            )\n            train_paths = glob.glob(train_path_pattern)\n            # the 
vocab will load from file when build_vocab() is called\n\n        self.vocab.build_vocab()\n\n        if self.dataset in [\"ptb\", \"wt2\", \"wt103\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True)\n        elif self.dataset in [\"enwik8\", \"text8\"]:\n            self.train = self.vocab.encode_file(os.path.join(path, \"train.txt\"), ordered=True, add_eos=False)\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=True, add_eos=False)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=True, add_eos=False)\n        elif self.dataset == \"lm1b\":\n            self.train = train_paths\n            self.valid = self.vocab.encode_file(os.path.join(path, \"valid.txt\"), ordered=False, add_double_eos=True)\n            self.test = self.vocab.encode_file(os.path.join(path, \"test.txt\"), ordered=False, add_double_eos=True)\n\n    def get_iterator(self, split, *args, **kwargs):\n        if split == \"train\":\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(self.train, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                kwargs[\"shuffle\"] = True\n                data_iter = LMMultiFileIterator(self.train, self.vocab, *args, **kwargs)\n        elif split in [\"valid\", \"test\"]:\n            data = self.valid if split == \"valid\" else self.test\n            if self.dataset in [\"ptb\", \"wt2\", \"wt103\", \"enwik8\", \"text8\"]:\n                data_iter = LMOrderedIterator(data, *args, **kwargs)\n            elif self.dataset == \"lm1b\":\n                data_iter = LMShuffledIterator(data, *args, **kwargs)\n\n        return data_iter\n\n\ndef get_lm_corpus(datadir, dataset):\n    fn = os.path.join(datadir, \"cache.pt\")\n    fn_pickle = os.path.join(datadir, \"cache.pkl\")\n    if os.path.exists(fn):\n        logger.info(\"Loading cached dataset...\")\n        corpus = torch.load(fn_pickle)\n    elif os.path.exists(fn):\n        logger.info(\"Loading cached dataset from pickle...\")\n        with open(fn, \"rb\") as fp:\n            corpus = pickle.load(fp)\n    else:\n        logger.info(\"Producing dataset {}...\".format(dataset))\n        kwargs = {}\n        if dataset in [\"wt103\", \"wt2\"]:\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = False\n        elif dataset == \"ptb\":\n            kwargs[\"special\"] = [\"<eos>\"]\n            kwargs[\"lower_case\"] = True\n        elif dataset == \"lm1b\":\n            kwargs[\"special\"] = []\n            kwargs[\"lower_case\"] = False\n            kwargs[\"vocab_file\"] = os.path.join(datadir, \"1b_word_vocab.txt\")\n        elif dataset in [\"enwik8\", \"text8\"]:\n            pass\n\n        corpus = TransfoXLCorpus(datadir, dataset, **kwargs)\n        torch.save(corpus, fn)\n\n    return corpus\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_utils.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for python and fast tokenizers. Fast tokenizers are provided by HuggingFace's tokenizers library.\"\"\"\n\nimport copy\nimport functools\nimport itertools\nimport json\nimport logging\nimport operator\nimport os\nimport re\nimport warnings\nfrom collections import UserDict, defaultdict\nfrom contextlib import contextmanager\nfrom typing import Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union\n\nfrom tokenizers import AddedToken as AddedTokenFast\nfrom tokenizers import Encoding as EncodingFast\nfrom tokenizers.decoders import Decoder as DecoderFast\nfrom tokenizers.implementations import BaseTokenizer as BaseTokenizerFast\n\nfrom .file_utils import cached_path, hf_bucket_url, is_remote_url, is_tf_available, is_torch_available, torch_required\n\n\nif is_tf_available():\n    import tensorflow as tf\nif is_torch_available():\n    import torch\n\nlogger = logging.getLogger(__name__)\n\nSPECIAL_TOKENS_MAP_FILE = \"special_tokens_map.json\"\nADDED_TOKENS_FILE = \"added_tokens.json\"\nTOKENIZER_CONFIG_FILE = \"tokenizer_config.json\"\n\nVERY_LARGE_INTEGER = int(1e30)  # This is used to set the max input length for a model with infinite size input\nLARGE_INTEGER = int(1e20)  # This is used when we need something big but slightly smaller than VERY_LARGE_INTEGER\n\n# Define type aliases and NamedTuples\nTextInput = str\nPreTokenizedInput = List[str]\nEncodedInput = List[int]\nTextInputPair = Tuple[str, str]\nPreTokenizedInputPair = Tuple[List[str], List[str]]\nEncodedInputPair = Tuple[List[int], List[int]]\n\n\nclass CharSpan(NamedTuple):\n    \"\"\" Character span in the original string\n\n        Args:\n            start: index of the first character in the original string\n            end: index of the character following the last character in the original string\n    \"\"\"\n\n    start: int\n    end: int\n\n\nclass TokenSpan(NamedTuple):\n    \"\"\" Token span in an encoded string (list of tokens)\n\n        Args:\n            start: index of the first token in the span\n            end: index of the token following the last token in the span\n    \"\"\"\n\n    start: int\n    end: int\n\n\ndef flatten(x: Sequence):\n    \"\"\"\n    Flatten the provided (potentially nested) sequence\n\n    Args:\n        x (Sequence): Potentially nested sequence to flatten\n\n    Returns:\n        list: Flattened sequence\n    \"\"\"\n\n    return functools.reduce(operator.iconcat, x, [])\n\n\n@contextmanager\ndef truncate_and_pad(\n    tokenizer: BaseTokenizerFast,\n    max_length: int,\n    stride: int,\n    strategy: str,\n    pad_to_max_length: bool,\n    padding_side: str,\n    pad_token_id: int,\n    pad_token_type_id: int,\n    pad_token: str,\n):\n    \"\"\" This contextmanager is in charge of defining the truncation and the padding strategies for fast tokenizers\n        (provided by HuggingFace tokenizers library) and restore the 
tokenizer settings afterwards.\n\n        This contextmanager assumes the provider tokenizer has no padding / truncation strategy\n        before the managed section. If your tokenizer set a padding / truncation strategy before,\n        then it will be reset to no padding/truncation when exiting the managed section.\n\n        Args:\n            tokenizer (BaseTokenizerFast): The tokenizer which will be used\n            max_length (int): The maximum size of the sequence\n            stride (int): The stride to use when handling overflow\n            strategy (str): Overflowing logic to use\n            pad_to_max_length (bool): Boolean indicating if the output needs to be padded up to max_length\n            padding_side (str): \"left\" or \"right\" indicating the direction the output sequence will be padded\n            pad_token_id (int): The integer representation of the padding token to use\n            pad_token_type_id (int): The integer representation of the padding token type to use\n            pad_token (str): The string representation of the padding token to use\n\n    \"\"\"\n\n    # Handle all the truncation and padding stuff\n    if max_length is not None:\n        tokenizer.enable_truncation(max_length, stride=stride, strategy=strategy)\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.enable_padding(\n            max_length=max_length,\n            direction=padding_side,\n            pad_id=pad_token_id,\n            pad_type_id=pad_token_type_id,\n            pad_token=pad_token,\n        )\n    elif pad_to_max_length:\n        logger.warning(\n            \"Disabled padding because no padding token set (pad_token: {}, pad_token_id: {}).\\n\"\n            \"To remove this error, you can add a new pad token and then resize model embedding:\\n\"\n            \"\\ttokenizer.pad_token = '<PAD>'\\n\\tmodel.resize_token_embeddings(len(tokenizer))\".format(\n                pad_token, pad_token_id\n            )\n        )\n\n    yield\n\n    # TODO(morgan, anthony): once we have a simple way to serialize tokenizers maybe store and restore the state afterward\n    # to avoid destructing the padding / truncation strategy as we do now.\n\n    if max_length is not None:\n        tokenizer.no_truncation()\n\n    if pad_to_max_length and (pad_token and pad_token_id >= 0):\n        tokenizer.no_padding()\n\n\nclass BatchEncoding(UserDict):\n    \"\"\" BatchEncoding hold the output of the encode and batch_encode methods (tokens, attention_masks, etc).\n        This class is derived from a python Dictionary and can be used as a dictionnary.\n        In addition, this class expose utility methods to map from word/char space to token space.\n\n        Args:\n            data (:obj:`dict`): Dictionary of lists/arrays returned by the encode/batch_encode methods ('input_ids', 'attention_mask'...)\n            encoding (:obj:`EncodingFast`, :obj:`list(EncodingFast)`, `optional`, defaults to :obj:`None`):\n                If the tokenizer is a fast tokenizer which outputs additional informations like mapping from word/char space to token space\n                the `EncodingFast` instance or list of instance (for batches) hold these informations.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        data: Optional[Dict[str, Any]] = None,\n        encoding: Optional[Union[EncodingFast, Sequence[EncodingFast]]] = None,\n    ):\n        super().__init__(data)\n\n        if isinstance(encoding, EncodingFast):\n            encoding = [encoding]\n\n        
self._encodings = encoding\n\n    def __getitem__(self, item: Union[int, str]) -> EncodingFast:\n        \"\"\" If the key is a string, get the value of the dict associated to `key` ('input_ids', 'attention_mask'...)\n            If the key is an integer, get the EncodingFast for batch item with index `key`\n        \"\"\"\n        if isinstance(item, str):\n            return self.data[item]\n        elif self._encodings is not None:\n            return self._encodings[item]\n        else:\n            raise KeyError(\n                \"Indexing with integers (to access backend Encoding for a given batch index) \"\n                \"is not available when using Python based tokenizers\"\n            )\n\n    def __getattr__(self, item: str):\n        return self.data[item]\n\n    def keys(self):\n        return self.data.keys()\n\n    def values(self):\n        return self.data.values()\n\n    def items(self):\n        return self.data.items()\n\n    # After this point:\n    # Extended properties and methods only available for fast (Rust-based) tokenizers\n    # provided by HuggingFace tokenizers library.\n\n    @property\n    def encodings(self) -> Optional[List[EncodingFast]]:\n        \"\"\"\n        Return the list all encoding from the tokenization process\n\n        Returns: List[EncodingFast] or None if input was tokenized through Python (i.e. not fast) tokenizer\n        \"\"\"\n        return self._encodings\n\n    def tokens(self, batch_index: int = 0) -> List[int]:\n        if not self._encodings:\n            raise ValueError(\"tokens() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].tokens\n\n    def words(self, batch_index: int = 0) -> List[Optional[int]]:\n        if not self._encodings:\n            raise ValueError(\"words() is not available when using Python based tokenizers\")\n        return self._encodings[batch_index].words\n\n    def token_to_word(self, batch_or_token_index: int, token_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the word corresponding (i.e. comprising) to an encoded token\n            in a sequence of the batch.\n\n            Can be called as:\n                - self.token_to_word(token_index) if batch size is 1\n                - self.token_to_word(batch_index, token_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token in the sequence.\n\n        Returns:\n            word_index (:obj:`int`):\n                index of the word in the input sequence.\n\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_word() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if token_index < 0:\n            token_index = self._seq_len + token_index\n        return self._encodings[batch_index].token_to_word(token_index)\n\n    def word_to_tokens(self, batch_or_word_index: int, word_index: Optional[int] = None) -> TokenSpan:\n        \"\"\" Get the encoded token span corresponding to a word in the sequence of the batch.\n\n            Token spans are returned as a TokenSpan NamedTuple with:\n                start: index of the first token\n                end: index of the token following the last token\n\n            Can be called as:\n                - self.word_to_tokens(word_index) if batch size is 1\n                - self.word_to_tokens(batch_index, word_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprises one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            token_span (:obj:`TokenSpan`):\n                Span of tokens in the encoded sequence.\n\n                TokenSpan are NamedTuple with:\n                    start: index of the first token\n                    end: index of the token following the last token\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_tokens() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        if batch_index < 0:\n            batch_index = self._batch_size + batch_index\n        if word_index < 0:\n            word_index = self._seq_len + word_index\n        return TokenSpan(*(self._encodings[batch_index].word_to_tokens(word_index)))\n\n    def token_to_chars(self, batch_or_token_index: int, token_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span corresponding to an encoded token in a sequence of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string associated to the token\n                end: index of the character following the last character in the original string associated to the token\n\n            Can be called as:\n                - self.token_to_chars(token_index) if batch size is 1\n                - self.token_to_chars(batch_index, token_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_token_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the token in the sequence\n            token_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the token or tokens in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan`):\n                Span of characters in the original string.\n\n                CharSpan are NamedTuple with:\n                    start: index of the first character in the original string\n                    end: index of the character following the last character in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"token_to_chars() is not available when using Python based tokenizers\")\n        if token_index is not None:\n            batch_index = batch_or_token_index\n        else:\n            batch_index = 0\n            token_index = batch_or_token_index\n        return CharSpan(*(self._encodings[batch_index].token_to_chars(token_index)))\n\n    def char_to_token(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the index of the token in the encoded output comprising a character\n            in the original string for a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_token(char_index) if batch size is 1\n                - self.char_to_token(batch_index, char_index) if batch size is greater or equal to 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n\n        Returns:\n            token_index (:obj:`int`):\n                Index of the token.\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_token() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_token(char_index)\n\n    def word_to_chars(self, batch_or_word_index: int, word_index: Optional[int] = None) -> CharSpan:\n        \"\"\" Get the character span in the original string corresponding to given word in a sequence\n            of the batch.\n\n            Character spans are returned as a CharSpan NamedTuple with:\n                start: index of the first character in the original string\n                end: index of the character following the last character in the original string\n\n            Can be called as:\n                - self.word_to_chars(word_index) if batch size is 1\n                - self.word_to_chars(batch_index, word_index) if batch size is greater or equal to 1\n\n        Args:\n            batch_or_word_index (:obj:`int`):\n                Index of the sequence in the batch. 
If the batch only comprise one sequence,\n                this can be the index of the word in the sequence\n            word_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the word in the sequence.\n\n        Returns:\n            char_span (:obj:`CharSpan` or :obj:`List[CharSpan]`):\n                Span(s) of the associated character or characters in the string.\n                CharSpan are NamedTuple with:\n                    start: index of the first character associated to the token in the original string\n                    end: index of the character following the last character associated to the token in the original string\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"word_to_chars() is not available when using Python based tokenizers\")\n        if word_index is not None:\n            batch_index = batch_or_word_index\n        else:\n            batch_index = 0\n            word_index = batch_or_word_index\n        return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index)))\n\n    def char_to_word(self, batch_or_char_index: int, char_index: Optional[int] = None) -> int:\n        \"\"\" Get the word in the original string corresponding to a character in the original string of\n            a sequence of the batch.\n\n            Can be called as:\n                - self.char_to_word(char_index) if batch size is 1\n                - self.char_to_word(batch_index, char_index) if batch size is greater than 1\n\n            This method is particularly suited when the input sequences are provided as\n            pre-tokenized sequences (i.e. words are defined by the user). In this case it allows\n            to easily associate encoded tokens with provided tokenized words.\n\n        Args:\n            batch_or_char_index (:obj:`int`):\n                Index of the sequence in the batch. If the batch only comprise one sequence,\n                this can be the index of the character in the orginal string.\n            char_index (:obj:`int`, `optional`):\n                If a batch index is provided in `batch_or_token_index`, this can be the index\n                of the character in the orginal string.\n\n\n        Returns:\n            token_index (:obj:`int` or :obj:`List[int]`):\n                Index or indices of the associated encoded token(s).\n        \"\"\"\n\n        if not self._encodings:\n            raise ValueError(\"char_to_word() is not available when using Python based tokenizers\")\n        if char_index is not None:\n            batch_index = batch_or_char_index\n        else:\n            batch_index = 0\n            char_index = batch_or_char_index\n        return self._encodings[batch_index].char_to_word(char_index)\n\n    @torch_required\n    def to(self, device: str):\n        \"\"\"Send all values to device by calling v.to(device)\"\"\"\n        self.data = {k: v.to(device) for k, v in self.data.items()}\n        return self\n\n\nclass SpecialTokensMixin:\n    \"\"\" SpecialTokensMixin is derived by ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` and\n        handles specific behaviors related to special tokens. 
In particular, this class hold the\n        attributes which can be used to directly access to these special tokens in a\n        model-independant manner and allow to set and update the special tokens.\n    \"\"\"\n\n    SPECIAL_TOKENS_ATTRIBUTES = [\n        \"bos_token\",\n        \"eos_token\",\n        \"unk_token\",\n        \"sep_token\",\n        \"pad_token\",\n        \"cls_token\",\n        \"mask_token\",\n        \"additional_special_tokens\",\n    ]\n\n    def __init__(self, **kwargs):\n        self._bos_token = None\n        self._eos_token = None\n        self._unk_token = None\n        self._sep_token = None\n        self._pad_token = None\n        self._cls_token = None\n        self._mask_token = None\n        self._pad_token_type_id = 0\n        self._additional_special_tokens = []\n\n        for key, value in kwargs.items():\n            if key in self.SPECIAL_TOKENS_ATTRIBUTES:\n                if key == \"additional_special_tokens\":\n                    assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                    setattr(self, key, value)\n                elif isinstance(value, AddedTokenFast):\n                    setattr(self, key, str(value))\n                elif isinstance(value, str):\n                    setattr(self, key, value)\n                else:\n                    raise TypeError(\n                        \"special token {} has to be either str or AddedTokenFast but got: {}\".format(key, type(value))\n                    )\n\n    @property\n    def bos_token(self):\n        \"\"\" Beginning of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._bos_token is None:\n            logger.error(\"Using bos_token, but it is not set yet.\")\n        return self._bos_token\n\n    @property\n    def eos_token(self):\n        \"\"\" End of sentence token (string). Log an error if used while not having been set. \"\"\"\n        if self._eos_token is None:\n            logger.error(\"Using eos_token, but it is not set yet.\")\n        return self._eos_token\n\n    @property\n    def unk_token(self):\n        \"\"\" Unknown token (string). Log an error if used while not having been set. \"\"\"\n        if self._unk_token is None:\n            logger.error(\"Using unk_token, but it is not set yet.\")\n        return self._unk_token\n\n    @property\n    def sep_token(self):\n        \"\"\" Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        if self._sep_token is None:\n            logger.error(\"Using sep_token, but it is not set yet.\")\n        return self._sep_token\n\n    @property\n    def pad_token(self):\n        \"\"\" Padding token (string). Log an error if used while not having been set. \"\"\"\n        if self._pad_token is None:\n            logger.error(\"Using pad_token, but it is not set yet.\")\n        return self._pad_token\n\n    @property\n    def cls_token(self):\n        \"\"\" Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. \"\"\"\n        if self._cls_token is None:\n            logger.error(\"Using cls_token, but it is not set yet.\")\n        return self._cls_token\n\n    @property\n    def mask_token(self):\n        \"\"\" Mask token (string). E.g. when training a model with masked-language modeling. 
Log an error if used while not having been set. \"\"\"\n        if self._mask_token is None:\n            logger.error(\"Using mask_token, but it is not set yet.\")\n        return self._mask_token\n\n    @property\n    def additional_special_tokens(self):\n        \"\"\" All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. \"\"\"\n        if self._additional_special_tokens is None:\n            logger.error(\"Using additional_special_tokens, but it is not set yet.\")\n        return self._additional_special_tokens\n\n    def _maybe_update_backend(self, value):\n        \"\"\" To be overriden by derived class if a backend tokenizer has to be updated. \"\"\"\n        pass\n\n    @bos_token.setter\n    def bos_token(self, value):\n        self._bos_token = value\n        self._maybe_update_backend([value])\n\n    @eos_token.setter\n    def eos_token(self, value):\n        self._eos_token = value\n        self._maybe_update_backend([value])\n\n    @unk_token.setter\n    def unk_token(self, value):\n        self._unk_token = value\n        self._maybe_update_backend([value])\n\n    @sep_token.setter\n    def sep_token(self, value):\n        self._sep_token = value\n        self._maybe_update_backend([value])\n\n    @pad_token.setter\n    def pad_token(self, value):\n        self._pad_token = value\n        self._maybe_update_backend([value])\n\n    @cls_token.setter\n    def cls_token(self, value):\n        self._cls_token = value\n        self._maybe_update_backend([value])\n\n    @mask_token.setter\n    def mask_token(self, value):\n        self._mask_token = value\n        self._maybe_update_backend([value])\n\n    @additional_special_tokens.setter\n    def additional_special_tokens(self, value):\n        self._additional_special_tokens = value\n        self._maybe_update_backend(value)\n\n    @property\n    def bos_token_id(self):\n        \"\"\" Id of the beginning of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.bos_token)\n\n    @property\n    def eos_token_id(self):\n        \"\"\" Id of the end of sentence token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.eos_token)\n\n    @property\n    def unk_token_id(self):\n        \"\"\" Id of the unknown token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.unk_token)\n\n    @property\n    def sep_token_id(self):\n        \"\"\" Id of the separation token in the vocabulary. E.g. separate context and query in an input sequence. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.sep_token)\n\n    @property\n    def pad_token_id(self):\n        \"\"\" Id of the padding token in the vocabulary. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.pad_token)\n\n    @property\n    def pad_token_type_id(self):\n        \"\"\" Id of the padding token type in the vocabulary.\"\"\"\n        return self._pad_token_type_id\n\n    @property\n    def cls_token_id(self):\n        \"\"\" Id of the classification token in the vocabulary. E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. 
\"\"\"\n        return self.convert_tokens_to_ids(self.cls_token)\n\n    @property\n    def mask_token_id(self):\n        \"\"\" Id of the mask token in the vocabulary. E.g. when training a model with masked-language modeling. Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.mask_token)\n\n    @property\n    def additional_special_tokens_ids(self):\n        \"\"\" Ids of all the additional special tokens in the vocabulary (list of integers). Log an error if used while not having been set. \"\"\"\n        return self.convert_tokens_to_ids(self.additional_special_tokens)\n\n    @property\n    def special_tokens_map(self):\n        \"\"\" A dictionary mapping special token class attribute (cls_token, unk_token...) to their\n            values ('<unk>', '<cls>'...)\n        \"\"\"\n        set_attr = {}\n        for attr in self.SPECIAL_TOKENS_ATTRIBUTES:\n            attr_value = getattr(self, \"_\" + attr)\n            if attr_value:\n                set_attr[attr] = attr_value\n        return set_attr\n\n    @property\n    def all_special_tokens(self):\n        \"\"\" List all the special tokens ('<unk>', '<cls>'...) mapped to class attributes\n            (cls_token, unk_token...).\n        \"\"\"\n        all_toks = []\n        set_attr = self.special_tokens_map\n        for attr_value in set_attr.values():\n            all_toks = all_toks + (list(attr_value) if isinstance(attr_value, (list, tuple)) else [attr_value])\n        all_toks = list(set(all_toks))\n        return all_toks\n\n    @property\n    def all_special_ids(self):\n        \"\"\" List the vocabulary indices of the special tokens ('<unk>', '<cls>'...) mapped to\n            class attributes (cls_token, unk_token...).\n        \"\"\"\n        all_toks = self.all_special_tokens\n        all_ids = self.convert_tokens_to_ids(all_toks)\n        return all_ids\n\n\nclass PreTrainedTokenizer(SpecialTokensMixin):\n    \"\"\" Base class for all tokenizers.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as 
associated values, a dictionary of specific arguments to pass to the\n            ``__init__`` method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided and no associated max_length can be\n            found in ``max_model_input_sizes``, will default to VERY_LARGE_INTEGER (`int(1e30)`).\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensures they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    vocab_files_names: Dict[str, str] = {}\n    pretrained_vocab_files_map: Dict[str, Dict[str, str]] = {}\n    pretrained_init_configuration: Dict[str, Dict[str, Any]] = {}\n    max_model_input_sizes: Dict[str, int] = {}\n    model_input_names: List[str] = [\"token_type_ids\", \"attention_mask\"]\n\n    padding_side: str = \"right\"\n\n    NO_PAD_TOKEN_FOR_BATCH_MSG = (\n        \"No padding token is set for this model, therefore no batch can be made with uneven \"\n        \"sequences. Set a padding token or adjust the lengths of the sequences building the \"\n        \"batch so that every sequence is of the same length.\"\n    )\n\n    UNEVEN_SEQUENCES_FOR_BATCH_MSG = (\n        \"The sequences building the batch are not of the same size, no tensor \"\n        \"can be built. 
Set `pad_to_max_length=True` to pad the smaller sequences\"\n        \"up to the larger sequence's length.\"\n    )\n\n    @property\n    def vocab_size(self) -> int:\n        \"\"\" Size of the base vocabulary (without the added tokens) \"\"\"\n        raise NotImplementedError\n\n    @property\n    def is_fast(self) -> bool:\n        return False\n\n    @property\n    def max_len(self) -> int:\n        \"\"\" Kept here for backward compatibility.\n            Now renamed to `model_max_length` to avoid ambiguity.\n        \"\"\"\n        return self.model_max_length\n\n    @property\n    def max_len_single_sentence(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=False)\n\n    @property\n    def max_len_sentences_pair(self) -> int:\n        return self.model_max_length - self.num_special_tokens_to_add(pair=True)\n\n    @max_len_single_sentence.setter\n    def max_len_single_sentence(self, value) -> int:\n        \"\"\" For backward compatibility, allow to try to setup 'max_len_single_sentence' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=False):\n            logger.warning(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_single_sentence' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    @max_len_sentences_pair.setter\n    def max_len_sentences_pair(self, value) -> int:\n        \"\"\" For backward compatibility, allow to try to setup 'max_len_sentences_pair' \"\"\"\n        if value == self.model_max_length - self.num_special_tokens_to_add(pair=True):\n            logger.warning(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n        else:\n            raise ValueError(\n                \"Setting 'max_len_sentences_pair' is now deprecated. \" \"This value is automatically set up.\"\n            )\n\n    def get_vocab(self):\n        \"\"\" Returns the vocabulary as a dict of {token: index} pairs. `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the vocab. \"\"\"\n        raise NotImplementedError()\n\n    def __init__(self, model_max_length=None, **kwargs):\n\n        super().__init__(**kwargs)\n\n        # For backward compatibility we fallback to set model_max_length from max_len if provided\n        if \"max_len\" in kwargs:\n            warnings.warn(\n                \"Parameter max_len is deprecated and will be removed in a future release. \"\n                \"Use model_max_length instead.\",\n                category=FutureWarning,\n            )\n\n            model_max_length = kwargs.pop(\"max_len\")\n        self.model_max_length = model_max_length if model_max_length is not None else VERY_LARGE_INTEGER\n\n        # Padding side is right by default and overridden in subclasses. 
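(For example, XLNet-style tokenizers pad on the left.) 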
If specified in the kwargs, it is changed.\n        self.padding_side = kwargs.pop(\"padding_side\", self.padding_side)\n        assert self.padding_side in [\n            \"right\",\n            \"left\",\n        ], f\"Padding side should be selected between 'right' and 'left', current value: {self.padding_side}\"\n        self.model_input_names = kwargs.pop(\"model_input_names\", self.model_input_names)\n\n        # Added tokens\n        self.added_tokens_encoder = {}\n        self.unique_added_tokens_encoder = set()\n        self.added_tokens_decoder = {}\n\n        # inputs and kwargs for saving and re-loading (see ``from_pretrained`` and ``save_pretrained``)\n        self.init_inputs = ()\n        self.init_kwargs = {}\n\n    def __len__(self):\n        \"\"\" Size of the full vocabulary with the added tokens \"\"\"\n        return self.vocab_size + len(self.added_tokens_encoder)\n\n    @classmethod\n    def from_pretrained(cls, *inputs, **kwargs):\n        r\"\"\"\n        Instantiate a :class:`~transformers1.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.\n\n        Args:\n            pretrained_model_name_or_path: either:\n\n                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.\n                - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.\n                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers1.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.\n                - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.\n\n            cache_dir: (`optional`) string:\n                Path to a directory in which downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.\n\n            force_download: (`optional`) boolean, default False:\n                Force to (re-)download the vocabulary files and override the cached versions if they exist.\n\n            resume_download: (`optional`) boolean, default False:\n                Do not delete an incompletely received file. Attempt to resume the download if such a file exists.\n\n            proxies: (`optional`) dict, default None:\n                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.\n                The proxies are used on each request.\n\n            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.\n\n            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. 
See parameters in the doc string of :class:`~transformers1.PreTrainedTokenizer` for details.\n\n        Examples::\n\n            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer\n\n            # Download vocabulary from S3 and cache.\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\n            # Download vocabulary from S3 (user-uploaded) and cache.\n            tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')\n\n            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')\n\n            # If the tokenizer uses a single vocabulary file, you can point directly to this file\n            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')\n\n            # You can link tokens to special vocabulary when instantiating\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')\n            # You should be sure '<unk>' is in the vocabulary when doing that.\n            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)\n            assert tokenizer.unk_token == '<unk>'\n\n        \"\"\"\n        return cls._from_pretrained(*inputs, **kwargs)\n\n    @classmethod\n    def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):\n        cache_dir = kwargs.pop(\"cache_dir\", None)\n        force_download = kwargs.pop(\"force_download\", False)\n        resume_download = kwargs.pop(\"resume_download\", False)\n        proxies = kwargs.pop(\"proxies\", None)\n        local_files_only = kwargs.pop(\"local_files_only\", False)\n\n        s3_models = list(cls.max_model_input_sizes.keys())\n        vocab_files = {}\n        init_configuration = {}\n        if pretrained_model_name_or_path in s3_models:\n            # Get the vocabulary from AWS S3 bucket\n            for file_id, map_list in cls.pretrained_vocab_files_map.items():\n                vocab_files[file_id] = map_list[pretrained_model_name_or_path]\n            if (\n                cls.pretrained_init_configuration\n                and pretrained_model_name_or_path in cls.pretrained_init_configuration\n            ):\n                init_configuration = cls.pretrained_init_configuration[pretrained_model_name_or_path].copy()\n        else:\n            # Get the vocabulary from local files\n            logger.info(\n                \"Model name '{}' not found in model shortcut name list ({}). 
\"\n                \"Assuming '{}' is a path, a model identifier, or url to a directory containing tokenizer files.\".format(\n                    pretrained_model_name_or_path, \", \".join(s3_models), pretrained_model_name_or_path\n                )\n            )\n\n            if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):\n                if len(cls.vocab_files_names) > 1:\n                    raise ValueError(\n                        f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not supported.\"\n                        \"Use a model identifier or the path to a directory instead.\"\n                    )\n                logger.warning(\n                    f\"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is deprecated\"\n                )\n                file_id = list(cls.vocab_files_names.keys())[0]\n                vocab_files[file_id] = pretrained_model_name_or_path\n            else:\n                # At this point pretrained_model_name_or_path is either a directory or a model identifier name\n                additional_files_names = {\n                    \"added_tokens_file\": ADDED_TOKENS_FILE,\n                    \"special_tokens_map_file\": SPECIAL_TOKENS_MAP_FILE,\n                    \"tokenizer_config_file\": TOKENIZER_CONFIG_FILE,\n                }\n                # Look for the tokenizer main vocabulary files + the additional tokens files\n                for file_id, file_name in {**cls.vocab_files_names, **additional_files_names}.items():\n                    if os.path.isdir(pretrained_model_name_or_path):\n                        full_file_name = os.path.join(pretrained_model_name_or_path, file_name)\n                        if not os.path.exists(full_file_name):\n                            logger.info(\"Didn't find file {}. We won't load it.\".format(full_file_name))\n                            full_file_name = None\n                    else:\n                        full_file_name = hf_bucket_url(\n                            pretrained_model_name_or_path, filename=file_name, use_cdn=False\n                        )\n\n                    vocab_files[file_id] = full_file_name\n\n        # Get files from url, cache, or disk depending on the case\n        try:\n            resolved_vocab_files = {}\n            for file_id, file_path in vocab_files.items():\n                if file_path is None:\n                    resolved_vocab_files[file_id] = None\n                else:\n                    resolved_vocab_files[file_id] = cached_path(\n                        file_path,\n                        cache_dir=cache_dir,\n                        force_download=force_download,\n                        proxies=proxies,\n                        resume_download=resume_download,\n                        local_files_only=local_files_only,\n                    )\n        except EnvironmentError:\n            if pretrained_model_name_or_path in s3_models:\n                msg = \"Couldn't reach server at '{}' to download vocabulary files.\"\n            else:\n                msg = (\n                    \"Model name '{}' was not found in tokenizers model name list ({}). 
\"\n                    \"We assumed '{}' was a path or url to a directory containing vocabulary files \"\n                    \"named {}, but couldn't find such vocabulary files at this path or url.\".format(\n                        pretrained_model_name_or_path,\n                        \", \".join(s3_models),\n                        pretrained_model_name_or_path,\n                        list(cls.vocab_files_names.values()),\n                    )\n                )\n\n            raise EnvironmentError(msg)\n\n        if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):\n            raise EnvironmentError(\n                \"Model name '{}' was not found in tokenizers model name list ({}). \"\n                \"We assumed '{}' was a path, a model identifier, or url to a directory containing vocabulary files \"\n                \"named {} but couldn't find such vocabulary files at this path or url.\".format(\n                    pretrained_model_name_or_path,\n                    \", \".join(s3_models),\n                    pretrained_model_name_or_path,\n                    list(cls.vocab_files_names.values()),\n                )\n            )\n\n        for file_id, file_path in vocab_files.items():\n            if file_path == resolved_vocab_files[file_id]:\n                logger.info(\"loading file {}\".format(file_path))\n            else:\n                logger.info(\"loading file {} from cache at {}\".format(file_path, resolved_vocab_files[file_id]))\n\n        # Prepare tokenizer initialization kwargs\n        # Did we saved some inputs and kwargs to reload ?\n        tokenizer_config_file = resolved_vocab_files.pop(\"tokenizer_config_file\", None)\n        if tokenizer_config_file is not None:\n            with open(tokenizer_config_file, encoding=\"utf-8\") as tokenizer_config_handle:\n                init_kwargs = json.load(tokenizer_config_handle)\n            saved_init_inputs = init_kwargs.pop(\"init_inputs\", ())\n            if not init_inputs:\n                init_inputs = saved_init_inputs\n        else:\n            init_kwargs = init_configuration\n\n        # Update with newly provided kwargs\n        init_kwargs.update(kwargs)\n\n        # Set max length if needed\n        if pretrained_model_name_or_path in cls.max_model_input_sizes:\n            # if we're using a pretrained model, ensure the tokenizer\n            # wont index sequences longer than the number of positional embeddings\n            model_max_length = cls.max_model_input_sizes[pretrained_model_name_or_path]\n            if model_max_length is not None and isinstance(model_max_length, (int, float)):\n                init_kwargs[\"model_max_length\"] = min(init_kwargs.get(\"model_max_length\", int(1e30)), model_max_length)\n\n        # Merge resolved_vocab_files arguments in init_kwargs.\n        added_tokens_file = resolved_vocab_files.pop(\"added_tokens_file\", None)\n        special_tokens_map_file = resolved_vocab_files.pop(\"special_tokens_map_file\", None)\n        for args_name, file_path in resolved_vocab_files.items():\n            if args_name not in init_kwargs:\n                init_kwargs[args_name] = file_path\n        if special_tokens_map_file is not None:\n            with open(special_tokens_map_file, encoding=\"utf-8\") as special_tokens_map_handle:\n                special_tokens_map = json.load(special_tokens_map_handle)\n            for key, value in special_tokens_map.items():\n                if key not in init_kwargs:\n                    
init_kwargs[key] = value\n\n        # Instantiate tokenizer.\n        try:\n            tokenizer = cls(*init_inputs, **init_kwargs)\n        except OSError:\n            raise OSError(\n                \"Unable to load vocabulary from file. \"\n                \"Please check that the provided vocabulary is accessible and not corrupted.\"\n            )\n\n        # Save inputs and kwargs for saving and re-loading with ``save_pretrained``\n        tokenizer.init_inputs = init_inputs\n        tokenizer.init_kwargs = init_kwargs\n\n        # update unique_added_tokens_encoder with special tokens for correct tokenization\n        tokenizer.unique_added_tokens_encoder.update(set(tokenizer.all_special_tokens))\n\n        # Add supplementary tokens.\n        if added_tokens_file is not None:\n            with open(added_tokens_file, encoding=\"utf-8\") as added_tokens_handle:\n                added_tok_encoder = json.load(added_tokens_handle)\n            added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n            tokenizer.added_tokens_encoder.update(added_tok_encoder)\n            tokenizer.added_tokens_decoder.update(added_tok_decoder)\n            tokenizer.unique_added_tokens_encoder.update(set(tokenizer.added_tokens_encoder.keys()))\n\n        return tokenizer\n\n    def save_pretrained(self, save_directory):\n        \"\"\" Save the tokenizer vocabulary files together with:\n                - added tokens,\n                - special-tokens-to-class-attributes-mapping,\n                - tokenizer instantiation positional and keywords inputs (e.g. do_lower_case for Bert).\n\n            Warning: This won't save modifications you may have applied to the tokenizer after the instantiation\n            (e.g. modifying tokenizer.do_lower_case after creation).\n\n            This method make sure the full tokenizer can then be re-loaded using the\n            :func:`~transformers1.PreTrainedTokenizer.from_pretrained` class method.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Saving directory ({}) should be a directory\".format(save_directory))\n            return\n\n        special_tokens_map_file = os.path.join(save_directory, SPECIAL_TOKENS_MAP_FILE)\n        added_tokens_file = os.path.join(save_directory, ADDED_TOKENS_FILE)\n        tokenizer_config_file = os.path.join(save_directory, TOKENIZER_CONFIG_FILE)\n\n        tokenizer_config = copy.deepcopy(self.init_kwargs)\n        if len(self.init_inputs) > 0:\n            tokenizer_config[\"init_inputs\"] = copy.deepcopy(self.init_inputs)\n        for file_id in self.vocab_files_names.keys():\n            tokenizer_config.pop(file_id, None)\n\n        with open(tokenizer_config_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(tokenizer_config, ensure_ascii=False))\n\n        with open(special_tokens_map_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.special_tokens_map, ensure_ascii=False))\n\n        if len(self.added_tokens_encoder) > 0:\n            with open(added_tokens_file, \"w\", encoding=\"utf-8\") as f:\n                out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)\n                f.write(out_str)\n\n        vocab_files = self.save_vocabulary(save_directory)\n\n        return vocab_files + (special_tokens_map_file, added_tokens_file)\n\n    def save_vocabulary(self, save_directory) -> Tuple[str]:\n        \"\"\" Save the tokenizer vocabulary to a directory. 
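Subclasses implement this method and typically write out the plain vocabulary file(s), e.g. ``vocab.txt`` for WordPiece-style tokenizers. 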
This method does *NOT* save added tokens\n            and special token mappings.\n\n            Please use :func:`~transformers1.PreTrainedTokenizer.save_pretrained` `()` to save the full\n            Tokenizer state if you want to reload it using the :func:`~transformers1.PreTrainedTokenizer.from_pretrained`\n            class method.\n        \"\"\"\n        raise NotImplementedError\n\n    def add_tokens(self, new_tokens: Union[str, List[str]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string. Each string is a token to add. Tokens are only added if they are not\n            already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.\n        \"\"\"\n        if not new_tokens:\n            return 0\n\n        if not isinstance(new_tokens, list):\n            new_tokens = [new_tokens]\n\n        tokens_to_add = []\n        for token in new_tokens:\n            assert isinstance(token, str)\n            if self.init_kwargs.get(\"do_lower_case\", False) and token not in self.all_special_tokens:\n                token = token.lower()\n            if (\n                token != self.unk_token\n                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)\n                and token not in tokens_to_add\n            ):\n                tokens_to_add.append(token)\n                logger.info(\"Adding %s to the vocabulary\", token)\n\n        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))\n        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}\n        self.added_tokens_encoder.update(added_tok_encoder)\n        self.unique_added_tokens_encoder = set(self.added_tokens_encoder.keys()).union(set(self.all_special_tokens))\n        self.added_tokens_decoder.update(added_tok_decoder)\n\n        return len(tokens_to_add)\n\n    def num_special_tokens_to_add(self, pair=False):\n        \"\"\"\n        Returns the number of added tokens when encoding a sequence with special tokens.\n\n        Note:\n            This encodes inputs and checks the number of added tokens, and is therefore not efficient. 
Do not put this\n            inside your training loop.\n\n        Args:\n            pair: Returns the number of added tokens in the case of a sequence pair if set to True, returns the\n                number of added tokens in the case of a single sequence if set to False.\n\n        Returns:\n            Number of tokens added to sequences\n        \"\"\"\n        token_ids_0 = []\n        token_ids_1 = []\n        return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))\n\n    def add_special_tokens(self, special_tokens_dict):\n        \"\"\"\n        Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them\n        to class attributes. If special tokens are NOT in the vocabulary, they are added\n        to it (indexed starting from the last index of the current vocabulary).\n\n        Using `add_special_tokens` will ensure your special tokens can be used in several ways:\n\n        - special tokens are carefully handled by the tokenizer (they are never split)\n        - you can easily refer to special tokens using tokenizer class attributes like `tokenizer.cls_token`. This makes it easy to develop model-agnostic training and fine-tuning scripts.\n\n        When possible, special tokens are already registered for provided pretrained models (ex: BertTokenizer cls_token is already registered to be '[CLS]' and XLM's one is also registered to be '</s>')\n\n        Args:\n            special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes:\n                [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``,\n                ``additional_special_tokens``].\n\n                Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to add a new classification token to GPT-2\n            tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n            model = GPT2Model.from_pretrained('gpt2')\n\n            special_tokens_dict = {'cls_token': '<CLS>'}\n\n            num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n\n            assert tokenizer.cls_token == '<CLS>'\n        \"\"\"\n        if not special_tokens_dict:\n            return 0\n\n        added_tokens = 0\n        for key, value in special_tokens_dict.items():\n            assert key in self.SPECIAL_TOKENS_ATTRIBUTES\n            if key == \"additional_special_tokens\":\n                assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)\n                added_tokens += self.add_tokens(value)\n            else:\n                assert isinstance(value, str)\n                added_tokens += self.add_tokens([value])\n            logger.info(\"Assigning %s to the %s key of the tokenizer\", value, key)\n            setattr(self, key, value)\n\n        return added_tokens\n\n    def tokenize(self, text: TextInput, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Take care of added tokens.\n\n            Args:\n                text (:obj:`string`): The sequence to be encoded.\n                **kwargs (:obj: `dict`): Arguments passed to the model-specific `prepare_for_tokenization` preprocessing method.\n        \"\"\"\n        all_special_tokens = self.all_special_tokens\n        text = self.prepare_for_tokenization(text, **kwargs)\n\n        # TODO: should this be in the base class?\n        def lowercase_text(t):\n            # convert non-special tokens to lowercase\n            escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]\n            pattern = r\"(\" + r\"|\".join(escaped_special_toks) + r\")|\" + r\"(.+?)\"\n            return re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), t)\n\n        if self.init_kwargs.get(\"do_lower_case\", False):\n            text = lowercase_text(text)\n\n        def split_on_token(tok, text):\n            result = []\n            split_text = text.split(tok)\n            for i, sub_text in enumerate(split_text):\n                sub_text = sub_text.rstrip()\n                if i == 0 and not sub_text:\n                    result += [tok]\n                elif i == len(split_text) - 1:\n                    if sub_text:\n                        result += [sub_text]\n                    else:\n                        pass\n                else:\n                    if sub_text:\n                        result += [sub_text]\n                    result += [tok]\n            return result\n\n        def split_on_tokens(tok_list, text):\n            if not text.strip():\n                return []\n            if not tok_list:\n                return self._tokenize(text)\n\n            tokenized_text = []\n            text_list = [text]\n            for tok in tok_list:\n                tokenized_text = []\n                for sub_text in text_list:\n                    if sub_text not in self.unique_added_tokens_encoder:\n                        tokenized_text += split_on_token(tok, sub_text)\n                    else:\n                        tokenized_text += [sub_text]\n                text_list = tokenized_text\n\n            return list(\n                itertools.chain.from_iterable(\n                    (\n                        self._tokenize(token) if token not in self.unique_added_tokens_encoder else [token]\n                        for token in tokenized_text\n                    )\n           
     )\n            )\n\n        added_tokens = self.unique_added_tokens_encoder\n        tokenized_text = split_on_tokens(added_tokens, text)\n        return tokenized_text\n\n    def _tokenize(self, text, **kwargs):\n        \"\"\" Converts a string in a sequence of tokens (string), using the tokenizer.\n            Split in words for word-based vocabulary or sub-words for sub-word-based\n            vocabularies (BPE/SentencePieces/WordPieces).\n\n            Do NOT take care of added tokens.\n        \"\"\"\n        raise NotImplementedError\n\n    def convert_tokens_to_ids(self, tokens):\n        \"\"\" Converts a token string (or a sequence of tokens) in a single integer id\n            (or a sequence of ids), using the vocabulary.\n        \"\"\"\n        if tokens is None:\n            return None\n\n        if isinstance(tokens, str):\n            return self._convert_token_to_id_with_added_voc(tokens)\n\n        ids = []\n        for token in tokens:\n            ids.append(self._convert_token_to_id_with_added_voc(token))\n        return ids\n\n    def _convert_token_to_id_with_added_voc(self, token):\n        if token is None:\n            return None\n\n        if token in self.added_tokens_encoder:\n            return self.added_tokens_encoder[token]\n        return self._convert_token_to_id(token)\n\n    def _convert_token_to_id(self, token):\n        raise NotImplementedError\n\n    def encode(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        **kwargs\n    ):\n        \"\"\"\n        Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.\n\n        Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary.\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. 
The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            **kwargs: passed to the `self.tokenize()` method\n        \"\"\"\n        encoded_inputs = self.encode_plus(\n            text,\n            text_pair=text_pair,\n            max_length=max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            return_tensors=return_tensors,\n            **kwargs,\n        )\n\n        return encoded_inputs[\"input_ids\"]\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput, EncodedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            text (:obj:`str`, :obj:`List[str]` or :obj:`List[int]` (the later only for not-fast tokenizers)):\n                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using\n                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`\n                method)\n            text_pair (:obj:`str`, :obj:`List[str]` or :obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second sequence to be encoded. 
This can be a string, a list of strings (tokenized\n                string using the `tokenize` method) or a list of integers (tokenized string ids using the\n                `convert_tokens_to_ids` method)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n                You can set it to the maximal input size of the model with `max_length = tokenizer.model_max_length`.\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_mask (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. 
If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? <../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_mask (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError.\n                This one is only available on fast tokenizers inheriting from PreTrainedTokenizerFast.\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    attention_mask: list[int] if return_attention_mask is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True``\n                    and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. 
Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. \"\n                \"In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` \"\n                \"or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        first_ids = get_input_ids(text)\n        second_ids = get_input_ids(text_pair) if text_pair is not None else None\n\n        return self.prepare_for_model(\n            first_ids,\n            pair_ids=second_ids,\n            max_length=max_length,\n            pad_to_max_length=pad_to_max_length,\n            add_special_tokens=add_special_tokens,\n            stride=stride,\n            truncation_strategy=truncation_strategy,\n            return_tensors=return_tensors,\n            return_attention_mask=return_attention_mask,\n            return_token_type_ids=return_token_type_ids,\n            return_overflowing_tokens=return_overflowing_tokens,\n            return_special_tokens_mask=return_special_tokens_mask,\n        )\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput],\n            List[TextInputPair],\n            List[PreTokenizedInput],\n            List[PreTokenizedInputPair],\n            List[EncodedInput],\n            List[EncodedInputPair],\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_masks: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_masks: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n        \"\"\"\n        Returns a dictionary containing the encoded sequence or sequence pair and additional information:\n        the mask for sequence classification and the overflowing elements if a ``max_length`` is specified.\n\n        Args:\n            batch_text_or_text_pairs (:obj:`List[str]`,  :obj:`List[Tuple[str, str]]`,\n                                      :obj:`List[List[str]]`,  :obj:`List[Tuple[List[str], List[str]]]`,\n                                      and for not-fast tokenizers, also:\n                                      :obj:`List[List[int]]`,  :obj:`List[Tuple[List[int], List[int]]]`):\n                Batch of sequences or pair of sequences to be encoded.\n                This can be a list of 
string/string-sequences/int-sequences or a list of pair of\n                string/string-sequences/int-sequence (see details in encode_plus)\n            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):\n                If set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):\n                If set to a number, will limit the total sequence returned so that it has a maximum length.\n                If there are overflowing tokens, those will be added to the returned dictionary\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. The value of this argument defines the number of additional tokens.\n            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):\n                String selected in the following options:\n\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                  starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the\n                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`\n                which can be set to the following strings:\n\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            is_pretokenized (:obj:`bool`, defaults to :obj:`False`):\n                Set to True to indicate the input is already tokenized\n            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):\n                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`\n                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.\n            return_token_type_ids (:obj:`bool`, `optional`, defaults to :obj:`None`):\n                Whether to return token type IDs. If left to the default, will return the token type IDs according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are token type IDs? <../glossary.html#token-type-ids>`_\n            return_attention_masks (:obj:`bool`, `optional`, defaults to :obj:`none`):\n                Whether to return the attention mask. If left to the default, will return the attention mask according\n                to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.\n\n                `What are attention masks? 
<../glossary.html#attention-mask>`__\n            return_overflowing_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return overflowing token information (default False).\n            return_special_tokens_masks (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return special tokens mask information (default False).\n            return_offsets_mapping (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True to return (char_start, char_end) for each token (default False).\n                If using Python's tokenizer, this method will raise NotImplementedError. This one is only available on\n                Rust-based tokenizers inheriting from PreTrainedTokenizerFast.\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n            **kwargs: passed to the `self.tokenize()` method\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[List[int]],\n                    token_type_ids: list[List[int]] if return_token_type_ids is True (default)\n                    attention_mask: list[List[int]] if return_attention_mask is True (default)\n                    overflowing_tokens: list[List[int]] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: List[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[List[int]] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                }\n\n            With the fields:\n\n            - ``input_ids``: list of token ids to be fed to a model\n            - ``token_type_ids``: list of token type ids to be fed to a model\n            - ``attention_mask``: list of indices specifying which tokens should be attended to by the model\n            - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n            - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n            - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n              tokens and 1 specifying sequence tokens.\n        \"\"\"\n\n        def get_input_ids(text):\n            if isinstance(text, str):\n                tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)\n                return self.convert_tokens_to_ids(tokens)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):\n                return self.convert_tokens_to_ids(text)\n            elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], int):\n                return text\n            else:\n                raise ValueError(\n                    \"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.\"\n                )\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\n                \"Unable to set proper padding strategy as the tokenizer does not have a padding token. 
In this case please set the `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via the function add_special_tokens if you want to use a padding strategy\"\n            )\n\n        if return_offsets_mapping:\n            raise NotImplementedError(\n                \"return_offset_mapping is not available when using Python tokenizers.\"\n                \"To use this feature, change your tokenizer to one deriving from \"\n                \"transformers1.PreTrainedTokenizerFast.\"\n                \"More information on available tokenizers at \"\n                \"https://github.com/huggingface/transformers/pull/2674\"\n            )\n\n        input_ids = []\n        for ids_or_pair_ids in batch_text_or_text_pairs:\n            if isinstance(ids_or_pair_ids, (list, tuple)) and len(ids_or_pair_ids) == 2 and not is_pretokenized:\n                ids, pair_ids = ids_or_pair_ids\n            else:\n                ids, pair_ids = ids_or_pair_ids, None\n\n            first_ids = get_input_ids(ids)\n            second_ids = get_input_ids(pair_ids) if pair_ids is not None else None\n            input_ids.append((first_ids, second_ids))\n\n        if max_length is None and pad_to_max_length:\n\n            def total_sequence_length(input_pairs):\n                first_ids, second_ids = input_pairs\n                return len(first_ids) + (\n                    self.num_special_tokens_to_add()\n                    if second_ids is None\n                    else (len(second_ids) + self.num_special_tokens_to_add(pair=True))\n                )\n\n            max_length = max([total_sequence_length(ids) for ids in input_ids])\n\n        batch_outputs = {}\n        for first_ids, second_ids in input_ids:\n            # Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by\n            # the model. 
It adds special tokens, truncates sequences if overflowing while taking into account\n            # the special tokens and manages a window stride for overflowing tokens\n            outputs = self.prepare_for_model(\n                first_ids,\n                pair_ids=second_ids,\n                max_length=max_length,\n                pad_to_max_length=pad_to_max_length,\n                add_special_tokens=add_special_tokens,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_attention_mask=return_attention_masks,\n                return_token_type_ids=return_token_type_ids,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_masks,\n                return_lengths=return_lengths,\n                return_tensors=None,  # We will convert the whole batch to tensors at the end\n            )\n\n            for key, value in outputs.items():\n                if key not in batch_outputs:\n                    batch_outputs[key] = []\n                batch_outputs[key].append(value)\n\n        if return_tensors is not None:\n\n            self.convert_to_tensors_(batch_outputs, return_tensors)\n        return BatchEncoding(batch_outputs)\n\n    def convert_to_tensors_(self, batch_outputs: dict, return_tensors: str) -> None:\n        # Do the tensor conversion in batch\n        for key, value in batch_outputs.items():\n            if return_tensors == \"tf\" and is_tf_available():\n                try:\n                    batch_outputs[key] = tf.constant(value)\n                except ValueError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n            elif return_tensors == \"pt\" and is_torch_available():\n                try:\n                    batch_outputs[key] = torch.tensor(value)\n                except ValueError:\n                    raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)\n                except RuntimeError:\n                    if None in [item for sequence in value for item in sequence]:\n                        raise ValueError(self.NO_PAD_TOKEN_FOR_BATCH_MSG)\n                    else:\n                        raise\n\n            elif return_tensors is not None:\n                logger.warning(\n                    \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                        return_tensors\n                    )\n                )\n\n    def prepare_for_model(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        max_length: Optional[int] = None,\n        add_special_tokens: bool = True,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_lengths: bool = False,\n    ) -> BatchEncoding:\n        \"\"\" Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.\n        It adds special tokens, truncates sequences if 
overflowing while taking into account the special tokens and\n        manages a moving window (with user defined stride) for overflowing tokens\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            max_length: maximum length of the returned list. Will truncate by taking into account the special tokens.\n            add_special_tokens: if set to ``True``, the sequences will be encoded with the special tokens relative\n                to their model.\n            stride: window stride for overflowing tokens. Can be useful to remove edge effect when using sequential\n                list of inputs. The overflowing token will contains a part of the previous window of tokens.\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences)\n                - 'only_first': Only truncate the first sequence\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            pad_to_max_length: if set to True, the returned sequences will be padded according to the model's padding side and\n                padding index, up to their max length. If no max length is specified, the padding is done up to the model's max length.\n                The tokenizer padding sides are handled by the following strings:\n                - 'left': pads on the left of the sequences\n                - 'right': pads on the right of the sequences\n                Defaults to False: no padding.\n            return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant\n                or PyTorch torch.Tensor instead of a list of python integers.\n            return_token_type_ids: (optional) Set to False to avoid returning token_type_ids (default: set to model specifics).\n            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)\n            return_overflowing_tokens: (optional) Set to True to return overflowing token information (default False).\n            return_special_tokens_mask: (optional) Set to True to return special tokens mask information (default False).\n            return_lengths (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                If set the resulting dictionary will include the length of each encoded inputs\n\n        Return:\n            A Dictionary of shape::\n\n                {\n                    input_ids: list[int],\n                    token_type_ids: list[int] if return_token_type_ids is True (default)\n                    overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True\n                    num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True\n                    special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True\n                    length: int if 
return_lengths is True\n                }\n\n            With the fields:\n                - ``input_ids``: list of token ids to be fed to a model\n                - ``token_type_ids``: list of token type ids to be fed to a model\n\n                - ``overflowing_tokens``: list of overflowing tokens if a max length is specified.\n                - ``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified\n                - ``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added\n                    tokens and 1 specifying sequence tokens.\n                - ``length``: this is the length of ``input_ids``\n        \"\"\"\n        pair = bool(pair_ids is not None)\n        len_ids = len(ids)\n        len_pair_ids = len(pair_ids) if pair else 0\n\n        # Load from model defaults\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        encoded_inputs = {}\n\n        # Truncation: Handle max sequence length\n        total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)\n        if max_length and total_len > max_length:\n            ids, pair_ids, overflowing_tokens = self.truncate_sequences(\n                ids,\n                pair_ids=pair_ids,\n                num_tokens_to_remove=total_len - max_length,\n                truncation_strategy=truncation_strategy,\n                stride=stride,\n            )\n            if return_overflowing_tokens:\n                encoded_inputs[\"overflowing_tokens\"] = overflowing_tokens\n                encoded_inputs[\"num_truncated_tokens\"] = total_len - max_length\n\n        # Add special tokens\n        if add_special_tokens:\n            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)\n            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)\n        else:\n            sequence = ids + pair_ids if pair else ids\n            token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])\n\n        # Build output dictionnary\n        encoded_inputs[\"input_ids\"] = sequence\n        if return_token_type_ids:\n            encoded_inputs[\"token_type_ids\"] = token_type_ids\n        if return_special_tokens_mask:\n            if add_special_tokens:\n                encoded_inputs[\"special_tokens_mask\"] = self.get_special_tokens_mask(ids, pair_ids)\n            else:\n                encoded_inputs[\"special_tokens_mask\"] = [0] * len(sequence)\n\n        # Check lengths\n        assert max_length is None or len(encoded_inputs[\"input_ids\"]) <= max_length\n        if max_length is None and len(encoded_inputs[\"input_ids\"]) > self.model_max_length:\n            logger.warning(\n                \"Token indices sequence length is longer than the specified maximum sequence length \"\n                \"for this model ({} > {}). 
Running this sequence through the model will result in \"\n                \"indexing errors\".format(len(ids), self.model_max_length)\n            )\n\n        # Padding\n        needs_to_be_padded = pad_to_max_length and (\n            max_length\n            and len(encoded_inputs[\"input_ids\"]) < max_length\n            or max_length is None\n            and len(encoded_inputs[\"input_ids\"]) < self.model_max_length\n            and self.model_max_length <= LARGE_INTEGER\n        )\n\n        if pad_to_max_length and max_length is None and self.model_max_length > LARGE_INTEGER:\n            logger.warning(\n                \"Sequence can't be padded as no maximum length is specified and the model maximum length is too high.\"\n            )\n\n        if needs_to_be_padded:\n            difference = (max_length if max_length is not None else self.model_max_length) - len(\n                encoded_inputs[\"input_ids\"]\n            )\n            if self.padding_side == \"right\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"]) + [0] * difference\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = (\n                        encoded_inputs[\"token_type_ids\"] + [self.pad_token_type_id] * difference\n                    )\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = encoded_inputs[\"special_tokens_mask\"] + [1] * difference\n                encoded_inputs[\"input_ids\"] = encoded_inputs[\"input_ids\"] + [self.pad_token_id] * difference\n            elif self.padding_side == \"left\":\n                if return_attention_mask:\n                    encoded_inputs[\"attention_mask\"] = [0] * difference + [1] * len(encoded_inputs[\"input_ids\"])\n                if return_token_type_ids:\n                    encoded_inputs[\"token_type_ids\"] = [self.pad_token_type_id] * difference + encoded_inputs[\n                        \"token_type_ids\"\n                    ]\n                if return_special_tokens_mask:\n                    encoded_inputs[\"special_tokens_mask\"] = [1] * difference + encoded_inputs[\"special_tokens_mask\"]\n                encoded_inputs[\"input_ids\"] = [self.pad_token_id] * difference + encoded_inputs[\"input_ids\"]\n            else:\n                raise ValueError(\"Invalid padding strategy:\" + str(self.padding_side))\n        else:\n            if return_attention_mask:\n                encoded_inputs[\"attention_mask\"] = [1] * len(encoded_inputs[\"input_ids\"])\n\n        if return_lengths:\n            encoded_inputs[\"length\"] = len(encoded_inputs[\"input_ids\"])\n\n        # Prepare model inputs as tensors if asked\n        if return_tensors == \"tf\" and is_tf_available():\n            encoded_inputs[\"input_ids\"] = tf.constant([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = tf.constant([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = tf.constant([encoded_inputs[\"attention_mask\"]])\n\n        elif return_tensors == \"pt\" and is_torch_available():\n            encoded_inputs[\"input_ids\"] = torch.tensor([encoded_inputs[\"input_ids\"]])\n\n            if \"token_type_ids\" in encoded_inputs:\n                encoded_inputs[\"token_type_ids\"] = 
torch.tensor([encoded_inputs[\"token_type_ids\"]])\n\n            if \"attention_mask\" in encoded_inputs:\n                encoded_inputs[\"attention_mask\"] = torch.tensor([encoded_inputs[\"attention_mask\"]])\n        elif return_tensors is not None:\n            logger.warning(\n                \"Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.\".format(\n                    return_tensors\n                )\n            )\n\n        return BatchEncoding(encoded_inputs)\n\n    def prepare_for_tokenization(self, text: str, **kwargs) -> str:\n        \"\"\" Performs any necessary transformations before tokenization \"\"\"\n        return text\n\n    def truncate_sequences(\n        self,\n        ids: List[int],\n        pair_ids: Optional[List[int]] = None,\n        num_tokens_to_remove: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        stride: int = 0,\n    ) -> Tuple[List[int], List[int], List[int]]:\n        \"\"\" Truncates a sequence pair in place to the maximum length.\n\n        Args:\n            ids: list of tokenized input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the\n                `tokenize` and `convert_tokens_to_ids` methods.\n            num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):\n                number of tokens to remove using the truncation strategy\n            truncation_strategy: string selected in the following options:\n                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length\n                    starting from the longest one at each token (when there is a pair of input sequences).\n                    Overflowing tokens only contains overflow from the first sequence.\n                - 'only_first': Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.\n                - 'only_second': Only truncate the second sequence\n                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)\n            stride (:obj:`int`, `optional`, defaults to ``0``):\n                If set to a number along with max_length, the overflowing tokens returned will contain some tokens\n                from the main sequence returned. 
The value of this argument defines the number of additional tokens.\n        \"\"\"\n        if num_tokens_to_remove <= 0:\n            return ids, pair_ids, []\n\n        if truncation_strategy == \"longest_first\":\n            overflowing_tokens = []\n            for _ in range(num_tokens_to_remove):\n                if pair_ids is None or len(ids) > len(pair_ids):\n                    overflowing_tokens = [ids[-1]] + overflowing_tokens\n                    ids = ids[:-1]\n                else:\n                    pair_ids = pair_ids[:-1]\n            window_len = min(len(ids), stride)\n            if window_len > 0:\n                overflowing_tokens = ids[-window_len:] + overflowing_tokens\n        elif truncation_strategy == \"only_first\":\n            assert len(ids) > num_tokens_to_remove\n            window_len = min(len(ids), stride + num_tokens_to_remove)\n            overflowing_tokens = ids[-window_len:]\n            ids = ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"only_second\":\n            assert pair_ids is not None and len(pair_ids) > num_tokens_to_remove\n            window_len = min(len(pair_ids), stride + num_tokens_to_remove)\n            overflowing_tokens = pair_ids[-window_len:]\n            pair_ids = pair_ids[:-num_tokens_to_remove]\n        elif truncation_strategy == \"do_not_truncate\":\n            raise ValueError(\"Input sequence are too long for max_length. Please select a truncation strategy.\")\n        else:\n            raise ValueError(\n                \"Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']\"\n            )\n        return (ids, pair_ids, overflowing_tokens)\n\n    def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:\n        if token_ids_1 is None:\n            return len(token_ids_0) * [0]\n        return [0] * len(token_ids_0) + [1] * len(token_ids_1)\n\n    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens. This implementation does not add special tokens.\n        \"\"\"\n        if token_ids_1 is None:\n            return token_ids_0\n        return token_ids_0 + token_ids_1\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0: list of ids (must not contain special tokens)\n            token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids\n                for sequence pairs\n            already_has_special_tokens: (default False) Set to True if the token list is already formated with\n                special tokens for the model\n\n        Returns:\n            A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n        return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))\n\n    def convert_ids_to_tokens(\n        self, ids: Union[int, List[int]], skip_special_tokens: bool = False\n    ) -> Union[int, List[int]]:\n        \"\"\" Converts a single index or a sequence of indices (integers) in a token \"\n            (resp.) a sequence of tokens (str), using the vocabulary and added tokens.\n\n            Args:\n                skip_special_tokens: Don't decode special tokens (self.all_special_tokens). Default: False\n        \"\"\"\n        if isinstance(ids, int):\n            if ids in self.added_tokens_decoder:\n                return self.added_tokens_decoder[ids]\n            else:\n                return self._convert_id_to_token(ids)\n        tokens = []\n        for index in ids:\n            index = int(index)\n            if skip_special_tokens and index in self.all_special_ids:\n                continue\n            if index in self.added_tokens_decoder:\n                tokens.append(self.added_tokens_decoder[index])\n            else:\n                tokens.append(self._convert_id_to_token(index))\n        return tokens\n\n    def _convert_id_to_token(self, index: int) -> str:\n        raise NotImplementedError\n\n    def convert_tokens_to_string(self, tokens: List[str]) -> str:\n        \"\"\" Converts a sequence of tokens (string) in a single string.\n            The most simple way to do it is ' '.join(self.convert_ids_to_tokens(token_ids))\n            but we often want to remove sub-word tokenization artifacts at the same time.\n        \"\"\"\n        return \" \".join(self.convert_ids_to_tokens(tokens))\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        \"\"\"\n        Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary\n        with options to remove special tokens and clean up tokenization spaces.\n        Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.\n\n        Args:\n            token_ids: list of tokenized input ids. Can be obtained using the `encode` or `encode_plus` methods.\n            skip_special_tokens: if set to True, will replace special tokens.\n            clean_up_tokenization_spaces: if set to True, will clean up the tokenization spaces.\n        \"\"\"\n        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)\n\n        # To avoid mixing byte-level and unicode for byte-level BPT\n        # we need to build string separatly for added tokens and byte-level tokens\n        # cf. 
https://github.com/huggingface/transformers/issues/1133\n        sub_texts = []\n        current_sub_text = []\n        for token in filtered_tokens:\n            if skip_special_tokens and token in self.all_special_ids:\n                continue\n            if token in self.added_tokens_encoder:\n                if current_sub_text:\n                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n                    current_sub_text = []\n                sub_texts.append(token)\n            else:\n                current_sub_text.append(token)\n        if current_sub_text:\n            sub_texts.append(self.convert_tokens_to_string(current_sub_text))\n        text = \" \".join(sub_texts)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:\n        return [self.decode(seq, **kwargs) for seq in sequences]\n\n    @staticmethod\n    def clean_up_tokenization(out_string: str) -> str:\n        \"\"\" Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.\n        \"\"\"\n        out_string = (\n            out_string.replace(\" .\", \".\")\n            .replace(\" ?\", \"?\")\n            .replace(\" !\", \"!\")\n            .replace(\" ,\", \",\")\n            .replace(\" ' \", \"'\")\n            .replace(\" n't\", \"n't\")\n            .replace(\" 'm\", \"'m\")\n            .replace(\" 's\", \"'s\")\n            .replace(\" 've\", \"'ve\")\n            .replace(\" 're\", \"'re\")\n        )\n        return out_string\n\n\nclass PreTrainedTokenizerFast(PreTrainedTokenizer):\n    \"\"\" Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).\n\n    Inherit from PreTrainedTokenizer.\n\n    Handle all the shared methods for tokenization and special tokens as well as methods\n    downloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.\n\n    This class also contain the added tokens in a unified way on top of all tokenizers so we don't\n    have to handle the specific vocabulary augmentation methods of the various underlying\n    dictionary structures (BPE, sentencepiece...).\n\n    Class attributes (overridden by derived classes):\n\n        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file\n            required by the model, and as associated values, the filename for saving the associated file (string).\n        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys\n            being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the\n            `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the\n            associated pretrained vocabulary file.\n        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained\n            models, and as associated values, the maximum length of the sequence inputs of this model, or None if the\n            model has no maximum input size.\n        - ``pretrained_init_configuration``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the\n            pretrained models, and as associated values, a dictionnary of specific arguments to pass to the\n            
``__init__``method of the tokenizer class for this pretrained model when loading the tokenizer with the\n            ``from_pretrained()`` method.\n\n    Args:\n        - ``tokenizer`` (`BaseTokenizerFast`): A Fast tokenizer from the HuggingFace tokenizer library (in low level Rust language)\n        - ``model_max_length``: (`Optional`) int: the maximum length in number of tokens for the inputs to the transformer model.\n            When the tokenizer is loaded with `from_pretrained`, this will be set to the value stored for the associated\n            model in ``max_model_input_sizes`` (see above). If no value is provided, will default to VERY_LARGE_INTEGER (`int(1e30)`).\n            no associated max_length can be found in ``max_model_input_sizes``.\n        - ``padding_side``: (`Optional`) string: the side on which the model should have padding applied.\n            Should be selected between ['right', 'left']\n        - ``model_input_names``: (`Optional`) List[string]: the list of the forward pass inputs accepted by the\n            model (\"token_type_ids\", \"attention_mask\"...).\n        - ``bos_token``: (`Optional`) string: a beginning of sentence token.\n            Will be associated to ``self.bos_token`` and ``self.bos_token_id``\n        - ``eos_token``: (`Optional`) string: an end of sentence token.\n            Will be associated to ``self.eos_token`` and ``self.eos_token_id``\n        - ``unk_token``: (`Optional`) string: an unknown token.\n            Will be associated to ``self.unk_token`` and ``self.unk_token_id``\n        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence).\n            Will be associated to ``self.sep_token`` and ``self.sep_token_id``\n        - ``pad_token``: (`Optional`) string: a padding token.\n            Will be associated to ``self.pad_token`` and ``self.pad_token_id``\n        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence\n            leveraging self-attention along the full depth of the model).\n            Will be associated to ``self.cls_token`` and ``self.cls_token_id``\n        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language\n            modeling). 
Will be associated to ``self.mask_token`` and ``self.mask_token_id``\n        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens.\n            Adding all special tokens here ensure they won't be split by the tokenization process.\n            Will be associated to ``self.additional_special_tokens`` and ``self.additional_special_tokens_ids``\n    \"\"\"\n\n    def __init__(self, tokenizer: BaseTokenizerFast, **kwargs):\n        if not isinstance(tokenizer, BaseTokenizerFast):\n            raise ValueError(\n                \"Tokenizer should be an instance of a Tokenizer \" \"provided by HuggingFace tokenizers library.\"\n            )\n        self._tokenizer: BaseTokenizerFast = tokenizer\n\n        # Initialize all the rest of the kwargs\n        super().__init__(**kwargs)\n\n    @property\n    def backend_tokenizer(self) -> BaseTokenizerFast:\n        return self._tokenizer\n\n    @property\n    def decoder(self) -> DecoderFast:\n        return self._tokenizer._tokenizer.decoder\n\n    @property\n    def is_fast(self) -> bool:\n        return True\n\n    @property\n    def vocab_size(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=False)\n\n    def __len__(self) -> int:\n        return self._tokenizer.get_vocab_size(with_added_tokens=True)\n\n    def _maybe_update_backend(self, value):\n        \"\"\" Update the backend fast tokenizer.\n            Override method from base class SpecialTokensMixin \"\"\"\n        self._tokenizer.add_special_tokens(value)\n\n    def _convert_encoding(\n        self,\n        encoding: EncodingFast,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n    ) -> Dict[str, Any]:\n        \"\"\" Convert the encoding representation (from low-level HuggingFace tokenizer output) to a python Dict.\n\n            Overflowing tokens are converted to additional examples (like batches) so the output values of\n            the dict are lists (overflows) of lists (tokens).\n\n            If return_tensors is not None, these lists of lists are converted to 2-D tensors\n            for input_ids, token_type_ids and attention_mask.\n            Output shape: (overflows, sequence length)\n        \"\"\"\n        if return_token_type_ids is None:\n            return_token_type_ids = \"token_type_ids\" in self.model_input_names\n        if return_attention_mask is None:\n            return_attention_mask = \"attention_mask\" in self.model_input_names\n\n        if return_overflowing_tokens and encoding.overflowing is not None:\n            encodings = [encoding] + encoding.overflowing\n        else:\n            encodings = [encoding]\n\n        encoding_dict = defaultdict(list)\n        for e in encodings:\n            encoding_dict[\"input_ids\"].append(e.ids)\n\n            if return_token_type_ids:\n                encoding_dict[\"token_type_ids\"].append(e.type_ids)\n            if return_attention_mask:\n                encoding_dict[\"attention_mask\"].append(e.attention_mask)\n            if return_special_tokens_mask:\n                encoding_dict[\"special_tokens_mask\"].append(e.special_tokens_mask)\n            if return_offsets_mapping:\n                encoding_dict[\"offset_mapping\"].append(e.offsets)\n\n        if return_tensors is not 
None:\n            for key, value in encoding_dict.items():\n                if return_tensors == \"tf\" and is_tf_available():\n                    encoding_dict[key] = tf.constant(value)\n                elif return_tensors == \"pt\" and is_torch_available():\n                    encoding_dict[key] = torch.tensor(value)\n                elif return_tensors is not None:\n                    logger.warning(\n                        \"Unable to convert output to tensors format {}, \"\n                        \"PyTorch or TensorFlow is not available.\".format(return_tensors)\n                    )\n\n        return encoding_dict\n\n    def _convert_token_to_id_with_added_voc(self, token: int) -> str:\n        index = self._tokenizer.token_to_id(token)\n        if index is None:\n            return self.unk_token_id\n        return index\n\n    def _convert_id_to_token(self, index: int) -> Optional[str]:\n        return self._tokenizer.id_to_token(int(index))\n\n    def get_vocab(self):\n        return self._tokenizer.get_vocab(True)\n\n    def convert_tokens_to_string(self, tokens: List[int], skip_special_tokens: bool = False) -> str:\n        return self._tokenizer.decode(tokens, skip_special_tokens)\n\n    def add_tokens(self, new_tokens: List[Union[str, AddedTokenFast]]) -> int:\n        \"\"\"\n        Add a list of new tokens to the tokenizer class. If the new tokens are not in the\n        vocabulary, they are added to it with indices starting from length of the current vocabulary.\n\n        Args:\n            new_tokens: string or list of string or AddedTokenFast. Each string is a token to add.\n            Tokens are only added if they are not already in the vocabulary. AddedTokenFast wrap a string token to let you personnalize it's behavior (Whether this token should only match against single word, whether this token should strip all potential whitespaces on the left side, Whether this token should strip all potential whitespaces on the right side...).\n            See details for AddedToken in HuggingFace tokenizers library.\n\n        Returns:\n            Number of tokens added to the vocabulary.\n\n        Examples::\n\n            # Let's see how to increase the vocabulary of Bert model and tokenizer\n            tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')\n            model = BertModel.from_pretrained('bert-base-uncased')\n\n            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])\n            print('We have added', num_added_toks, 'tokens')\n            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. 
the length of the tokenizer.\n        \"\"\"\n        if isinstance(new_tokens, str):\n            new_tokens = [new_tokens]\n        return self._tokenizer.add_tokens(new_tokens)\n\n    def add_special_tokens(self, special_tokens_dict: dict) -> int:\n        # Map special tokens to class attributes (self.pad_token...)\n        super().add_special_tokens(special_tokens_dict)\n\n        # If the backend tokenizer the only specificities of special tokens are that\n        #    - they will never be processed by the model, and\n        #    - they will be removed while decoding.\n        # But they are not mapped to special attributes in the backend so we can just\n        # send a list.\n        tokens = []\n        for token in special_tokens_dict.values():\n            if isinstance(token, list):\n                tokens += token\n            else:\n                tokens += [token]\n        num_added_tokens = self._tokenizer.add_special_tokens(tokens)\n\n        return num_added_tokens\n\n    def num_special_tokens_to_add(self, pair: bool = False) -> int:\n        return self._tokenizer.num_special_tokens_to_add(pair)\n\n    def tokenize(\n        self, text: TextInput, pair: Optional[TextInput] = None, add_special_tokens: bool = False\n    ) -> List[str]:\n        return self._tokenizer.encode(text, pair, add_special_tokens).tokens\n\n    def batch_encode_plus(\n        self,\n        batch_text_or_text_pairs: Union[\n            List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]\n        ],\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        pad_to_max_length: bool = False,\n        is_pretokenized: bool = False,\n        return_tensors: Optional[str] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        return_lengths: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        if not isinstance(batch_text_or_text_pairs, list):\n            raise ValueError(\n                \"batch_text_or_text_pairs has to be a list (got {})\".format(type(batch_text_or_text_pairs))\n            )\n\n        # Needed if we have to return a tensor\n        pad_to_max_length = pad_to_max_length or (return_tensors is not None and len(batch_text_or_text_pairs) > 1)\n\n        # Throw an error if we can pad because there is no padding token\n        if pad_to_max_length and self.pad_token_id is None:\n            raise ValueError(\"Unable to set proper padding strategy as the tokenizer does not have a padding token\")\n\n        # Set the truncation and padding strategy and restore the initial configuration\n        with truncate_and_pad(\n            tokenizer=self._tokenizer,\n            max_length=max_length,\n            stride=stride,\n            strategy=truncation_strategy,\n            pad_to_max_length=pad_to_max_length,\n            padding_side=self.padding_side,\n            pad_token_id=self.pad_token_id,\n            pad_token_type_id=self.pad_token_type_id,\n            pad_token=self._pad_token,\n        ):\n\n            # Check for the pretokenized path\n            if is_pretokenized:\n                encodings = []\n\n                # Iterate over each sample (we don't know yet if they are pairs or 
simple input\n                for i, sample in enumerate(batch_text_or_text_pairs):\n\n                    if not isinstance(sample, (list, tuple)):\n                        raise TypeError(\n                            \"batch_encode_plus(..., is_pretokenized=True) requires batch_text_or_text_pairs \"\n                            \"to be either List[List[str]] or List[Tuple[List[str], List[str]]] but sample at \"\n                            \"index {} is of type {}\".format(i, type(sample))\n                        )\n\n                    # Test if we have a pair of sentences by checking the depth of nesting\n                    is_pair = bool(len(sample) > 0 and isinstance(sample[0], (list, tuple)))\n\n                    # Take care of the first sequence - we multi-thread over the words\n                    encodings_text = EncodingFast.merge(\n                        self._tokenizer.encode_batch(sample[0] if is_pair else sample, add_special_tokens=False),\n                        growing_offsets=True,\n                    )\n\n                    # Take care of the second sequence if we have a pair\n                    if is_pair:\n                        encodings_pair = EncodingFast.merge(\n                            self._tokenizer.encode_batch([(\"\", s) for s in sample[1]], add_special_tokens=False),\n                            growing_offsets=True,\n                        )\n                    else:\n                        encodings_pair = None\n\n                    # Post-process - truncate/pad and add special tokens\n                    encoding = self._tokenizer.post_process(encodings_text, encodings_pair, add_special_tokens)\n                    encodings.append(encoding)\n\n            # Classical path with strings input\n            else:\n                # Avoid thread overhead if only one example.\n                if len(batch_text_or_text_pairs) == 1:\n                    if isinstance(batch_text_or_text_pairs[0], (tuple, list)):\n                        encodings = self._tokenizer.encode(\n                            *batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    else:\n                        encodings = self._tokenizer.encode(\n                            batch_text_or_text_pairs[0], add_special_tokens=add_special_tokens\n                        )\n                    encodings = [encodings]\n                else:\n                    encodings = self._tokenizer.encode_batch(\n                        batch_text_or_text_pairs, add_special_tokens=add_special_tokens\n                    )\n\n        # Convert encoding to dict\n        # `Tokens` has type: List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]]\n        # with nested dimensions corresponding to batch, overflows, sequence length\n        tokens = [\n            self._convert_encoding(\n                encoding=encoding,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n            )\n            for encoding in encodings\n        ]\n\n        # Sanitize the output to have dict[list] from list[dict]\n        sanitized = {}\n        for key in tokens[0].keys():\n            # To List[List[List[int]]] of 
shape (batch, overflows, sequence length)\n            stack = [e for item in tokens for e in item[key]]\n            if return_tensors == \"tf\":\n                stack = tf.stack(stack, axis=0)\n            elif return_tensors == \"pt\":\n                stack = torch.stack(stack, dim=0)\n            # elif not return_tensors and len(stack) == 1:\n            #     stack = stack[0]\n\n            sanitized[key] = stack\n\n        # If returning overflowing tokens, we need to return a mapping\n        # from the batch idx to the original sample\n        if return_overflowing_tokens:\n            overflow_to_sample_mapping = flatten([[i] * len(enc[\"input_ids\"]) for i, enc in enumerate(tokens)])\n            sanitized[\"overflow_to_sample_mapping\"] = overflow_to_sample_mapping\n\n        return BatchEncoding(sanitized, encodings)\n\n    def encode_plus(\n        self,\n        text: Union[TextInput, PreTokenizedInput],\n        text_pair: Optional[Union[TextInput, PreTokenizedInput]] = None,\n        add_special_tokens: bool = True,\n        max_length: Optional[int] = None,\n        pad_to_max_length: bool = False,\n        stride: int = 0,\n        truncation_strategy: str = \"longest_first\",\n        is_pretokenized: bool = False,\n        return_tensors: Optional[bool] = None,\n        return_token_type_ids: Optional[bool] = None,\n        return_attention_mask: Optional[bool] = None,\n        return_overflowing_tokens: bool = False,\n        return_special_tokens_mask: bool = False,\n        return_offsets_mapping: bool = False,\n        **kwargs\n    ) -> BatchEncoding:\n\n        # Check for pretokenized path (ie [token1, token2, ..., tokenN] -> [id1, id2, ..., idN]\n        if is_pretokenized:\n            if isinstance(text, list) and len(text) > 0:\n\n                # Encode through encode_batch with sequence of only one word which will be merged after hand\n                encoding = self._tokenizer.encode_batch(text, add_special_tokens=False)\n                encoding = EncodingFast.merge(encoding, growing_offsets=True)\n\n                # Let's do the same for pairs if provided\n                if isinstance(text_pair, list):\n                    # We prepend empty string before each word so that encoding is aware content is a pair\n                    encoding_pair = self._tokenizer.encode_batch(\n                        [(\"\", p) for p in text_pair], add_special_tokens=False\n                    )\n                    encoding_pair = EncodingFast.merge(encoding_pair, growing_offsets=True)\n                elif text_pair is None:\n                    encoding_pair = None\n                else:\n                    raise TypeError(\n                        \"encode_plus(..., is_pretokenized=True) requires text and text_pair to be List[str] \"\n                        \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                    )\n\n                # Post process and if asked to do so, insert special tokens where needed\n                encoding = self._tokenizer.post_process(encoding, encoding_pair, add_special_tokens)\n\n                batched_output = BatchEncoding(\n                    self._convert_encoding(\n                        encoding,\n                        return_tensors=return_tensors,\n                        return_token_type_ids=return_token_type_ids,\n                        return_attention_mask=return_attention_mask,\n                        return_overflowing_tokens=return_overflowing_tokens,\n                       
 return_special_tokens_mask=return_special_tokens_mask,\n                        return_offsets_mapping=return_offsets_mapping,\n                    ),\n                    encoding,\n                )\n            else:\n                raise TypeError(\n                    \"encode_plus(..., is_pretokenized=True) requires text to be List[str] \"\n                    \"but got (text={}, text_pair={})\".format(type(text), type(text_pair))\n                )\n        else:\n            batched_input = [(text, text_pair)] if text_pair else [text]\n            batched_output = self.batch_encode_plus(\n                batched_input,\n                add_special_tokens=add_special_tokens,\n                max_length=max_length,\n                stride=stride,\n                truncation_strategy=truncation_strategy,\n                return_tensors=return_tensors,\n                return_token_type_ids=return_token_type_ids,\n                return_attention_mask=return_attention_mask,\n                return_overflowing_tokens=return_overflowing_tokens,\n                return_special_tokens_mask=return_special_tokens_mask,\n                return_offsets_mapping=return_offsets_mapping,\n                pad_to_max_length=pad_to_max_length,\n                **kwargs,\n            )\n\n        # Return tensor is None, then we can remove the leading batch axis\n        if not return_tensors:\n            batched_output = BatchEncoding(\n                {\n                    key: value[0] if len(value) > 0 and isinstance(value[0], list) else value\n                    for key, value in batched_output.items()\n                },\n                batched_output.encodings,\n            )\n\n        return batched_output\n\n    def decode(\n        self, token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True\n    ) -> str:\n        text = self._tokenizer.decode(token_ids, skip_special_tokens)\n\n        if clean_up_tokenization_spaces:\n            clean_text = self.clean_up_tokenization(text)\n            return clean_text\n        else:\n            return text\n\n    def save_vocabulary(self, save_directory: str) -> Tuple[str]:\n        if os.path.isdir(save_directory):\n            files = self._tokenizer.save(save_directory)\n        else:\n            folder, file = os.path.split(os.path.abspath(save_directory))\n            files = self._tokenizer.save(folder, name=file)\n\n        return tuple(files)\n\n\ndef trim_batch(\n    input_ids, pad_token_id, attention_mask=None,\n):\n    \"\"\"Remove columns that are populated exclusively by pad_token_id\"\"\"\n    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)\n    if attention_mask is None:\n        return input_ids[:, keep_column_mask]\n    else:\n        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_xlm.py",
    "content": "# coding=utf-8\n# Copyright 2019 The Open AI Team Authors and The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Tokenization classes for XLM.\"\"\"\n\n\nimport json\nimport logging\nimport os\nimport re\nimport sys\nimport unicodedata\nfrom typing import List, Optional\n\nimport sacremoses as sm\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\n    \"vocab_file\": \"vocab.json\",\n    \"merges_file\": \"merges.txt\",\n}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-vocab.json\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-vocab.json\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-vocab.json\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-vocab.json\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-vocab.json\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-vocab.json\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-enfr-1024-vocab.json\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-clm-ende-1024-vocab.json\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-vocab.json\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-vocab.json\",\n    },\n    \"merges_file\": {\n        \"xlm-mlm-en-2048\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-en-2048-merges.txt\",\n        \"xlm-mlm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-mlm-enro-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enro-1024-merges.txt\",\n        \"xlm-mlm-tlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-tlm-xnli15-1024-merges.txt\",\n        \"xlm-mlm-xnli15-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-xnli15-1024-merges.txt\",\n        \"xlm-clm-enfr-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-enfr-1024-merges.txt\",\n        \"xlm-clm-ende-1024\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-ende-1024-merges.txt\",\n        \"xlm-mlm-17-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-17-1280-merges.txt\",\n        \"xlm-mlm-100-1280\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-mlm-100-1280-merges.txt\",\n    
},\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-mlm-en-2048\": 512,\n    \"xlm-mlm-ende-1024\": 512,\n    \"xlm-mlm-enfr-1024\": 512,\n    \"xlm-mlm-enro-1024\": 512,\n    \"xlm-mlm-tlm-xnli15-1024\": 512,\n    \"xlm-mlm-xnli15-1024\": 512,\n    \"xlm-clm-enfr-1024\": 512,\n    \"xlm-clm-ende-1024\": 512,\n    \"xlm-mlm-17-1280\": 512,\n    \"xlm-mlm-100-1280\": 512,\n}\n\nPRETRAINED_INIT_CONFIGURATION = {\n    \"xlm-mlm-en-2048\": {\"do_lowercase_and_remove_accent\": True},\n    \"xlm-mlm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-mlm-enro-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"ro\"},\n        \"lang2id\": {\"en\": 0, \"ro\": 1},\n    },\n    \"xlm-mlm-tlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-mlm-xnli15-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"bg\",\n            \"2\": \"de\",\n            \"3\": \"el\",\n            \"4\": \"en\",\n            \"5\": \"es\",\n            \"6\": \"fr\",\n            \"7\": \"hi\",\n            \"8\": \"ru\",\n            \"9\": \"sw\",\n            \"10\": \"th\",\n            \"11\": \"tr\",\n            \"12\": \"ur\",\n            \"13\": \"vi\",\n            \"14\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"bg\": 1,\n            \"de\": 2,\n            \"el\": 3,\n            \"en\": 4,\n            \"es\": 5,\n            \"fr\": 6,\n            \"hi\": 7,\n            \"ru\": 8,\n            \"sw\": 9,\n            \"th\": 10,\n            \"tr\": 11,\n            \"ur\": 12,\n            \"vi\": 13,\n            \"zh\": 14,\n        },\n    },\n    \"xlm-clm-enfr-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"en\", \"1\": \"fr\"},\n        \"lang2id\": {\"en\": 0, \"fr\": 1},\n    },\n    \"xlm-clm-ende-1024\": {\n        \"do_lowercase_and_remove_accent\": True,\n        \"id2lang\": {\"0\": \"de\", \"1\": \"en\"},\n        \"lang2id\": {\"de\": 0, \"en\": 1},\n    },\n    \"xlm-mlm-17-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"ar\",\n            \"1\": \"de\",\n            \"2\": \"en\",\n            
\"3\": \"es\",\n            \"4\": \"fr\",\n            \"5\": \"hi\",\n            \"6\": \"it\",\n            \"7\": \"ja\",\n            \"8\": \"ko\",\n            \"9\": \"nl\",\n            \"10\": \"pl\",\n            \"11\": \"pt\",\n            \"12\": \"ru\",\n            \"13\": \"sv\",\n            \"14\": \"tr\",\n            \"15\": \"vi\",\n            \"16\": \"zh\",\n        },\n        \"lang2id\": {\n            \"ar\": 0,\n            \"de\": 1,\n            \"en\": 2,\n            \"es\": 3,\n            \"fr\": 4,\n            \"hi\": 5,\n            \"it\": 6,\n            \"ja\": 7,\n            \"ko\": 8,\n            \"nl\": 9,\n            \"pl\": 10,\n            \"pt\": 11,\n            \"ru\": 12,\n            \"sv\": 13,\n            \"tr\": 14,\n            \"vi\": 15,\n            \"zh\": 16,\n        },\n    },\n    \"xlm-mlm-100-1280\": {\n        \"do_lowercase_and_remove_accent\": False,\n        \"id2lang\": {\n            \"0\": \"af\",\n            \"1\": \"als\",\n            \"2\": \"am\",\n            \"3\": \"an\",\n            \"4\": \"ang\",\n            \"5\": \"ar\",\n            \"6\": \"arz\",\n            \"7\": \"ast\",\n            \"8\": \"az\",\n            \"9\": \"bar\",\n            \"10\": \"be\",\n            \"11\": \"bg\",\n            \"12\": \"bn\",\n            \"13\": \"br\",\n            \"14\": \"bs\",\n            \"15\": \"ca\",\n            \"16\": \"ceb\",\n            \"17\": \"ckb\",\n            \"18\": \"cs\",\n            \"19\": \"cy\",\n            \"20\": \"da\",\n            \"21\": \"de\",\n            \"22\": \"el\",\n            \"23\": \"en\",\n            \"24\": \"eo\",\n            \"25\": \"es\",\n            \"26\": \"et\",\n            \"27\": \"eu\",\n            \"28\": \"fa\",\n            \"29\": \"fi\",\n            \"30\": \"fr\",\n            \"31\": \"fy\",\n            \"32\": \"ga\",\n            \"33\": \"gan\",\n            \"34\": \"gl\",\n            \"35\": \"gu\",\n            \"36\": \"he\",\n            \"37\": \"hi\",\n            \"38\": \"hr\",\n            \"39\": \"hu\",\n            \"40\": \"hy\",\n            \"41\": \"ia\",\n            \"42\": \"id\",\n            \"43\": \"is\",\n            \"44\": \"it\",\n            \"45\": \"ja\",\n            \"46\": \"jv\",\n            \"47\": \"ka\",\n            \"48\": \"kk\",\n            \"49\": \"kn\",\n            \"50\": \"ko\",\n            \"51\": \"ku\",\n            \"52\": \"la\",\n            \"53\": \"lb\",\n            \"54\": \"lt\",\n            \"55\": \"lv\",\n            \"56\": \"mk\",\n            \"57\": \"ml\",\n            \"58\": \"mn\",\n            \"59\": \"mr\",\n            \"60\": \"ms\",\n            \"61\": \"my\",\n            \"62\": \"nds\",\n            \"63\": \"ne\",\n            \"64\": \"nl\",\n            \"65\": \"nn\",\n            \"66\": \"no\",\n            \"67\": \"oc\",\n            \"68\": \"pl\",\n            \"69\": \"pt\",\n            \"70\": \"ro\",\n            \"71\": \"ru\",\n            \"72\": \"scn\",\n            \"73\": \"sco\",\n            \"74\": \"sh\",\n            \"75\": \"si\",\n            \"76\": \"simple\",\n            \"77\": \"sk\",\n            \"78\": \"sl\",\n            \"79\": \"sq\",\n            \"80\": \"sr\",\n            \"81\": \"sv\",\n            \"82\": \"sw\",\n            \"83\": \"ta\",\n            \"84\": \"te\",\n            \"85\": \"th\",\n            \"86\": \"tl\",\n            \"87\": \"tr\",\n            \"88\": \"tt\",\n      
      \"89\": \"uk\",\n            \"90\": \"ur\",\n            \"91\": \"uz\",\n            \"92\": \"vi\",\n            \"93\": \"war\",\n            \"94\": \"wuu\",\n            \"95\": \"yi\",\n            \"96\": \"zh\",\n            \"97\": \"zh_classical\",\n            \"98\": \"zh_min_nan\",\n            \"99\": \"zh_yue\",\n        },\n        \"lang2id\": {\n            \"af\": 0,\n            \"als\": 1,\n            \"am\": 2,\n            \"an\": 3,\n            \"ang\": 4,\n            \"ar\": 5,\n            \"arz\": 6,\n            \"ast\": 7,\n            \"az\": 8,\n            \"bar\": 9,\n            \"be\": 10,\n            \"bg\": 11,\n            \"bn\": 12,\n            \"br\": 13,\n            \"bs\": 14,\n            \"ca\": 15,\n            \"ceb\": 16,\n            \"ckb\": 17,\n            \"cs\": 18,\n            \"cy\": 19,\n            \"da\": 20,\n            \"de\": 21,\n            \"el\": 22,\n            \"en\": 23,\n            \"eo\": 24,\n            \"es\": 25,\n            \"et\": 26,\n            \"eu\": 27,\n            \"fa\": 28,\n            \"fi\": 29,\n            \"fr\": 30,\n            \"fy\": 31,\n            \"ga\": 32,\n            \"gan\": 33,\n            \"gl\": 34,\n            \"gu\": 35,\n            \"he\": 36,\n            \"hi\": 37,\n            \"hr\": 38,\n            \"hu\": 39,\n            \"hy\": 40,\n            \"ia\": 41,\n            \"id\": 42,\n            \"is\": 43,\n            \"it\": 44,\n            \"ja\": 45,\n            \"jv\": 46,\n            \"ka\": 47,\n            \"kk\": 48,\n            \"kn\": 49,\n            \"ko\": 50,\n            \"ku\": 51,\n            \"la\": 52,\n            \"lb\": 53,\n            \"lt\": 54,\n            \"lv\": 55,\n            \"mk\": 56,\n            \"ml\": 57,\n            \"mn\": 58,\n            \"mr\": 59,\n            \"ms\": 60,\n            \"my\": 61,\n            \"nds\": 62,\n            \"ne\": 63,\n            \"nl\": 64,\n            \"nn\": 65,\n            \"no\": 66,\n            \"oc\": 67,\n            \"pl\": 68,\n            \"pt\": 69,\n            \"ro\": 70,\n            \"ru\": 71,\n            \"scn\": 72,\n            \"sco\": 73,\n            \"sh\": 74,\n            \"si\": 75,\n            \"simple\": 76,\n            \"sk\": 77,\n            \"sl\": 78,\n            \"sq\": 79,\n            \"sr\": 80,\n            \"sv\": 81,\n            \"sw\": 82,\n            \"ta\": 83,\n            \"te\": 84,\n            \"th\": 85,\n            \"tl\": 86,\n            \"tr\": 87,\n            \"tt\": 88,\n            \"uk\": 89,\n            \"ur\": 90,\n            \"uz\": 91,\n            \"vi\": 92,\n            \"war\": 93,\n            \"wuu\": 94,\n            \"yi\": 95,\n            \"zh\": 96,\n            \"zh_classical\": 97,\n            \"zh_min_nan\": 98,\n            \"zh_yue\": 99,\n        },\n    },\n}\n\n\ndef get_pairs(word):\n    \"\"\"\n    Return set of symbol pairs in a word.\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\n\ndef lowercase_and_remove_accent(text):\n    \"\"\"\n    Lowercase and strips accents from a piece of text based on\n    https://github.com/facebookresearch/XLM/blob/master/tools/lowercase_and_remove_accent.py\n    \"\"\"\n    text = \" \".join(text)\n    text = text.lower()\n    text = 
unicodedata.normalize(\"NFD\", text)\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat == \"Mn\":\n            continue\n        output.append(char)\n    return \"\".join(output).lower().split(\" \")\n\n\ndef replace_unicode_punct(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/replace-unicode-punctuation.perl\n    \"\"\"\n    text = text.replace(\"，\", \",\")\n    text = re.sub(r\"。\\s*\", \". \", text)\n    text = text.replace(\"、\", \",\")\n    text = text.replace(\"”\", '\"')\n    text = text.replace(\"“\", '\"')\n    text = text.replace(\"∶\", \":\")\n    text = text.replace(\"：\", \":\")\n    text = text.replace(\"？\", \"?\")\n    text = text.replace(\"《\", '\"')\n    text = text.replace(\"》\", '\"')\n    text = text.replace(\"）\", \")\")\n    text = text.replace(\"！\", \"!\")\n    text = text.replace(\"（\", \"(\")\n    text = text.replace(\"；\", \";\")\n    text = text.replace(\"１\", \"1\")\n    text = text.replace(\"」\", '\"')\n    text = text.replace(\"「\", '\"')\n    text = text.replace(\"０\", \"0\")\n    text = text.replace(\"３\", \"3\")\n    text = text.replace(\"２\", \"2\")\n    text = text.replace(\"５\", \"5\")\n    text = text.replace(\"６\", \"6\")\n    text = text.replace(\"９\", \"9\")\n    text = text.replace(\"７\", \"7\")\n    text = text.replace(\"８\", \"8\")\n    text = text.replace(\"４\", \"4\")\n    text = re.sub(r\"．\\s*\", \". \", text)\n    text = text.replace(\"～\", \"~\")\n    text = text.replace(\"’\", \"'\")\n    text = text.replace(\"…\", \"...\")\n    text = text.replace(\"━\", \"-\")\n    text = text.replace(\"〈\", \"<\")\n    text = text.replace(\"〉\", \">\")\n    text = text.replace(\"【\", \"[\")\n    text = text.replace(\"】\", \"]\")\n    text = text.replace(\"％\", \"%\")\n    return text\n\n\ndef remove_non_printing_char(text):\n    \"\"\"\n    Port of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/remove-non-printing-char.perl\n    \"\"\"\n    output = []\n    for char in text:\n        cat = unicodedata.category(char)\n        if cat.startswith(\"C\"):\n            continue\n        output.append(char)\n    return \"\".join(output)\n\n\ndef romanian_preprocessing(text):\n    \"\"\"Sennrich's WMT16 scripts for Romanian preprocessing, used by model `xlm-mlm-enro-1024`\"\"\"\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/normalise-romanian.py\n    text = text.replace(\"\\u015e\", \"\\u0218\").replace(\"\\u015f\", \"\\u0219\")\n    text = text.replace(\"\\u0162\", \"\\u021a\").replace(\"\\u0163\", \"\\u021b\")\n    # https://github.com/rsennrich/wmt16-scripts/blob/master/preprocess/remove-diacritics.py\n    text = text.replace(\"\\u0218\", \"S\").replace(\"\\u0219\", \"s\")  # s-comma\n    text = text.replace(\"\\u021a\", \"T\").replace(\"\\u021b\", \"t\")  # t-comma\n    text = text.replace(\"\\u0102\", \"A\").replace(\"\\u0103\", \"a\")\n    text = text.replace(\"\\u00C2\", \"A\").replace(\"\\u00E2\", \"a\")\n    text = text.replace(\"\\u00CE\", \"I\").replace(\"\\u00EE\", \"i\")\n    return text\n\n\nclass XLMTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    BPE tokenizer for XLM\n\n    - Moses preprocessing & tokenization for most supported languages\n    - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP)\n    - (optionally) lower case & normalize all inputs text\n    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add 
additional symbols \\\n      (ex: \"__classify__\") to a vocabulary\n    - `lang2id` attribute maps the languages supported by the model with their ids if provided (automatically set for pretrained vocabularies)\n    - `id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies)\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            Vocabulary file.\n        merges_file (:obj:`string`):\n            Merges file.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<special1>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. 
This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<special0>\",\"<special1>\",\"<special2>\",\"<special3>\",\"<special4>\",\"<special5>\",\"<special6>\",\"<special7>\",\"<special8>\",\"<special9>\"]`):\n            List of additional special tokens.\n        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping languages string identifiers to their IDs.\n        id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):\n            Dictionary mapping language IDs to their string identifiers.\n        do_lowercase_and_remove_accent (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase and remove accents when tokenizing.\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n\n    def __init__(\n        self,\n        vocab_file,\n        merges_file,\n        unk_token=\"<unk>\",\n        bos_token=\"<s>\",\n        sep_token=\"</s>\",\n        pad_token=\"<pad>\",\n        cls_token=\"</s>\",\n        mask_token=\"<special1>\",\n        additional_special_tokens=[\n            \"<special0>\",\n            \"<special1>\",\n            \"<special2>\",\n            \"<special3>\",\n            \"<special4>\",\n            \"<special5>\",\n            \"<special6>\",\n            \"<special7>\",\n            \"<special8>\",\n            \"<special9>\",\n        ],\n        lang2id=None,\n        id2lang=None,\n        do_lowercase_and_remove_accent=True,\n        **kwargs\n    ):\n        super().__init__(\n            unk_token=unk_token,\n            bos_token=bos_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        # cache of sm.MosesPunctNormalizer instance\n        self.cache_moses_punct_normalizer = dict()\n        # cache of sm.MosesTokenizer instance\n        self.cache_moses_tokenizer = dict()\n        self.lang_with_custom_tokenizer = set([\"zh\", \"th\", \"ja\"])\n        # True for current supported model (v1.2.0), False for XLM-17 & 100\n        self.do_lowercase_and_remove_accent = do_lowercase_and_remove_accent\n        self.lang2id = lang2id\n        self.id2lang = id2lang\n        if lang2id is not None and id2lang is not None:\n            assert len(lang2id) == len(id2lang)\n\n        self.ja_word_tokenizer = None\n        self.zh_word_tokenizer = None\n\n        with open(vocab_file, encoding=\"utf-8\") as vocab_handle:\n            self.encoder = json.load(vocab_handle)\n        self.decoder = {v: k for k, v in self.encoder.items()}\n        with open(merges_file, encoding=\"utf-8\") as merges_handle:\n            merges = merges_handle.read().split(\"\\n\")[:-1]\n        merges = [tuple(merge.split()[:2]) for merge in merges]\n        self.bpe_ranks = dict(zip(merges, range(len(merges))))\n        self.cache = {}\n\n    def moses_punct_norm(self, text, lang):\n        if lang not in self.cache_moses_punct_normalizer:\n            punct_normalizer = sm.MosesPunctNormalizer(lang=lang)\n            self.cache_moses_punct_normalizer[lang] = punct_normalizer\n        else:\n            punct_normalizer 
= self.cache_moses_punct_normalizer[lang]\n        return punct_normalizer.normalize(text)\n\n    def moses_tokenize(self, text, lang):\n        if lang not in self.cache_moses_tokenizer:\n            moses_tokenizer = sm.MosesTokenizer(lang=lang)\n            self.cache_moses_tokenizer[lang] = moses_tokenizer\n        else:\n            moses_tokenizer = self.cache_moses_tokenizer[lang]\n        return moses_tokenizer.tokenize(text, return_str=False, escape=False)\n\n    def moses_pipeline(self, text, lang):\n        text = replace_unicode_punct(text)\n        text = self.moses_punct_norm(text, lang)\n        text = remove_non_printing_char(text)\n        return text\n\n    def ja_tokenize(self, text):\n        if self.ja_word_tokenizer is None:\n            try:\n                import Mykytea\n\n                self.ja_word_tokenizer = Mykytea.Mykytea(\n                    \"-model %s/local/share/kytea/model.bin\" % os.path.expanduser(\"~\")\n                )\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install KyTea (https://github.com/neubig/kytea) and it's python wrapper (https://github.com/chezou/Mykytea-python) with the following steps\"\n                )\n                logger.error(\"1. git clone git@github.com:neubig/kytea.git && cd kytea\")\n                logger.error(\"2. autoreconf -i\")\n                logger.error(\"3. ./configure --prefix=$HOME/local\")\n                logger.error(\"4. make && make install\")\n                logger.error(\"5. pip install kytea\")\n                raise\n        return list(self.ja_word_tokenizer.getWS(text))\n\n    @property\n    def vocab_size(self):\n        return len(self.encoder)\n\n    def get_vocab(self):\n        return dict(self.encoder, **self.added_tokens_encoder)\n\n    def bpe(self, token):\n        word = tuple(token[:-1]) + (token[-1] + \"</w>\",)\n        if token in self.cache:\n            return self.cache[token]\n        pairs = get_pairs(word)\n\n        if not pairs:\n            return token + \"</w>\"\n\n        while True:\n            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float(\"inf\")))\n            if bigram not in self.bpe_ranks:\n                break\n            first, second = bigram\n            new_word = []\n            i = 0\n            while i < len(word):\n                try:\n                    j = word.index(first, i)\n                except ValueError:\n                    new_word.extend(word[i:])\n                    break\n                else:\n                    new_word.extend(word[i:j])\n                    i = j\n\n                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n                    new_word.append(first + second)\n                    i += 2\n                else:\n                    new_word.append(word[i])\n                    i += 1\n            new_word = tuple(new_word)\n            word = new_word\n            if len(word) == 1:\n                break\n            else:\n                pairs = get_pairs(word)\n        word = \" \".join(word)\n        if word == \"\\n  </w>\":\n            word = \"\\n</w>\"\n        self.cache[token] = word\n        return word\n\n    def _tokenize(self, text, lang=\"en\", bypass_tokenizer=False):\n        \"\"\"\n        Tokenize a string given language code. For Chinese, Japanese and Thai, we use a language specific tokenizerself. 
Otherwise, we use Moses.\n\n        Details of tokenization:\n        - [sacremoses](https://github.com/alvations/sacremoses): port of Moses\n            - Install with `pip install sacremoses`\n        - [pythainlp](https://github.com/PyThaiNLP/pythainlp): Thai tokenizer\n            - Install with `pip install pythainlp`\n        - [kytea](https://github.com/chezou/Mykytea-python): Japanese tokenizer, wrapper of [KyTea](https://github.com/neubig/kytea)\n            - Install with the following steps:\n            ```\n            git clone git@github.com:neubig/kytea.git && cd kytea\n            autoreconf -i\n            ./configure --prefix=$HOME/local\n            make && make install\n            pip install kytea\n            ```\n        - [jieba](https://github.com/fxsjy/jieba): Chinese tokenizer (*)\n            - Install with `pip install jieba`\n\n        (*) The original XLM used [Stanford Segmenter](https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip).\n        However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead, and it will be deprecated.\n        Jieba is a lot faster and pip-installable. Note there is some mismatch with the Stanford Segmenter. It should be fine\n        if you fine-tune the model with Chinese supervisionself. If you want the same exact behaviour, use the original XLM\n        [preprocessing script](https://github.com/facebookresearch/XLM/tree/master/tools) to tokenize the sentence externally,\n        and set `bypass_tokenizer=True` to bypass the tokenizer.\n\n        Args:\n            - lang: ISO language code (default = 'en') (string). Languages should belong of the model supported languages. However, we don't enforce it.\n            - bypass_tokenizer: Allow users to preprocess and tokenize the sentences externally (default = False)  (bool). If True, we only apply BPE.\n\n        Returns:\n            List of tokens.\n        \"\"\"\n        if lang and self.lang2id and lang not in self.lang2id:\n            logger.error(\n                \"Supplied language code not found in lang2id mapping. Please check that your language is supported by the loaded pretrained model.\"\n            )\n        if bypass_tokenizer:\n            text = text.split()\n        elif lang not in self.lang_with_custom_tokenizer:\n            text = self.moses_pipeline(text, lang=lang)\n            # TODO: make sure we are using `xlm-mlm-enro-1024`, since XLM-100 doesn't have this step\n            if lang == \"ro\":\n                text = romanian_preprocessing(text)\n            text = self.moses_tokenize(text, lang=lang)\n        elif lang == \"th\":\n            text = self.moses_pipeline(text, lang=lang)\n            try:\n                if \"pythainlp\" not in sys.modules:\n                    from pythainlp.tokenize import word_tokenize as th_word_tokenize\n                else:\n                    th_word_tokenize = sys.modules[\"pythainlp\"].word_tokenize\n            except (AttributeError, ImportError):\n                logger.error(\n                    \"Make sure you install PyThaiNLP (https://github.com/PyThaiNLP/pythainlp) with the following steps\"\n                )\n                logger.error(\"1. 
pip install pythainlp\")\n                raise\n            text = th_word_tokenize(text)\n        elif lang == \"zh\":\n            try:\n                if \"jieba\" not in sys.modules:\n                    import jieba\n                else:\n                    jieba = sys.modules[\"jieba\"]\n            except (AttributeError, ImportError):\n                logger.error(\"Make sure you install Jieba (https://github.com/fxsjy/jieba) with the following steps\")\n                logger.error(\"1. pip install jieba\")\n                raise\n            text = \" \".join(jieba.cut(text))\n            text = self.moses_pipeline(text, lang=lang)\n            text = text.split()\n        elif lang == \"ja\":\n            text = self.moses_pipeline(text, lang=lang)\n            text = self.ja_tokenize(text)\n        else:\n            raise ValueError(\"It should not reach here\")\n\n        if self.do_lowercase_and_remove_accent and not bypass_tokenizer:\n            text = lowercase_and_remove_accent(text)\n\n        split_tokens = []\n        for token in text:\n            if token:\n                split_tokens.extend([t for t in self.bpe(token).split(\" \")])\n\n        return split_tokens\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. \"\"\"\n        return self.encoder.get(token, self.encoder.get(self.unk_token))\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.decoder.get(index, self.unk_token)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\" Converts a sequence of tokens (string) in a single string. \"\"\"\n        out_string = \"\".join(tokens).replace(\"</w>\", \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n\n        \"\"\"\n        bos = [self.bos_token_id]\n        sep = [self.sep_token_id]\n\n        if token_ids_1 is None:\n            return bos + token_ids_0 + sep\n        return bos + token_ids_0 + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0,))\n\n        if token_ids_1 is not None:\n            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLM sequence pair mask has the following format:\n\n        ::\n\n            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1\n            | first sequence    | second sequence |\n\n        if token_ids_1 is None, only returns the first portion of the mask (0s).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the vocabulary and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"merges_file\"])\n\n        with open(vocab_file, \"w\", encoding=\"utf-8\") as f:\n            f.write(json.dumps(self.encoder, ensure_ascii=False))\n\n        index = 0\n        with open(merge_file, \"w\", encoding=\"utf-8\") as writer:\n            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):\n                if index != 
token_index:\n                    logger.warning(\n                        \"Saving vocabulary to {}: BPE merge indices are not consecutive.\"\n                        \" Please check that the tokenizer is not corrupted!\".format(merge_file)\n                    )\n                    index = token_index\n                writer.write(\" \".join(bpe_tokens) + \"\\n\")\n                index += 1\n\n        return vocab_file, merge_file\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_xlm_roberta.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License\n\"\"\" Tokenization classes for XLM-RoBERTa model.\"\"\"\n\n\nimport logging\nimport os\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\nfrom .tokenization_xlnet import SPIECE_UNDERLINE\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"sentencepiece.bpe.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlm-roberta-base\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-base-sentencepiece.bpe.model\",\n        \"xlm-roberta-large\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-dutch\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-dutch-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll02-spanish\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll02-spanish-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-english\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-sentencepiece.bpe.model\",\n        \"xlm-roberta-large-finetuned-conll03-german\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-german-sentencepiece.bpe.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlm-roberta-base\": 512,\n    \"xlm-roberta-large\": 512,\n    \"xlm-roberta-large-finetuned-conll02-dutch\": 512,\n    \"xlm-roberta-large-finetuned-conll02-spanish\": 512,\n    \"xlm-roberta-large-finetuned-conll03-english\": 512,\n    \"xlm-roberta-large-finetuned-conll03-german\": 512,\n}\n\n\nclass XLMRobertaTokenizer(PreTrainedTokenizer):\n    \"\"\"\n        Adapted from RobertaTokenizer and XLNetTokenizer\n        SentencePiece based tokenizer. Peculiarities:\n\n        - requires `SentencePiece <https://github.com/google/sentencepiece>`_\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`str`):\n            Path to the vocabulary file.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. 
note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        sep_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        cls_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<s>NOTUSED\", \"</s>NOTUSED\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    model_input_names = [\"attention_mask\"]\n\n    def __init__(\n        self,\n        vocab_file,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        sep_token=\"</s>\",\n        cls_token=\"<s>\",\n        unk_token=\"<unk>\",\n        pad_token=\"<pad>\",\n        mask_token=\"<mask>\",\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            cls_token=cls_token,\n            pad_token=pad_token,\n            mask_token=mask_token,\n            **kwargs,\n        )\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(str(vocab_file))\n        self.vocab_file = vocab_file\n\n        # Original fairseq vocab and spm vocab must be \"aligned\":\n        # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9\n        # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----\n        # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' 
| '▁' | 's'   | '▁de' | '-'\n        # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'\n\n        # Mimic fairseq token-to-id alignment for the first 4 token\n        self.fairseq_tokens_to_ids = {\"<s>\": 0, \"<pad>\": 1, \"</s>\": 2, \"<unk>\": 3}\n\n        # The first \"real\" token \",\" has position 4 in the original fairseq vocab and position 3 in the spm vocab\n        self.fairseq_offset = 1\n\n        self.fairseq_tokens_to_ids[\"<mask>\"] = len(self.sp_model) + self.fairseq_offset\n        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}\n\n    def __getstate__(self):\n        state = self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLMRobertaTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        A XLM-R sequence has the following format:\n\n        - single sequence: ``<s> X </s>``\n        - pair of sequences: ``<s> A </s></s> B </s>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n\n        if token_ids_1 is None:\n            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]\n        cls = [self.cls_token_id]\n        sep = [self.sep_token_id]\n        return cls + token_ids_0 + sep + sep + token_ids_1 + sep\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is None:\n            return [1] + ([0] * len(token_ids_0)) + [1]\n        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        XLM-R does not make use of token type ids, therefore a list of zeros is returned.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of zeros.\n\n        \"\"\"\n\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n\n        if token_ids_1 is None:\n            return len(cls + token_ids_0 + sep) * [0]\n        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def _tokenize(self, text):\n        return self.sp_model.EncodeAsPieces(text)\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        if token in self.fairseq_tokens_to_ids:\n            return self.fairseq_tokens_to_ids[token]\n        spm_id = self.sp_model.PieceToId(token)\n\n        # Need to return unknown token if the SP model returned 0\n        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        if index in self.fairseq_ids_to_tokens:\n            return self.fairseq_ids_to_tokens[index]\n        return self.sp_model.IdToPiece(index - self.fairseq_offset)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/tokenization_xlnet.py",
    "content": "# coding=utf-8\n# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Tokenization classes for XLNet model.\"\"\"\n\n\nimport logging\nimport os\nimport unicodedata\nfrom shutil import copyfile\nfrom typing import List, Optional\n\nfrom .tokenization_utils import PreTrainedTokenizer\n\n\nlogger = logging.getLogger(__name__)\n\nVOCAB_FILES_NAMES = {\"vocab_file\": \"spiece.model\"}\n\nPRETRAINED_VOCAB_FILES_MAP = {\n    \"vocab_file\": {\n        \"xlnet-base-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-base-cased-spiece.model\",\n        \"xlnet-large-cased\": \"https://s3.amazonaws.com/models.huggingface.co/bert/xlnet-large-cased-spiece.model\",\n    }\n}\n\nPRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {\n    \"xlnet-base-cased\": None,\n    \"xlnet-large-cased\": None,\n}\n\nSPIECE_UNDERLINE = \"▁\"\n\n# Segments (not really needed)\nSEG_ID_A = 0\nSEG_ID_B = 1\nSEG_ID_CLS = 2\nSEG_ID_SEP = 3\nSEG_ID_PAD = 4\n\n\nclass XLNetTokenizer(PreTrainedTokenizer):\n    \"\"\"\n    Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__\n\n    This tokenizer inherits from :class:`~transformers1.PreTrainedTokenizer` which contains most of the methods. Users\n    should refer to the superclass for more information regarding methods.\n\n    Args:\n        vocab_file (:obj:`string`):\n            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that\n            contains the vocabulary necessary to instantiate a tokenizer.\n        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to lowercase the input when tokenizing.\n        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):\n            Whether to strip the text when tokenizing (removing excess spaces before and after the string).\n        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):\n            Whether to keep accents when tokenizing.\n        bos_token (:obj:`string`, `optional`, defaults to \"<s>\"):\n            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the beginning\n                of sequence. The token used is the :obj:`cls_token`.\n        eos_token (:obj:`string`, `optional`, defaults to \"</s>\"):\n            The end of sequence token.\n\n            .. note::\n\n                When building a sequence using special tokens, this is not the token that is used for the end\n                of sequence. The token used is the :obj:`sep_token`.\n        unk_token (:obj:`string`, `optional`, defaults to \"<unk>\"):\n            The unknown token. 
A token that is not in the vocabulary cannot be converted to an ID and is set to be this\n            token instead.\n        sep_token (:obj:`string`, `optional`, defaults to \"<sep>\"):\n            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences\n            for sequence classification or for a text and a question for question answering.\n            It is also used as the last token of a sequence built with special tokens.\n        pad_token (:obj:`string`, `optional`, defaults to \"<pad>\"):\n            The token used for padding, for example when batching sequences of different lengths.\n        cls_token (:obj:`string`, `optional`, defaults to \"<cls>\"):\n            The classifier token which is used when doing sequence classification (classification of the whole\n            sequence instead of per-token classification). It is the first token of the sequence when built with\n            special tokens.\n        mask_token (:obj:`string`, `optional`, defaults to \"<mask>\"):\n            The token used for masking values. This is the token used when training this model with masked language\n            modeling. This is the token which the model will try to predict.\n        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`[\"<eop>\", \"<eod>\"]`):\n            Additional special tokens used by the tokenizer.\n\n    Attributes:\n        sp_model (:obj:`SentencePieceProcessor`):\n            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).\n    \"\"\"\n\n    vocab_files_names = VOCAB_FILES_NAMES\n    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP\n    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES\n    padding_side = \"left\"\n\n    def __init__(\n        self,\n        vocab_file,\n        do_lower_case=False,\n        remove_space=True,\n        keep_accents=False,\n        bos_token=\"<s>\",\n        eos_token=\"</s>\",\n        unk_token=\"<unk>\",\n        sep_token=\"<sep>\",\n        pad_token=\"<pad>\",\n        cls_token=\"<cls>\",\n        mask_token=\"<mask>\",\n        additional_special_tokens=[\"<eop>\", \"<eod>\"],\n        **kwargs\n    ):\n        super().__init__(\n            bos_token=bos_token,\n            eos_token=eos_token,\n            unk_token=unk_token,\n            sep_token=sep_token,\n            pad_token=pad_token,\n            cls_token=cls_token,\n            mask_token=mask_token,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n\n        self._pad_token_type_id = 3\n\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n\n        self.do_lower_case = do_lower_case\n        self.remove_space = remove_space\n        self.keep_accents = keep_accents\n        self.vocab_file = vocab_file\n\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(vocab_file)\n\n    @property\n    def vocab_size(self):\n        return len(self.sp_model)\n\n    def get_vocab(self):\n        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}\n        vocab.update(self.added_tokens_encoder)\n        return vocab\n\n    def __getstate__(self):\n        state = 
self.__dict__.copy()\n        state[\"sp_model\"] = None\n        return state\n\n    def __setstate__(self, d):\n        self.__dict__ = d\n        try:\n            import sentencepiece as spm\n        except ImportError:\n            logger.warning(\n                \"You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece\"\n                \"pip install sentencepiece\"\n            )\n            raise\n        self.sp_model = spm.SentencePieceProcessor()\n        self.sp_model.Load(self.vocab_file)\n\n    def preprocess_text(self, inputs):\n        if self.remove_space:\n            outputs = \" \".join(inputs.strip().split())\n        else:\n            outputs = inputs\n        outputs = outputs.replace(\"``\", '\"').replace(\"''\", '\"')\n\n        if not self.keep_accents:\n            outputs = unicodedata.normalize(\"NFKD\", outputs)\n            outputs = \"\".join([c for c in outputs if not unicodedata.combining(c)])\n        if self.do_lower_case:\n            outputs = outputs.lower()\n\n        return outputs\n\n    def _tokenize(self, text, sample=False):\n        \"\"\" Tokenize a string. \"\"\"\n        text = self.preprocess_text(text)\n\n        if not sample:\n            pieces = self.sp_model.EncodeAsPieces(text)\n        else:\n            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)\n        new_pieces = []\n        for piece in pieces:\n            if len(piece) > 1 and piece[-1] == str(\",\") and piece[-2].isdigit():\n                cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, \"\"))\n                if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:\n                    if len(cur_pieces[0]) == 1:\n                        cur_pieces = cur_pieces[1:]\n                    else:\n                        cur_pieces[0] = cur_pieces[0][1:]\n                cur_pieces.append(piece[-1])\n                new_pieces.extend(cur_pieces)\n            else:\n                new_pieces.append(piece)\n\n        return new_pieces\n\n    def _convert_token_to_id(self, token):\n        \"\"\" Converts a token (str) in an id using the vocab. 
\"\"\"\n        return self.sp_model.PieceToId(token)\n\n    def _convert_id_to_token(self, index):\n        \"\"\"Converts an index (integer) in a token (str) using the vocab.\"\"\"\n        return self.sp_model.IdToPiece(index)\n\n    def convert_tokens_to_string(self, tokens):\n        \"\"\"Converts a sequence of tokens (strings for sub-words) in a single string.\"\"\"\n        out_string = \"\".join(tokens).replace(SPIECE_UNDERLINE, \" \").strip()\n        return out_string\n\n    def build_inputs_with_special_tokens(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Build model inputs from a sequence or a pair of sequence for sequence classification tasks\n        by concatenating and adding special tokens.\n        An XLNet sequence has the following format:\n\n        - single sequence: ``X <sep> <cls>``\n        - pair of sequences: ``A <sep> B <sep> <cls>``\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of IDs to which the special tokens will be added\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls = [self.cls_token_id]\n        if token_ids_1 is None:\n            return token_ids_0 + sep + cls\n        return token_ids_0 + sep + token_ids_1 + sep + cls\n\n    def get_special_tokens_mask(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False\n    ) -> List[int]:\n        \"\"\"\n        Retrieves sequence ids from a token list that has no special tokens added. 
This method is called when adding\n        special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):\n                Set to True if the token list is already formatted with special tokens for the model\n\n        Returns:\n            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.\n        \"\"\"\n\n        if already_has_special_tokens:\n            if token_ids_1 is not None:\n                raise ValueError(\n                    \"You should not supply a second sequence if the provided sequence of \"\n                    \"ids is already formated with special tokens for the model.\"\n                )\n            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))\n\n        if token_ids_1 is not None:\n            return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1, 1]\n        return ([0] * len(token_ids_0)) + [1, 1]\n\n    def create_token_type_ids_from_sequences(\n        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None\n    ) -> List[int]:\n        \"\"\"\n        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.\n        An XLNet sequence pair mask has the following format:\n        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2\n        | first sequence    | second sequence     | CLS segment ID\n\n        if token_ids_1 is None, only returns the first portion of the mask (0's).\n\n        Args:\n            token_ids_0 (:obj:`List[int]`):\n                List of ids.\n            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):\n                Optional second list of IDs for sequence pairs.\n\n        Returns:\n            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given\n            sequence(s).\n        \"\"\"\n        sep = [self.sep_token_id]\n        cls_segment_id = [2]\n\n        if token_ids_1 is None:\n            return len(token_ids_0 + sep) * [0] + cls_segment_id\n        return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id\n\n    def save_vocabulary(self, save_directory):\n        \"\"\"\n        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.\n\n        Args:\n            save_directory (:obj:`str`):\n                The directory in which to save the vocabulary.\n\n        Returns:\n            :obj:`Tuple(str)`: Paths to the files saved.\n        \"\"\"\n        if not os.path.isdir(save_directory):\n            logger.error(\"Vocabulary path ({}) should be a directory\".format(save_directory))\n            return\n        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES[\"vocab_file\"])\n\n        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):\n            copyfile(self.vocab_file, out_vocab_file)\n\n        return (out_vocab_file,)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/trainer.py",
    "content": "import json\nimport logging\nimport math\nimport os\nimport random\nimport re\nimport shutil\nfrom contextlib import contextmanager\nfrom pathlib import Path\nfrom typing import Callable, Dict, List, Optional, Tuple\nimport time\nimport numpy as np\nimport torch\nfrom packaging import version\nfrom torch import nn\nfrom torch.utils.data.dataloader import DataLoader\nfrom torch.utils.data.dataset import Dataset\nfrom torch.utils.data.distributed import DistributedSampler\nfrom torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler\nfrom tqdm.auto import tqdm, trange\n\nfrom .data.data_collator import DataCollator, DefaultDataCollator\nfrom transformers.modeling_utils import PreTrainedModel\nfrom .optimization import AdamW\nfrom transformers import get_polynomial_decay_schedule_with_warmup#需要新版才有\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput\nfrom .training_args import TrainingArguments, is_tpu_available\n\n\ntry:\n    from apex import amp\n\n    _has_apex = True\nexcept ImportError:\n    _has_apex = False\n\n\ndef is_apex_available():\n    return _has_apex\n\n\nif is_tpu_available():\n    import torch_xla.core.xla_model as xm\n    import torch_xla.debug.metrics as met\n    import torch_xla.distributed.parallel_loader as pl\n\ntry:\n    from torch.utils.tensorboard import SummaryWriter\n\n    _has_tensorboard = True\nexcept ImportError:\n    try:\n        from tensorboardX import SummaryWriter\n\n        _has_tensorboard = True\n    except ImportError:\n        _has_tensorboard = False\n\n\ndef is_tensorboard_available():\n    return _has_tensorboard\n\n\ntry:\n    import wandb\n\n    wandb.ensure_configured()\n    if wandb.api.api_key is None:\n        _has_wandb = False\n        wandb.termwarn(\"W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.\")\n    else:\n        _has_wandb = False if os.getenv(\"WANDB_DISABLED\") else True\nexcept ImportError:\n    _has_wandb = False\n\n\ndef is_wandb_available():\n    return _has_wandb\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef set_seed(seed: int):\n    random.seed(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    torch.cuda.manual_seed_all(seed)\n    # ^^ safe to call this function even if cuda is not available\n\n\n@contextmanager\ndef torch_distributed_zero_first(local_rank: int):\n    \"\"\"\n    Decorator to make all processes in distributed training wait for each local_master to do something.\n    \"\"\"\n    if local_rank not in [-1, 0]:\n        torch.distributed.barrier()\n    yield\n    if local_rank == 0:\n        torch.distributed.barrier()\n\n\nclass SequentialDistributedSampler(Sampler):\n    \"\"\"\n    Distributed Sampler that subsamples indicies sequentially,\n    making it easier to collate all results at the end.\n\n    Even though we only use this sampler for eval and predict (no training),\n    which means that the model params won't have to be synced (i.e. 
will not hang\n    for synchronization even if varied number of forward passes), we still add extra\n    samples to the sampler to make it evenly divisible (like in `DistributedSampler`)\n    to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.\n    \"\"\"\n\n    def __init__(self, dataset, num_replicas=None, rank=None):\n        if num_replicas is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            num_replicas = torch.distributed.get_world_size()\n        if rank is None:\n            if not torch.distributed.is_available():\n                raise RuntimeError(\"Requires distributed package to be available\")\n            rank = torch.distributed.get_rank()\n        self.dataset = dataset\n        self.num_replicas = num_replicas\n        self.rank = rank\n        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))\n        self.total_size = self.num_samples * self.num_replicas\n\n    def __iter__(self):\n        indices = list(range(len(self.dataset)))\n\n        # add extra samples to make it evenly divisible\n        indices += indices[: (self.total_size - len(indices))]\n        assert len(indices) == self.total_size\n\n        # subsample\n        indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]\n        assert len(indices) == self.num_samples\n\n        return iter(indices)\n\n    def __len__(self):\n        return self.num_samples\n\n\ndef get_tpu_sampler(dataset: Dataset):\n    if xm.xrt_world_size() <= 1:\n        return RandomSampler(dataset)\n    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())\n\n\nclass Trainer:\n    \"\"\"\n    Trainer is a simple but feature-complete training and eval loop for PyTorch,\n    optimized for Transformers.\n    \"\"\"\n\n    model: PreTrainedModel\n    args: TrainingArguments\n    train_dataset: Optional[Dataset]\n    eval_dataset: Optional[Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n    tb_writer: Optional[\"SummaryWriter\"] = None\n    optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None\n    global_step: Optional[int] = None\n    epoch: Optional[float] = None\n\n    def __init__(\n        self,\n        model: PreTrainedModel,\n        args: TrainingArguments,\n        train_dataLoader: Optional[DataLoader] = None,\n        eval_dataLoader: Optional[DataLoader] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n        tb_writer: Optional[\"SummaryWriter\"] = None,\n        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,\n    ):\n        \"\"\"\n        Trainer is a simple but feature-complete training and eval loop for PyTorch,\n        optimized for Transformers.\n\n        Args:\n            prediction_loss_only:\n                (Optional) in evaluation and prediction, only return the loss\n        \"\"\"\n        self.model = model.to(args.device)\n        self.args = args\n\n        self.train_dataLoader = train_dataLoader\n        self.eval_dataLoader = eval_dataLoader\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.optimizers = optimizers\n        if tb_writer is not None:\n            self.tb_writer = tb_writer\n        
elif is_tensorboard_available() and self.is_world_master():\n            self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)\n        if not is_tensorboard_available():\n            logger.warning(\n                \"You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.\"\n            )\n        if is_wandb_available():\n            self._setup_wandb()\n        else:\n            logger.info(\n                \"You are instantiating a Trainer but W&B is not installed. To use wandb logging, \"\n                \"run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.\"\n            )\n        set_seed(self.args.seed)\n        # Create output directory if needed\n        if self.is_world_master():\n            os.makedirs(self.args.output_dir, exist_ok=True)\n        if is_tpu_available():\n            # Set an xla_device flag on the model's config.\n            # We'll find a more elegant and not need to do this in the future.\n            self.model.config.xla_device = True\n\n    def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:\n        # We use the same batch_size as for eval.\n        if is_tpu_available():\n            sampler = SequentialDistributedSampler(\n                test_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()\n            )\n        elif self.args.local_rank != -1:\n            sampler = SequentialDistributedSampler(test_dataset)\n        else:\n            sampler = SequentialSampler(test_dataset)\n\n        data_loader = DataLoader(\n            test_dataset,\n            sampler=sampler,\n            batch_size=self.args.eval_batch_size,\n\n        )\n\n        return data_loader\n\n    def get_optimizers(\n        self, num_training_steps: int\n    ) -> Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]:\n        \"\"\"\n        Setup the optimizer and the learning rate scheduler.\n\n        We provide a reasonable default that works well.\n        If you want to use something else, you can pass a tuple in the Trainer's init,\n        or override this method in a subclass.\n        \"\"\"\n        if self.optimizers is not None:\n            return self.optimizers\n        # Prepare optimizer and schedule (linear warmup and decay)\n        no_decay = [\"bias\", \"LayerNorm.weight\"]\n        optimizer_grouped_parameters = [\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],\n                \"weight_decay\": self.args.weight_decay,\n            },\n            {\n                \"params\": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],\n                \"weight_decay\": 0.0,\n            },\n        ]\n\n        optimizer = AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon)\n        scheduler = get_polynomial_decay_schedule_with_warmup(\n            optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=num_training_steps,lr_end=self.args.lr_end\n        )\n        return optimizer, scheduler\n\n    def _setup_wandb(self):\n        \"\"\"\n        Setup the optional Weights & Biases (`wandb`) integration.\n\n        One can override this method to customize the setup if needed.  
Find more information at https://docs.wandb.com/huggingface\n        You can also override the following environment variables:\n\n        Environment:\n            WANDB_WATCH:\n                (Optional, [\"gradients\", \"all\", \"false\"]) \"gradients\" by default, set to \"false\" to disable gradient logging\n                or \"all\" to log gradients and parameters\n            WANDB_PROJECT:\n                (Optional): str - \"huggingface\" by default, set this to a custom string to store results in a different project\n            WANDB_DISABLED:\n                (Optional): boolean - defaults to false, set to \"true\" to disable wandb entirely\n        \"\"\"\n        logger.info('Automatic Weights & Biases logging enabled, to disable set os.environ[\"WANDB_DISABLED\"] = \"true\"')\n        wandb.init(project=os.getenv(\"WANDB_PROJECT\", \"huggingface\"), config=vars(self.args))\n        # keep track of model topology and gradients\n        if os.getenv(\"WANDB_WATCH\") != \"false\":\n            wandb.watch(\n                self.model, log=os.getenv(\"WANDB_WATCH\", \"gradients\"), log_freq=max(100, self.args.logging_steps)\n            )\n\n    def num_examples(self, dataloader: DataLoader) -> int:\n        \"\"\"\n        Helper to get num of examples from a DataLoader, by accessing its Dataset.\n        \"\"\"\n        return len(dataloader.dataset)\n\n    def train(self, model_path: Optional[str] = None):\n        \"\"\"\n        Main training entry point.\n\n        Args:\n            model_path:\n                (Optional) Local path to model if model to train has been instantiated from a local path\n                If present, we will try reloading the optimizer/scheduler states from there.\n        \"\"\"\n        train_dataloader = self.train_dataLoader\n        if self.args.max_steps > 0:\n            t_total = self.args.max_steps\n            num_train_epochs = (\n                self.args.max_steps // (len(train_dataloader) // self.args.gradient_accumulation_steps) + 1\n            )\n        else:\n            t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)\n            num_train_epochs = self.args.num_train_epochs\n\n        optimizer, scheduler = self.get_optimizers(num_training_steps=t_total)\n\n        # Check if saved optimizer or scheduler states exist\n        if (\n            model_path is not None\n            and os.path.isfile(os.path.join(model_path, \"optimizer.pt\"))\n            and os.path.isfile(os.path.join(model_path, \"scheduler.pt\"))\n        ):\n            # Load in optimizer and scheduler states\n            optimizer.load_state_dict(\n                torch.load(os.path.join(model_path, \"optimizer.pt\"), map_location=self.args.device)\n            )\n            scheduler.load_state_dict(torch.load(os.path.join(model_path, \"scheduler.pt\")))\n\n        model = self.model\n        if self.args.fp16:\n            if not is_apex_available():\n                raise ImportError(\"Please install apex from https://www.github.com/nvidia/apex to use fp16 training.\")\n            model, optimizer = amp.initialize(model, optimizer, opt_level=self.args.fp16_opt_level)\n\n        # multi-gpu training (should be after apex fp16 initialization)\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n\n        # Distributed training (should be after apex fp16 initialization)\n        if self.args.local_rank != -1:\n            model = 
torch.nn.parallel.DistributedDataParallel(\n                model,\n                device_ids=[self.args.local_rank],\n                output_device=self.args.local_rank,\n                find_unused_parameters=True,\n            )\n\n        if self.tb_writer is not None:\n            self.tb_writer.add_text(\"args\", self.args.to_json_string())\n            self.tb_writer.add_hparams(self.args.to_sanitized_dict(), metric_dict={})\n\n        # Train!\n        if is_tpu_available():\n            total_train_batch_size = self.args.train_batch_size * xm.xrt_world_size()\n        else:\n            total_train_batch_size = (\n                self.args.train_batch_size\n                * self.args.gradient_accumulation_steps\n                * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)\n            )\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_examples(train_dataloader))\n        logger.info(\"  Num Epochs = %d\", num_train_epochs)\n        logger.info(\"  Instantaneous batch size per device = %d\", self.args.per_device_train_batch_size)\n        logger.info(\"  Total train batch size (w. parallel, distributed & accumulation) = %d\", total_train_batch_size)\n        logger.info(\"  Gradient Accumulation steps = %d\", self.args.gradient_accumulation_steps)\n        logger.info(\"  Total optimization steps = %d\", t_total)\n\n        self.global_step = 0\n        self.epoch = 0\n        epochs_trained = 0\n        steps_trained_in_current_epoch = 0\n        # Check if continuing training from a checkpoint\n        if model_path is not None:\n            # set global_step to global_step of last saved checkpoint from model path\n            try:\n                self.global_step = int(model_path.split(\"-\")[-1].split(\"/\")[0])\n                epochs_trained = self.global_step // (len(train_dataloader) // self.args.gradient_accumulation_steps)\n                steps_trained_in_current_epoch = self.global_step % (\n                    len(train_dataloader) // self.args.gradient_accumulation_steps\n                )\n\n                logger.info(\"  Continuing training from checkpoint, will skip to saved global_step\")\n                logger.info(\"  Continuing training from epoch %d\", epochs_trained)\n                logger.info(\"  Continuing training from global step %d\", self.global_step)\n                logger.info(\"  Will skip the first %d steps in the first epoch\", steps_trained_in_current_epoch)\n            except ValueError:\n                self.global_step = 0\n                logger.info(\"  Starting fine-tuning.\")\n\n        tr_loss = 0.0\n        logging_loss = 0.0\n        tqdmLoss=0  # show a moving-average loss in the progress bar\n        beta_exp=1\n        model.zero_grad()\n        train_iterator = trange(\n            epochs_trained, int(num_train_epochs), desc=\"Epoch\", disable=True\n        )\n        for epoch in train_iterator:\n            last=time.time()\n            if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):\n                train_dataloader.sampler.set_epoch(epoch)\n\n            if is_tpu_available():\n                parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(\n                    self.args.device\n                )\n                epoch_iterator = tqdm(parallel_loader, desc=\"Iteration\", disable=not self.is_local_master())\n            else:\n                epoch_iterator = tqdm(train_dataloader, desc=\"Iteration\", disable=True,ncols=70)  # fix the bar width so it does not wrap onto a new line\n\n            for step, inputs in enumerate(epoch_iterator):\n\n                # Skip past any already trained steps if resuming training\n                if steps_trained_in_current_epoch > 0:\n                    steps_trained_in_current_epoch -= 1\n                    continue\n                now_loss=self._training_step(model, inputs, optimizer)\n                tr_loss += now_loss\n                # enrich the progress bar\n                tqdmLoss=tqdmLoss*0.99+(1-0.99)*now_loss  # exponential moving average\n                beta_exp*=0.99  # bias correction\n\n                epoch_iterator.set_description_str(f\"epoch: {epoch+1}\")\n                epoch_iterator.set_postfix_str(f\"loss: {round(tqdmLoss/(1-beta_exp),4)}\")\n                if (step + 1) % self.args.gradient_accumulation_steps == 0 or (\n                    # last step in epoch but step is always smaller than gradient_accumulation_steps\n                    len(epoch_iterator) <= self.args.gradient_accumulation_steps\n                    and (step + 1) == len(epoch_iterator)\n                ):\n                    if self.args.fp16:\n                        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), self.args.max_grad_norm)\n                    else:\n                        torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)\n\n                    if is_tpu_available():\n                        xm.optimizer_step(optimizer)\n                    else:\n                        optimizer.step()\n\n                    scheduler.step()\n                    model.zero_grad()\n                    self.global_step += 1\n                    self.epoch = epoch + (step + 1) / len(epoch_iterator)\n\n                    if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (\n                        self.global_step == 1 and self.args.logging_first_step\n                    ):\n                        logs: Dict[str, float] = {}\n                        logs[\"loss\"] = (tr_loss - logging_loss) / self.args.logging_steps\n                        # backward compatibility for pytorch schedulers\n                        logs[\"learning_rate\"] = (\n                            scheduler.get_last_lr()[0]\n                            if version.parse(torch.__version__) >= version.parse(\"1.4\")\n                            else scheduler.get_lr()[0]\n                        )\n                        logging_loss = tr_loss\n                        print()  # newline first so the log does not collide with the progress bar\n                        self._log(logs)\n                        print()\n                        if self.args.evaluate_during_training:\n                            self.evaluate()\n\n                    if self.args.save_steps > 0 and self.global_step % self.args.save_steps==0:\n                        # In all cases (even distributed/parallel), self.model is always a reference\n                        # to the model we want to save.\n                        if hasattr(model, \"module\"):\n                            assert model.module is self.model\n                        else:\n                            assert model is self.model\n                        # Save model checkpoint\n                        output_dir = os.path.join(self.args.output_dir, f\"{PREFIX_CHECKPOINT_DIR}-{self.global_step}-epoch-{int(self.epoch)}\")\n\n                        self.save_model(output_dir)\n\n                        if self.is_world_master():\n                            self._rotate_checkpoints()\n\n                        if is_tpu_available():\n                            xm.rendezvous(\"saving_optimizer_states\")\n                            xm.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            xm.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n                        elif self.is_world_master():\n                            torch.save(optimizer.state_dict(), os.path.join(output_dir, \"optimizer.pt\"))\n                            torch.save(scheduler.state_dict(), os.path.join(output_dir, \"scheduler.pt\"))\n\n                if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                    epoch_iterator.close()\n                    break\n            print(f\"Pre-training epoch {epoch} took:\",time.time()-last)\n            if self.args.max_steps > 0 and self.global_step > self.args.max_steps:\n                train_iterator.close()\n                break\n            if self.args.tpu_metrics_debug:\n                # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n                xm.master_print(met.metrics_report())\n        if self.tb_writer:\n            self.tb_writer.close()\n\n        logger.info(\"\\n\\nTraining completed. Do not forget to share your model on huggingface.co/models =)\\n\\n\")\n        return TrainOutput(self.global_step, tr_loss / self.global_step)\n\n    def _log(self, logs: Dict[str, float], iterator: Optional[tqdm] = None) -> None:\n        if self.epoch is not None:\n            logs[\"epoch\"] = self.epoch\n        if self.tb_writer:\n            for k, v in logs.items():\n                self.tb_writer.add_scalar(k, v, self.global_step)\n        if is_wandb_available():\n            wandb.log(logs, step=self.global_step)\n        output = json.dumps({**logs, **{\"step\": self.global_step}})\n        if iterator is not None:\n            iterator.write(output)\n        else:\n            print(output)\n\n    def _training_step(\n        self, model: nn.Module, inputs: Dict[str, torch.Tensor], optimizer: torch.optim.Optimizer\n    ) -> float:\n        model.train()\n        for k, v in inputs.items():\n            inputs[k] = v.to(self.args.device)\n\n        outputs = model(**inputs)\n        loss = outputs[0]  # model outputs are always tuple in transformers1 (see doc)\n\n        if self.args.n_gpu > 1:\n            loss = loss.mean()  # mean() to average on multi-gpu parallel training\n        if self.args.gradient_accumulation_steps > 1:\n            loss = loss / self.args.gradient_accumulation_steps\n\n        if self.args.fp16:\n            with amp.scale_loss(loss, optimizer) as scaled_loss:\n                scaled_loss.backward()\n        else:\n            loss.backward()\n\n        return loss.item()\n\n    def is_local_master(self) -> bool:\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=True)\n        else:\n            return self.args.local_rank in [-1, 0]\n\n    def is_world_master(self) -> bool:\n        \"\"\"\n        This will be True only in one process, even in distributed mode,\n        even when training on multiple machines.\n        \"\"\"\n        if is_tpu_available():\n            return xm.is_master_ordinal(local=False)\n        else:\n            return self.args.local_rank == -1 or torch.distributed.get_rank() == 0\n\n    def save_model(self, output_dir: Optional[str] = None):\n        \"\"\"\n        Saving best-practices: if you use 
default names for the model,\n        you can reload it using from_pretrained().\n\n        Will only save from the world_master process (unless in TPUs).\n        \"\"\"\n\n        if is_tpu_available():\n            self._save_tpu(output_dir)\n        elif self.is_world_master():\n            self._save(output_dir)\n\n    def _save_tpu(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n\n        if xm.is_master_ordinal():\n            os.makedirs(output_dir, exist_ok=True)\n            torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n\n        xm.rendezvous(\"saving_checkpoint\")\n        self.model.save_pretrained(output_dir)\n\n    def _save(self, output_dir: Optional[str] = None):\n        output_dir = output_dir if output_dir is not None else self.args.output_dir\n        os.makedirs(output_dir, exist_ok=True)\n        logger.info(\"Saving model checkpoint to %s\", output_dir)\n        # Save a trained model and configuration using `save_pretrained()`.\n        # They can then be reloaded using `from_pretrained()`\n        if not isinstance(self.model, PreTrainedModel):\n            raise ValueError(\"Trainer.model appears to not be a PreTrainedModel\")\n        self.model.save_pretrained(output_dir)\n\n        # Good practice: save your training arguments together with the trained model\n        torch.save(self.args, os.path.join(output_dir, \"training_args.bin\"))\n\n    def _sorted_checkpoints(self, checkpoint_prefix=PREFIX_CHECKPOINT_DIR, use_mtime=False) -> List[str]:\n        ordering_and_checkpoint_path = []\n\n        glob_checkpoints = [str(x) for x in Path(self.args.output_dir).glob(f\"{checkpoint_prefix}-*\")]\n\n        for path in glob_checkpoints:\n            if use_mtime:\n                ordering_and_checkpoint_path.append((os.path.getmtime(path), path))\n            else:\n                regex_match = re.match(f\".*{checkpoint_prefix}-([0-9]+)\", path)\n                if regex_match and regex_match.groups():\n                    ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))\n\n        checkpoints_sorted = sorted(ordering_and_checkpoint_path)\n        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]\n        return checkpoints_sorted\n\n    def _rotate_checkpoints(self, use_mtime=False) -> None:\n        if self.args.save_total_limit is None or self.args.save_total_limit <= 0:\n            return\n\n        # Check if we should delete older checkpoint(s)\n        checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime)\n        if len(checkpoints_sorted) <= self.args.save_total_limit:\n            return\n\n        number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - self.args.save_total_limit)\n        checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]\n        for checkpoint in checkpoints_to_be_deleted:\n            curEpoch = checkpoint.split('-')[-1]\n            print(checkpoint,curEpoch)\n            if int(curEpoch) % 50 == 0:\n                continue\n            logger.info(\"Deleting older checkpoint [{}] 
due to args.save_total_limit\".format(checkpoint))\n            shutil.rmtree(checkpoint)\n\n    def evaluate(\n        self, eval_dataset: Optional[Dataset] = None, prediction_loss_only: Optional[bool] = None,\n    ) -> Dict[str, float]:\n        \"\"\"\n        Run evaluation and return metrics.\n\n        The calling script will be responsible for providing a method to compute metrics, as they are\n        task-dependent.\n\n        Args:\n            eval_dataset: (Optional) Pass a dataset if you wish to override\n            the one on the instance.\n        Returns:\n            A dict containing:\n                - the eval loss\n                - the potential metrics computed from the predictions\n        \"\"\"\n        eval_dataloader = self.eval_dataLoader\n\n        output = self._prediction_loop(eval_dataloader, description=\"Evaluation\")\n\n        self._log(output.metrics)\n\n        if self.args.tpu_metrics_debug:\n            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)\n            xm.master_print(met.metrics_report())\n\n        return output.metrics\n\n    def predict(self, test_dataset: Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        \"\"\"\n        test_dataloader = self.get_test_dataloader(test_dataset)\n\n        return self._prediction_loop(test_dataloader, description=\"Prediction\")\n\n    def _prediction_loop(\n        self, dataloader: DataLoader, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n\n        Works both with or without labels.\n        \"\"\"\n\n        prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else self.prediction_loss_only\n\n        model = self.model\n        # multi-gpu eval\n        if self.args.n_gpu > 1:\n            model = torch.nn.DataParallel(model)\n        else:\n            model = self.model\n        # Note: in torch.distributed mode, there's no point in wrapping the model\n        # inside a DistributedDataParallel as we'll be under `no_grad` anyways.\n\n        batch_size = dataloader.batch_size\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Num examples = %d\", self.num_examples(dataloader))\n        logger.info(\"  Batch size = %d\", batch_size)\n        eval_losses: List[float] = []\n        preds: torch.Tensor = None\n        label_ids: torch.Tensor = None\n        model.eval()\n\n        if is_tpu_available():\n            dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)\n\n        for inputs in tqdm(dataloader, desc=description):\n            has_labels = any(inputs.get(k) is not None for k in [\"labels\", \"lm_labels\", \"masked_lm_labels\"])\n\n            for k, v in inputs.items():\n                inputs[k] = v.to(self.args.device)\n\n            with torch.no_grad():\n                outputs = model(**inputs)\n                if has_labels:\n                    step_eval_loss, logits = outputs[:2]\n                    eval_losses += [step_eval_loss.mean().item()]\n                else:\n                    logits = outputs[0]\n\n            if not 
prediction_loss_only:\n                if preds is None:\n                    preds = logits.detach()\n                else:\n                    preds = torch.cat((preds, logits.detach()), dim=0)\n                if inputs.get(\"labels\") is not None:\n                    if label_ids is None:\n                        label_ids = inputs[\"labels\"].detach()\n                    else:\n                        label_ids = torch.cat((label_ids, inputs[\"labels\"].detach()), dim=0)\n\n        if self.args.local_rank != -1:\n            # In distributed mode, concatenate all results from all nodes:\n            if preds is not None:\n                preds = self.distributed_concat(preds, num_total_examples=self.num_examples(dataloader))\n            if label_ids is not None:\n                label_ids = self.distributed_concat(label_ids, num_total_examples=self.num_examples(dataloader))\n        elif is_tpu_available():\n            # tpu-comment: Get all predictions and labels from all worker shards of eval dataset\n            if preds is not None:\n                preds = xm.mesh_reduce(\"eval_preds\", preds, torch.cat)\n            if label_ids is not None:\n                label_ids = xm.mesh_reduce(\"eval_label_ids\", label_ids, torch.cat)\n\n        # Finally, turn the aggregated tensors into numpy arrays.\n        if preds is not None:\n            preds = preds.cpu().numpy()\n        if label_ids is not None:\n            label_ids = label_ids.cpu().numpy()\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n        if len(eval_losses) > 0:\n            metrics[\"eval_loss\"] = np.mean(eval_losses)\n\n        # Prefix all keys with eval_\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def distributed_concat(self, tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:\n        assert self.args.local_rank != -1\n\n        output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]\n        torch.distributed.all_gather(output_tensors, tensor)\n\n        concat = torch.cat(output_tensors, dim=0)\n\n        # truncate the dummy elements added by SequentialDistributedSampler\n        output = concat[:num_total_examples]\n        return output\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/trainer_tf.py",
    "content": "\"\"\"Tensorflow trainer class.\"\"\"\n\nimport logging\nimport math\nimport os\nfrom typing import Callable, Dict, Optional\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom .modeling_tf_utils import TFPreTrainedModel, shape_list\nfrom .optimization_tf import GradientAccumulator, create_optimizer\nfrom .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput\nfrom .training_args_tf import TFTrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\n\nclass TFTrainer:\n    model: TFPreTrainedModel\n    args: TFTrainingArguments\n    # something similar to a PT Dataset.\n    # This is just temporary before to have\n    # a framework-agnostic approach for datasets.\n    train_dataset: Optional[tf.data.Dataset]\n    eval_dataset: Optional[tf.data.Dataset]\n    compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None\n    prediction_loss_only: bool\n\n    def __init__(\n        self,\n        model: TFPreTrainedModel,\n        args: TFTrainingArguments,\n        train_dataset: Optional[tf.data.Dataset] = None,\n        eval_dataset: Optional[tf.data.Dataset] = None,\n        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,\n        prediction_loss_only=False,\n    ):\n        self.model = model\n        self.args = args\n        self.train_dataset = train_dataset\n        self.eval_dataset = eval_dataset\n        self.compute_metrics = compute_metrics\n        self.prediction_loss_only = prediction_loss_only\n        self.gradient_accumulator = GradientAccumulator()\n\n        self._setup_training()\n\n    def _setup_training(self) -> None:\n        \"\"\"\n        Setup the different steps to train a model:\n          - check if all the data are given\n          - create the proper strategy\n          - create the features\n          - prepare the model settings\n        \"\"\"\n        self._prepare_dataset()\n\n        with self.args.strategy.scope():\n            self._create_optimizer()\n            _ = self.optimizer.iterations\n            self._set_loss_and_metric()\n            self._create_checkpoint_manager()\n            self._create_summary_writer()\n\n    def _set_loss_and_metric(self) -> None:\n        \"\"\"\n        Create the training loss and metric with their name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        try:\n            self.loss = tf.keras.losses.get(\n                {\n                    \"class_name\": self.args.loss_name,\n                    \"config\": {\"from_logits\": True, \"reduction\": tf.keras.losses.Reduction.NONE},\n                }\n            )\n        except TypeError:\n            self.loss = tf.keras.losses.get(\n                {\"class_name\": self.args.loss_name, \"config\": {\"reduction\": tf.keras.losses.Reduction.NONE}}\n            )\n\n    def _create_summary_writer(self) -> None:\n        \"\"\"\n        Create a summary writer to be able to read the logs in Tensorboard.\n        \"\"\"\n        self.writer = tf.summary.create_file_writer(self.args.logging_dir)\n\n    def _prepare_dataset(self) -> None:\n        \"\"\"\n        Prepare the training, validation and test data.\n        \"\"\"\n        if self.train_dataset is not None:\n            self.num_train_examples = self.train_dataset.reduce(tf.constant(0), lambda x, _: x + 1).numpy()\n\n            if self.args.max_steps > 0:\n                self.train_steps = self.args.max_steps\n            else:\n                self.train_steps: int = math.ceil(self.num_train_examples / self.args.train_batch_size)\n\n            self.train_dataset = (\n                self.train_dataset.cache()\n                .shuffle(self.num_train_examples)\n                .batch(self.args.train_batch_size)\n                .prefetch(tf.data.experimental.AUTOTUNE)\n            )\n\n            if self.args.max_steps > 0:\n                self.train_dataset = self.train_dataset.repeat(-1)\n\n            self.train_dataset = self.args.strategy.experimental_distribute_dataset(self.train_dataset)\n        else:\n            self.train_steps = 0\n\n        if self.eval_dataset is not None:\n            self.eval_dataset = (\n                self.eval_dataset.batch(self.args.eval_batch_size).cache().prefetch(tf.data.experimental.AUTOTUNE)\n            )\n            self.eval_dataset = self.args.strategy.experimental_distribute_dataset(self.eval_dataset)\n\n    def _create_optimizer(self) -> None:\n        \"\"\"\n        Create the training optimizer with its name. 
Allowed names are those listed\n        in the Tensorflow documentation and those contained in the transformers1 library.\n        \"\"\"\n        if self.args.optimizer_name == \"adamw\":\n            self.optimizer = create_optimizer(\n                self.args.learning_rate, self.train_steps, self.args.warmup_steps, self.args.end_lr\n            )\n        else:\n            try:\n                self.optimizer = tf.keras.optimizers.get(\n                    {\n                        \"class_name\": self.args.optimizer_name,\n                        \"config\": {\"learning_rate\": self.args.learning_rate, \"epsilon\": self.args.adam_epsilon},\n                    }\n                )\n            except TypeError:\n                # This is for the case where the optimizer is not Adam-like such as SGD\n                self.optimizer = tf.keras.optimizers.get(\n                    {\"class_name\": self.args.optimizer_name, \"config\": {\"learning_rate\": self.args.learning_rate}}\n                )\n        logger.info(\"Created an/a {} optimizer\".format(self.args.optimizer_name))\n\n    def _create_checkpoint_manager(self, max_to_keep: int = 5, load_model: bool = True) -> None:\n        \"\"\"\n        Create a checkpoint manager in order to be able to make the training\n        fault-tolerant.\n        Args:\n          max_to_keep: the maximum number of checkpoints to keep in the checkpoint path.\n          load_model: if we want to start the training from the latest checkpoint.\n        \"\"\"\n        ckpt = tf.train.Checkpoint(optimizer=self.optimizer, model=self.model)\n\n        self.model.ckpt_manager = tf.train.CheckpointManager(ckpt, PREFIX_CHECKPOINT_DIR, max_to_keep=max_to_keep)\n\n        if load_model:\n            ckpt.restore(self.model.ckpt_manager.latest_checkpoint).expect_partial()\n\n    @tf.function\n    def _evaluate_steps(self, per_replica_features, per_replica_labels):\n        \"\"\"\n        One step evaluation across replica.\n        Args:\n          per_replica_features: the batched features.\n          per_replica_labels: the batched labels.\n        Returns:\n          The loss corresponding to the given batch.\n        \"\"\"\n        per_replica_loss, per_replica_logits = self.args.strategy.experimental_run_v2(\n            self._run_model, args=(per_replica_features, per_replica_labels, False)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss, per_replica_logits\n\n    def _prediction_loop(\n        self, dataset: tf.data.Dataset, description: str, prediction_loss_only: Optional[bool] = None\n    ) -> PredictionOutput:\n        logger.info(\"***** Running %s *****\", description)\n        logger.info(\"  Batch size = %d\", self.args.eval_batch_size)\n\n        label_ids: np.ndarray = None\n        preds: np.ndarray = None\n\n        step: int = 1\n\n        for features, labels in dataset:\n            step = tf.convert_to_tensor(step, dtype=tf.int64)\n            loss, logits = self._evaluate_steps(features, labels)\n            loss = tf.reduce_mean(loss)\n\n            if not prediction_loss_only:\n                if self.args.n_gpu > 1:\n                    for val in logits.values:\n                        if preds is None:\n                            preds = val.numpy()\n                        
else:\n                            preds = np.append(preds, val.numpy(), axis=0)\n\n                    for val in labels.values:\n                        if label_ids is None:\n                            label_ids = val.numpy()\n                        else:\n                            label_ids = np.append(label_ids, val.numpy(), axis=0)\n                else:\n                    if preds is None:\n                        preds = logits.numpy()\n                    else:\n                        preds = np.append(preds, logits.numpy(), axis=0)\n\n                    if label_ids is None:\n                        label_ids = labels.numpy()\n                    else:\n                        label_ids = np.append(label_ids, labels.numpy(), axis=0)\n\n            step += 1\n\n        if self.compute_metrics is not None and preds is not None and label_ids is not None:\n            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))\n        else:\n            metrics = {}\n\n        metrics[\"eval_loss\"] = loss.numpy()\n\n        for key in list(metrics.keys()):\n            if not key.startswith(\"eval_\"):\n                metrics[f\"eval_{key}\"] = metrics.pop(key)\n\n        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)\n\n    def evaluate(\n        self, eval_dataset: Optional[tf.data.Dataset] = None, prediction_loss_only: Optional[bool] = None\n    ) -> Dict[str, float]:\n        \"\"\"\n        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.\n        \"\"\"\n        if eval_dataset is None:\n            eval_dataset = self.eval_dataset\n\n        output = self._prediction_loop(eval_dataset, description=\"Evaluation\")\n\n        return output.metrics\n\n    def train(self) -> None:\n        \"\"\"\n        Train method to train the model.\n        \"\"\"\n        if self.args.debug:\n            tf.summary.trace_on(graph=True, profiler=True)\n\n        self.gradient_accumulator.reset()\n\n        iterations = self.optimizer.iterations\n\n        if iterations.numpy() > 0:\n            logger.info(\"Start the training from the last checkpoint\")\n            start_epoch = (iterations.numpy() // self.train_steps) + 1\n        else:\n            start_epoch = 1\n\n        tf.summary.experimental.set_step(iterations)\n\n        epochs = 1 if self.args.max_steps > 0 else self.args.num_train_epochs\n\n        logger.info(\"***** Running training *****\")\n        logger.info(\"  Num examples = %d\", self.num_train_examples)\n        logger.info(\"  Num Epochs = %d\", epochs)\n        logger.info(\"  Total optimization steps = %d\", self.train_steps)\n\n        for epoch in range(start_epoch, int(epochs + 1)):\n            for training_loss in self._training_steps():\n                step = iterations.numpy()\n\n                if self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.scalar(\"loss\", training_loss, step=step)\n\n                if step == 1 and self.args.debug:\n                    with self.writer.as_default():\n                        tf.summary.trace_export(name=\"training\", step=step, profiler_outdir=self.args.logging_dir)\n\n                if self.args.evaluate_during_training and step % self.args.eval_steps == 0:\n                    logs = {}\n                    results = self.evaluate()\n\n                    for key, value in results.items():\n                        eval_key = \"eval_{}\".format(key)\n                   
     logs[eval_key] = value\n\n                    if callable(self.optimizer.learning_rate):\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate(step).numpy()\n                    else:\n                        logs[\"learning_rate\"] = self.optimizer.learning_rate.numpy()\n\n                    logger.info(\"Epoch {} Step {} Validation Metrics {}\".format(epoch, step, logs))\n\n                    with self.writer.as_default():\n                        for k, v in logs.items():\n                            tf.summary.scalar(k, v, step=step)\n\n                if step % self.args.logging_steps == 0:\n                    logger.info(\"Epoch {} Step {} Train Loss {:.4f}\".format(epoch, step, training_loss.numpy()))\n\n                if step % self.args.save_steps == 0:\n                    ckpt_save_path = self.model.ckpt_manager.save()\n                    logger.info(\"Saving checkpoint for step {} at {}\".format(step, ckpt_save_path))\n\n                if step % self.train_steps == 0:\n                    break\n\n    def _training_steps(self):\n        \"\"\"\n        Returns a generator over training steps (i.e. parameters update).\n        \"\"\"\n        for i, loss in enumerate(self._accumulate_next_gradients()):\n            if i % self.args.gradient_accumulation_steps == 0:\n                self._apply_gradients()\n                yield loss\n\n    @tf.function\n    def _apply_gradients(self):\n        \"\"\"Applies the gradients (cross-replica).\"\"\"\n        self.args.strategy.experimental_run_v2(self._step)\n\n    def _step(self):\n        \"\"\"Applies gradients and resets accumulation.\"\"\"\n        gradient_scale = self.gradient_accumulator.step * self.args.strategy.num_replicas_in_sync\n        gradients = [\n            gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients\n        ]\n        gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]\n\n        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))\n        self.gradient_accumulator.reset()\n\n    def _accumulate_next_gradients(self):\n        \"\"\"Accumulates the gradients from the next element in dataset.\"\"\"\n        iterator = iter(self.train_dataset)\n\n        @tf.function\n        def _accumulate_next():\n            per_replica_features, per_replica_labels = next(iterator)\n\n            return self._accumulate_gradients(per_replica_features, per_replica_labels)\n\n        while True:\n            try:\n                yield _accumulate_next()\n            except tf.errors.OutOfRangeError:\n                break\n\n    def _accumulate_gradients(self, per_replica_features, per_replica_labels):\n        \"\"\"Accumulates the gradients across all the replica.\"\"\"\n        per_replica_loss = self.args.strategy.experimental_run_v2(\n            self._forward, args=(per_replica_features, per_replica_labels)\n        )\n\n        try:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=0)\n        except ValueError:\n            reduced_loss = self.args.strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, None)\n\n        return reduced_loss\n\n    def _forward(self, features, labels):\n        \"\"\"Forwards a training example and accumulates the gradients.\"\"\"\n        per_example_loss, _ = self._run_model(features, labels, True)\n        gradients = 
tf.gradients(per_example_loss, self.model.trainable_variables)\n        gradients = [\n            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)\n        ]\n\n        self.gradient_accumulator(gradients)\n\n        return per_example_loss\n\n    def _run_model(self, features, labels, training):\n        \"\"\"\n        Computes the loss of the given features and labels pair.\n        Args:\n          features: the batched features.\n          labels: the batched labels.\n          training: run the model in training mode or not\n        \"\"\"\n        if self.args.mode == \"text-classification\" or self.args.mode == \"token-classification\":\n            logits = self.model(features, training=training)[0]\n        else:\n            logits = self.model(features, training=training)\n\n        if self.args.mode == \"token-classification\":\n            active_loss = tf.reshape(labels, (-1,)) != -1\n            reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)\n            labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active_loss)\n            loss = self.loss(labels, reduced_logits)\n        elif self.args.mode == \"question-answering\":\n            start_loss = self.loss(labels[\"start_position\"], logits[0])\n            end_loss = self.loss(labels[\"end_position\"], logits[1])\n            loss = (start_loss + end_loss) / 2.0\n        else:\n            loss = self.loss(labels, logits)\n\n        loss += sum(self.model.losses) * (1.0 / self.args.n_gpu)\n\n        return loss, logits\n\n    def predict(self, test_dataset: tf.data.Dataset) -> PredictionOutput:\n        \"\"\"\n        Run prediction and return predictions and potential metrics.\n        Depending on the dataset and your use case, your test dataset may contain labels.\n        In that case, this method will also return metrics, like in evaluate().\n        Args:\n          test_dataset: something similar to a PT Dataset. This is just\n            temporary before to have a framework-agnostic approach for datasets.\n        \"\"\"\n        test_dataset = test_dataset.batch(self.args.eval_batch_size)\n        test_dataset = self.args.strategy.experimental_distribute_dataset(test_dataset)\n\n        return self._prediction_loop(test_dataset, description=\"Prediction\")\n\n    def save_model(self) -> None:\n        \"\"\"\n        Save the pretrained model and create a Tensorflow saved model.\n        \"\"\"\n        logger.info(\"Saving model in {}\".format(self.args.output_dir))\n\n        path = os.path.join(self.args.output_dir, \"saved_model\")\n\n        logger.info(\"Saving model in {}\".format(path))\n        os.makedirs(path, exist_ok=True)\n        self.model.save_pretrained(self.args.output_dir)\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/trainer_utils.py",
    "content": "from typing import Dict, NamedTuple, Optional\n\nimport numpy as np\n\n\nclass EvalPrediction(NamedTuple):\n    \"\"\"\n    Evaluation output (always contains labels), to be used\n    to compute metrics.\n    \"\"\"\n\n    predictions: np.ndarray\n    label_ids: np.ndarray\n\n\nclass PredictionOutput(NamedTuple):\n    predictions: np.ndarray\n    label_ids: Optional[np.ndarray]\n    metrics: Optional[Dict[str, float]]\n\n\nclass TrainOutput(NamedTuple):\n    global_step: int\n    training_loss: float\n\n\nPREFIX_CHECKPOINT_DIR = \"checkpoint\"\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/training_args.py",
    "content": "import dataclasses\nimport json\nimport logging\nfrom dataclasses import dataclass, field\nfrom typing import Any, Dict, Optional, Tuple\n\nfrom .file_utils import cached_property, is_torch_available, torch_required\n\n\nif is_torch_available():\n    import torch\n\n\ntry:\n    import torch_xla.core.xla_model as xm\n\n    _has_tpu = True\nexcept ImportError:\n    _has_tpu = False\n\n\n@torch_required\ndef is_tpu_available():\n    return _has_tpu\n\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass TrainingArguments:\n    \"\"\"\n    TrainingArguments is the subset of the arguments we use in our example scripts\n    **which relate to the training loop itself**.\n\n    Using `HfArgumentParser` we can turn this class\n    into argparse arguments to be able to specify them on\n    the command line.\n    \"\"\"\n\n    output_dir: str = field(\n        metadata={\"help\": \"The output directory where the model predictions and checkpoints will be written.\"}\n    )\n    overwrite_output_dir: bool = field(\n        default=False,\n        metadata={\n            \"help\": (\n                \"Overwrite the content of the output directory.\"\n                \"Use this to continue training if output_dir points to a checkpoint directory.\"\n            )\n        },\n    )\n\n    do_train: bool = field(default=False, metadata={\"help\": \"Whether to run training.\"})\n    do_eval: bool = field(default=False, metadata={\"help\": \"Whether to run eval on the dev set.\"})\n    do_predict: bool = field(default=False, metadata={\"help\": \"Whether to run predictions on the test set.\"})\n    evaluate_during_training: bool = field(\n        default=False, metadata={\"help\": \"Run evaluation during training at each logging step.\"},\n    )\n\n    per_device_train_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for training.\"}\n    )\n    per_device_eval_batch_size: int = field(\n        default=8, metadata={\"help\": \"Batch size per GPU/TPU core/CPU for evaluation.\"}\n    )\n\n    per_gpu_train_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_train_batch_size` is preferred. 
\"\n            \"Batch size per GPU/TPU core/CPU for training.\"\n        },\n    )\n    per_gpu_eval_batch_size: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": \"Deprecated, the use of `--per_device_eval_batch_size` is preferred.\"\n            \"Batch size per GPU/TPU core/CPU for evaluation.\"\n        },\n    )\n\n    gradient_accumulation_steps: int = field(\n        default=1,\n        metadata={\"help\": \"Number of updates steps to accumulate before performing a backward/update pass.\"},\n    )\n\n    learning_rate: float = field(default=5e-5, metadata={\"help\": \"The initial learning rate for Adam.\"})\n    lr_end: float = field(default=1e-5, metadata={\"help\": \"学习率最后衰减到多少.\"})\n    weight_decay: float = field(default=0.0, metadata={\"help\": \"Weight decay if we apply some.\"})\n    adam_epsilon: float = field(default=1e-8, metadata={\"help\": \"Epsilon for Adam optimizer.\"})\n    max_grad_norm: float = field(default=1.0, metadata={\"help\": \"Max gradient norm.\"})\n\n    num_train_epochs: float = field(default=3.0, metadata={\"help\": \"Total number of training epochs to perform.\"})\n    max_steps: int = field(\n        default=-1,\n        metadata={\"help\": \"If > 0: set total number of training steps to perform. Override num_train_epochs.\"},\n    )\n    warmup_steps: int = field(default=0, metadata={\"help\": \"Linear warmup over warmup_steps.\"})\n\n    logging_dir: Optional[str] = field(default=None, metadata={\"help\": \"Tensorboard log dir.\"})\n    logging_first_step: bool = field(default=False, metadata={\"help\": \"Log and eval the first global_step\"})\n    logging_steps: int = field(default=500, metadata={\"help\": \"Log every X updates steps.\"})\n    save_steps: int = field(default=500, metadata={\"help\": \"Save checkpoint every X updates steps.\"})\n    save_total_limit: Optional[int] = field(\n        default=None,\n        metadata={\n            \"help\": (\n                \"Limit the total amount of checkpoints.\"\n                \"Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints\"\n            )\n        },\n    )\n    no_cuda: bool = field(default=False, metadata={\"help\": \"Do not use CUDA even when it is available\"})\n    seed: int = field(default=42, metadata={\"help\": \"random seed for initialization\"})\n\n    fp16: bool = field(\n        default=False,\n        metadata={\"help\": \"Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit\"},\n    )\n    fp16_opt_level: str = field(\n        default=\"O1\",\n        metadata={\n            \"help\": (\n                \"For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3'].\"\n                \"See details at https://nvidia.github.io/apex/amp.html\"\n            )\n        },\n    )\n    local_rank: int = field(default=-1, metadata={\"help\": \"For distributed training: local_rank\"})\n\n    tpu_num_cores: Optional[int] = field(\n        default=None, metadata={\"help\": \"TPU: Number of TPU cores (automatically passed by launcher script)\"}\n    )\n    tpu_metrics_debug: bool = field(default=False, metadata={\"help\": \"TPU: Whether to print debug metrics\"})\n\n    @property\n    def train_batch_size(self) -> int:\n        if self.per_gpu_train_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future \"\n                \"version. 
Using `--per_device_train_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @property\n    def eval_batch_size(self) -> int:\n        if self.per_gpu_eval_batch_size:\n            logger.warning(\n                \"Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future \"\n                \"version. Using `--per_device_eval_batch_size` is preferred.\"\n            )\n        per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size\n        return per_device_batch_size * max(1, self.n_gpu)\n\n    @cached_property\n    @torch_required\n    def _setup_devices(self) -> Tuple[\"torch.device\", int]:\n        logger.info(\"PyTorch: setting up devices\")\n        if self.no_cuda:\n            device = torch.device(\"cpu\")\n            n_gpu = 0\n        elif is_tpu_available():\n            device = xm.xla_device()\n            n_gpu = 0\n        elif self.local_rank == -1:\n            # if n_gpu is > 1 we'll use nn.DataParallel.\n            # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n            device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n            n_gpu = torch.cuda.device_count()\n        else:\n            # Here, we'll use torch.distributed.\n            # Initializes the distributed backend which will take care of sychronizing nodes/GPUs\n            torch.distributed.init_process_group(backend=\"nccl\")\n            device = torch.device(\"cuda\", self.local_rank)\n            n_gpu = 1\n        return device, n_gpu\n\n    @property\n    @torch_required\n    def device(self) -> \"torch.device\":\n        return self._setup_devices[0]\n\n    @property\n    @torch_required\n    def n_gpu(self):\n        return self._setup_devices[1]\n\n    def to_json_string(self):\n        \"\"\"\n        Serializes this instance to a JSON string.\n        \"\"\"\n        return json.dumps(dataclasses.asdict(self), indent=2)\n\n    def to_sanitized_dict(self) -> Dict[str, Any]:\n        \"\"\"\n        Sanitized serialization to use with TensorBoard’s hparams\n        \"\"\"\n        d = dataclasses.asdict(self)\n        valid_types = [bool, int, float, str]\n        if is_torch_available():\n            valid_types.append(torch.Tensor)\n        return {k: v if type(v) in valid_types else str(v) for k, v in d.items()}\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/training_args_tf.py",
    "content": "import logging\nfrom dataclasses import dataclass, field\nfrom typing import Tuple\n\nfrom .file_utils import cached_property, is_tf_available, tf_required\nfrom .training_args import TrainingArguments\n\n\nlogger = logging.getLogger(__name__)\n\nif is_tf_available():\n    import tensorflow as tf\n\n\n@dataclass\nclass TFTrainingArguments(TrainingArguments):\n    optimizer_name: str = field(\n        default=\"adam\",\n        metadata={\n            \"help\": 'Name of a Tensorflow optimizer among \"adadelta, adagrad, adam, adamax, ftrl, nadam, rmsprop, sgd, adamw\"'\n        },\n    )\n    mode: str = field(\n        default=\"text-classification\",\n        metadata={\"help\": 'Type of task, one of \"text-classification\", \"token-classification\", \"question-answering\"'},\n    )\n    loss_name: str = field(\n        default=\"SparseCategoricalCrossentropy\",\n        metadata={\n            \"help\": \"Name of a Tensorflow loss. For the list see: https://www.tensorflow.org/api_docs/python/tf/keras/losses\"\n        },\n    )\n    tpu_name: str = field(\n        default=None, metadata={\"help\": \"Name of TPU\"},\n    )\n    end_lr: float = field(\n        default=0, metadata={\"help\": \"End learning rate for optimizer\"},\n    )\n    eval_steps: int = field(default=1000, metadata={\"help\": \"Run an evaluation every X steps.\"})\n    debug: bool = field(\n        default=False, metadata={\"help\": \"Activate the trace to record computation graphs and profiling information\"}\n    )\n\n    @cached_property\n    @tf_required\n    def _setup_strategy(self) -> Tuple[\"tf.distribute.Strategy\", int]:\n        logger.info(\"Tensorflow: setting up strategy\")\n        gpus = tf.config.list_physical_devices(\"GPU\")\n\n        if self.no_cuda:\n            strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n        else:\n            try:\n                if self.tpu_name:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(self.tpu_name)\n                else:\n                    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()\n            except ValueError:\n                tpu = None\n\n            if tpu:\n                tf.config.experimental_connect_to_cluster(tpu)\n                tf.tpu.experimental.initialize_tpu_system(tpu)\n\n                strategy = tf.distribute.experimental.TPUStrategy(tpu)\n            elif len(gpus) == 0:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n            elif len(gpus) == 1:\n                strategy = tf.distribute.OneDeviceStrategy(device=\"/gpu:0\")\n            elif len(gpus) > 1:\n                # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`\n                strategy = tf.distribute.MirroredStrategy()\n            else:\n                raise ValueError(\"Cannot find the proper strategy please check your environment properties.\")\n\n        return strategy\n\n    @property\n    @tf_required\n    def strategy(self) -> \"tf.distribute.Strategy\":\n        return self._setup_strategy\n\n    @property\n    @tf_required\n    def n_gpu(self) -> int:\n        return self._setup_strategy.num_replicas_in_sync\n"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/try.py",
    "content": "# Scratch script: export a PyTorch ALBERT checkpoint and reload it with the TensorFlow classes.\nimport os\n\nfrom transformers import TFAlbertForMaskedLM, TFAlbertModel, TFAlbertForSequenceClassification, AlbertForMaskedLM\n\ncheckpoint = \"albert-base-v1\"\n# Expand \"~\" so the files land under the home directory instead of a literal \"~\" folder.\nsave_dir = os.path.expanduser(\"~/saved/\" + checkpoint)\nos.makedirs(save_dir, exist_ok=True)\n\n# Download the PyTorch MLM weights and save them locally.\nmodel = AlbertForMaskedLM.from_pretrained(checkpoint)\nmodel.save_pretrained(save_dir)\n\n# Load the PyTorch checkpoint into the TF class (from_pt=True converts it) and save the TF weights.\nmodel = TFAlbertForMaskedLM.from_pretrained(save_dir, from_pt=True)\nmodel.save_pretrained(save_dir)\n\n# Reload the converted checkpoint with several TF head classes as a sanity check.\nmodel = TFAlbertModel.from_pretrained(save_dir)\nmodel = TFAlbertForMaskedLM.from_pretrained(save_dir)\nmodel = TFAlbertForSequenceClassification.from_pretrained(save_dir)\n\nprint(\"nice model\")"
  },
  {
    "path": "code/nezha-base-count5/pretrain/transformers1/utils_encoder_decoder.py",
    "content": "# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\" Classes to support Encoder-Decoder architectures \"\"\"\n\n\ndef prepare_encoder_decoder_model_kwargs(**kwargs):\n    \"\"\" Prepare the encoder and decoder's keyword arguments.\n\n    Keyword arguments come in 3 flavors:\n    - encoder-specific (prefixed by `encoder_`)\n    - decoder-specific (prefixed by `decoder_`)\n    - those that apply to the model as whole.\n\n    We let the specific kwargs override the common ones in case of\n    conflict.\n    \"\"\"\n\n    kwargs_common = {\n        argument: value\n        for argument, value in kwargs.items()\n        if not argument.startswith(\"encoder_\") and not argument.startswith(\"decoder_\")\n    }\n    if \"input_ids\" in kwargs_common:\n        kwargs[\"encoder_input_ids\"] = kwargs_common.pop(\"input_ids\")\n\n    decoder_kwargs = kwargs_common.copy()\n    encoder_kwargs = kwargs_common.copy()\n    encoder_kwargs.update(\n        {argument[len(\"encoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"encoder_\")}\n    )\n    decoder_kwargs.update(\n        {argument[len(\"decoder_\") :]: value for argument, value in kwargs.items() if argument.startswith(\"decoder_\")}\n    )\n    decoder_kwargs[\"encoder_attention_mask\"] = encoder_kwargs.get(\"attention_mask\", None)\n    return encoder_kwargs, decoder_kwargs\n"
  },
  {
    "path": "code/nezha-cn-base/config.json",
    "content": "{\n  \"attention_probs_dropout_prob\": 0.1,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,\n  \"hidden_size\": 768,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 3072,\n  \"max_position_embeddings\": 512,\n  \"max_relative_position\": 64,\n  \"num_attention_heads\": 12,\n  \"num_hidden_layers\": 12,\n  \"type_vocab_size\": 2,\n  \"vocab_size\": 21128,\n  \"use_relative_position\": true\n}\n"
  },
  {
    "path": "code/nezha-cn-base/vocab.txt",
    "content": "[PAD]\n[unused1]\n[unused2]\n[unused3]\n[unused4]\n[unused5]\n[unused6]\n[unused7]\n[unused8]\n[unused9]\n[unused10]\n[unused11]\n[unused12]\n[unused13]\n[unused14]\n[unused15]\n[unused16]\n[unused17]\n[unused18]\n[unused19]\n[unused20]\n[unused21]\n[unused22]\n[unused23]\n[unused24]\n[unused25]\n[unused26]\n[unused27]\n[unused28]\n[unused29]\n[unused30]\n[unused31]\n[unused32]\n[unused33]\n[unused34]\n[unused35]\n[unused36]\n[unused37]\n[unused38]\n[unused39]\n[unused40]\n[unused41]\n[unused42]\n[unused43]\n[unused44]\n[unused45]\n[unused46]\n[unused47]\n[unused48]\n[unused49]\n[unused50]\n[unused51]\n[unused52]\n[unused53]\n[unused54]\n[unused55]\n[unused56]\n[unused57]\n[unused58]\n[unused59]\n[unused60]\n[unused61]\n[unused62]\n[unused63]\n[unused64]\n[unused65]\n[unused66]\n[unused67]\n[unused68]\n[unused69]\n[unused70]\n[unused71]\n[unused72]\n[unused73]\n[unused74]\n[unused75]\n[unused76]\n[unused77]\n[unused78]\n[unused79]\n[unused80]\n[unused81]\n[unused82]\n[unused83]\n[unused84]\n[unused85]\n[unused86]\n[unused87]\n[unused88]\n[unused89]\n[unused90]\n[unused91]\n[unused92]\n[unused93]\n[unused94]\n[unused95]\n[unused96]\n[unused97]\n[unused98]\n[unused99]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\n<S>\n<T>\n!\n\"\n#\n$\n%\n&\n'\n(\n)\n*\n+\n,\n-\n.\n/\n0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n:\n;\n<\n=\n>\n?\n@\n[\n\\\n]\n^\n_\na\nb\nc\nd\ne\nf\ng\nh\ni\nj\nk\nl\nm\nn\no\np\nq\nr\ns\nt\nu\nv\nw\nx\ny\nz\n{\n|\n}\n~\n£\n¤\n¥\n§\n©\n«\n®\n°\n±\n²\n³\nµ\n·\n¹\nº\n»\n¼\n×\nß\næ\n÷\nø\nđ\nŋ\nɔ\nə\nɡ\nʰ\nˇ\nˈ\nˊ\nˋ\nˍ\nː\n˙\n˚\nˢ\nα\nβ\nγ\nδ\nε\nη\nθ\nι\nκ\nλ\nμ\nν\nο\nπ\nρ\nς\nσ\nτ\nυ\nφ\nχ\nψ\nω\nа\nб\nв\nг\nд\nе\nж\nз\nи\nк\nл\nм\nн\nо\nп\nр\nс\nт\nу\nф\nх\nц\nч\nш\nы\nь\nя\nі\nا\nب\nة\nت\nد\nر\nس\nع\nل\nم\nن\nه\nو\nي\n۩\nก\nง\nน\nม\nย\nร\nอ\nา\nเ\n๑\n་\nღ\nᄀ\nᄁ\nᄂ\nᄃ\nᄅ\nᄆ\nᄇ\nᄈ\nᄉ\nᄋ\nᄌ\nᄎ\nᄏ\nᄐ\nᄑ\nᄒ\nᅡ\nᅢ\nᅣ\nᅥ\nᅦ\nᅧ\nᅨ\nᅩ\nᅪ\nᅬ\nᅭ\nᅮ\nᅯ\nᅲ\nᅳ\nᅴ\nᅵ\nᆨ\nᆫ\nᆯ\nᆷ\nᆸ\nᆺ\nᆻ\nᆼ\nᗜ\nᵃ\nᵉ\nᵍ\nᵏ\nᵐ\nᵒ\nᵘ\n‖\n„\n†\n•\n‥\n‧\n 
\n‰\n′\n″\n‹\n›\n※\n‿\n⁄\nⁱ\n⁺\nⁿ\n₁\n₂\n₃\n₄\n€\n℃\n№\n™\nⅰ\nⅱ\nⅲ\nⅳ\nⅴ\n←\n↑\n→\n↓\n↔\n↗\n↘\n⇒\n∀\n−\n∕\n∙\n√\n∞\n∟\n∠\n∣\n∥\n∩\n∮\n∶\n∼\n∽\n≈\n≒\n≡\n≤\n≥\n≦\n≧\n≪\n≫\n⊙\n⋅\n⋈\n⋯\n⌒\n①\n②\n③\n④\n⑤\n⑥\n⑦\n⑧\n⑨\n⑩\n⑴\n⑵\n⑶\n⑷\n⑸\n⒈\n⒉\n⒊\n⒋\nⓒ\nⓔ\nⓘ\n─\n━\n│\n┃\n┅\n┆\n┊\n┌\n└\n├\n┣\n═\n║\n╚\n╞\n╠\n╭\n╮\n╯\n╰\n╱\n╳\n▂\n▃\n▅\n▇\n█\n▉\n▋\n▌\n▍\n▎\n■\n□\n▪\n▫\n▬\n▲\n△\n▶\n►\n▼\n▽\n◆\n◇\n○\n◎\n●\n◕\n◠\n◢\n◤\n☀\n★\n☆\n☕\n☞\n☺\n☼\n♀\n♂\n♠\n♡\n♣\n♥\n♦\n♪\n♫\n♬\n✈\n✔\n✕\n✖\n✦\n✨\n✪\n✰\n✿\n❀\n❤\n➜\n➤\n⦿\n、\n。\n〃\n々\n〇\n〈\n〉\n《\n》\n「\n」\n『\n』\n【\n】\n〓\n〔\n〕\n〖\n〗\n〜\n〝\n〞\nぁ\nあ\nぃ\nい\nう\nぇ\nえ\nお\nか\nき\nく\nけ\nこ\nさ\nし\nす\nせ\nそ\nた\nち\nっ\nつ\nて\nと\nな\nに\nぬ\nね\nの\nは\nひ\nふ\nへ\nほ\nま\nみ\nむ\nめ\nも\nゃ\nや\nゅ\nゆ\nょ\nよ\nら\nり\nる\nれ\nろ\nわ\nを\nん\n゜\nゝ\nァ\nア\nィ\nイ\nゥ\nウ\nェ\nエ\nォ\nオ\nカ\nキ\nク\nケ\nコ\nサ\nシ\nス\nセ\nソ\nタ\nチ\nッ\nツ\nテ\nト\nナ\nニ\nヌ\nネ\nノ\nハ\nヒ\nフ\nヘ\nホ\nマ\nミ\nム\nメ\nモ\nャ\nヤ\nュ\nユ\nョ\nヨ\nラ\nリ\nル\nレ\nロ\nワ\nヲ\nン\nヶ\n・\nー\nヽ\nㄅ\nㄆ\nㄇ\nㄉ\nㄋ\nㄌ\nㄍ\nㄎ\nㄏ\nㄒ\nㄚ\nㄛ\nㄞ\nㄟ\nㄢ\nㄤ\nㄥ\nㄧ\nㄨ\nㆍ\n㈦\n㊣\n㎡\n㗎\n一\n丁\n七\n万\n丈\n三\n上\n下\n不\n与\n丐\n丑\n专\n且\n丕\n世\n丘\n丙\n业\n丛\n东\n丝\n丞\n丟\n両\n丢\n两\n严\n並\n丧\n丨\n个\n丫\n中\n丰\n串\n临\n丶\n丸\n丹\n为\n主\n丼\n丽\n举\n丿\n乂\n乃\n久\n么\n义\n之\n乌\n乍\n乎\n乏\n乐\n乒\n乓\n乔\n乖\n乗\n乘\n乙\n乜\n九\n乞\n也\n习\n乡\n书\n乩\n买\n乱\n乳\n乾\n亀\n亂\n了\n予\n争\n事\n二\n于\n亏\n云\n互\n五\n井\n亘\n亙\n亚\n些\n亜\n亞\n亟\n亡\n亢\n交\n亥\n亦\n产\n亨\n亩\n享\n京\n亭\n亮\n亲\n亳\n亵\n人\n亿\n什\n仁\n仃\n仄\n仅\n仆\n仇\n今\n介\n仍\n从\n仏\n仑\n仓\n仔\n仕\n他\n仗\n付\n仙\n仝\n仞\n仟\n代\n令\n以\n仨\n仪\n们\n仮\n仰\n仲\n件\n价\n任\n份\n仿\n企\n伉\n伊\n伍\n伎\n伏\n伐\n休\n伕\n众\n优\n伙\n会\n伝\n伞\n伟\n传\n伢\n伤\n伦\n伪\n伫\n伯\n估\n伴\n伶\n伸\n伺\n似\n伽\n佃\n但\n佇\n佈\n位\n低\n住\n佐\n佑\n体\n佔\n何\n佗\n佘\n余\n佚\n佛\n作\n佝\n佞\n佟\n你\n佢\n佣\n佤\n佥\n佩\n佬\n佯\n佰\n佳\n併\n佶\n佻\n佼\n使\n侃\n侄\n來\n侈\n例\n侍\n侏\n侑\n侖\n侗\n供\n依\n侠\n価\n侣\n侥\n侦\n侧\n侨\n侬\n侮\n侯\n侵\n侶\n侷\n便\n係\n促\n俄\n俊\n俎\n俏\n俐\n俑\n俗\n俘\n俚\n保\n俞\n俟\n俠\n信\n俨\n俩\n俪\n俬\n俭\n修\n俯\n俱\n俳\n俸\n俺\n俾\n倆\n倉\n個\n倌\n倍\n倏\n們\n倒\n倔\n倖\n倘\n候\n倚\n倜\n借\n倡\n値\n倦\n倩\n倪\n倫\n倬\n倭\n倶\n债\n值\n倾\n偃\n假\n偈\n偉\n偌\n偎\n偏\n偕\n做\n停\n健\n側\n偵\n偶\n偷\n偻\n偽\n偿\n傀\n傅\n傍\n傑\n傘\n備\n傚\n傢\n傣\n傥\n储\n傩\n催\n傭\n傲\n傳\n債\n傷\n傻\n傾\n僅\n働\n像\n僑\n僕\n僖\n僚\n僥\n僧\n僭\n僮\n僱\n僵\n價\n僻\n儀\n儂\n億\n儆\n儉\n儋\n儒\n儕\n儘\n償\n儡\n優\n儲\n儷\n儼\n儿\n兀\n允\n元\n兄\n充\n兆\n兇\n先\n光\n克\n兌\n免\n児\n兑\n兒\n兔\n兖\n党\n兜\n兢\n入\n內\n全\n兩\n八\n公\n六\n兮\n兰\n共\n兲\n关\n兴\n兵\n其\n具\n典\n兹\n养\n兼\n兽\n冀\n内\n円\n冇\n冈\n冉\n冊\n册\n再\n冏\n冒\n冕\n冗\n写\n军\n农\n冠\n冢\n冤\n冥\n冨\n冪\n冬\n冯\n冰\n冲\n决\n况\n冶\n冷\n冻\n冼\n冽\n冾\n净\n凄\n准\n凇\n凈\n凉\n凋\n凌\n凍\n减\n凑\n凛\n凜\n凝\n几\n凡\n凤\n処\n凪\n凭\n凯\n凰\n凱\n凳\n凶\n凸\n凹\n出\n击\n函\n凿\n刀\n刁\n刃\n分\n切\n刈\n刊\n刍\n刎\n刑\n划\n列\n刘\n则\n刚\n创\n初\n删\n判\n別\n刨\n利\n刪\n别\n刮\n到\n制\n刷\n券\n刹\n刺\n刻\n刽\n剁\n剂\n剃\n則\n剉\n削\n剋\n剌\n前\n剎\n剐\n剑\n剔\n剖\n剛\n剜\n剝\n剣\n剤\n剥\n剧\n剩\n剪\n副\n割\n創\n剷\n剽\n剿\n劃\n劇\n劈\n劉\n劊\n劍\n劏\n劑\n力\n劝\n办\n功\n加\n务\n劣\n动\n助\n努\n劫\n劭\n励\n劲\n劳\n労\n劵\n効\n劾\n势\n勁\n勃\n勇\n勉\n勋\n勐\n勒\n動\n勖\n勘\n務\n勛\n勝\n勞\n募\n勢\n勤\n勧\n勳\n勵\n勸\n勺\n勻\n勾\n勿\n匀\n包\n匆\n匈\n匍\n匐\n匕\n化\n北\n匙\n匝\n匠\n匡\n匣\n匪\n匮\n匯\n匱\n匹\n区\n医\n匾\n匿\n區\n十\n千\n卅\n升\n午\n卉\n半\n卍\n华\n协\n卑\n卒\n卓\n協\n单\n卖\n南\n単\n博\n卜\n卞\n卟\n占\n卡\n卢\n卤\n卦\n卧\n卫\n卮\n卯\n印\n危\n即\n却\n卵\n卷\n卸\n卻\n卿\n厂\n厄\n厅\n历\n厉\n压\n厌\n厕\n厘\n厚\n厝\n原\n厢\n厥\n厦\n厨\n厩\n厭\n厮\n厲\n厳\n去\n县\n叁\n参\n參\n又\n叉\n及\n友\n双\n反\n収\n发\n叔\n取\n受\n变\n叙\n叛\n叟\n叠\n叡\n叢\n口\n古\n句\n另\n叨\n叩\n只\n叫\n召\n叭\n叮\n可\n台\n叱\n史\n右\n叵\n叶\n号\n司\n叹\n叻\n叼\n叽\n吁\n吃\n各\n吆\n合\n吉\n吊\n吋\n同\n名\n后\n吏\n吐\n向\n吒\n吓\n吕\n吖\n吗\n君\n吝\n吞\n吟\n吠\n吡\n否\n吧\n吨\n吩\n含\n听\n吭\n吮\n启\n吱\n吳\n吴\n吵\n吶\n吸\n吹\n吻\n吼\n吽\n吾\n呀\n呂\n呃\n呆\n呈\n告\n呋\n呎\n呐\n呓\n呕\n呗\n员\n呛\n呜\n呢\n呤\n呦\n周\n呱\n呲\n味\n呵\n呷\n呸\n呻\n呼\n命\n咀\n咁\n咂\n咄\n咆\n咋\n和\n咎\n咏\n咐\n咒\n咔\n咕\n咖\n咗\n咘\n咙\n咚\n咛\n咣\n咤\n咦\n咧\n咨\n咩\n咪\n咫\n咬\n咭\n咯\n咱\n咲\n咳\n咸\n咻\n咽\n咿\n哀\n品\n哂\n哄\n哆\n哇\n哈\n哉\n哋\n哌\n响\n哎\n哏\n哐\n哑\n哒\n哔\n哗\n哟\n員\n哥\n哦\n哧\n哨\n哩\n哪\n哭\n哮\n哲
\n哺\n哼\n哽\n唁\n唄\n唆\n唇\n唉\n唏\n唐\n唑\n唔\n唠\n唤\n唧\n唬\n售\n唯\n唰\n唱\n唳\n唷\n唸\n唾\n啃\n啄\n商\n啉\n啊\n問\n啓\n啕\n啖\n啜\n啞\n啟\n啡\n啤\n啥\n啦\n啧\n啪\n啫\n啬\n啮\n啰\n啱\n啲\n啵\n啶\n啷\n啸\n啻\n啼\n啾\n喀\n喂\n喃\n善\n喆\n喇\n喉\n喊\n喋\n喎\n喏\n喔\n喘\n喙\n喚\n喜\n喝\n喟\n喧\n喪\n喫\n喬\n單\n喰\n喱\n喲\n喳\n喵\n営\n喷\n喹\n喺\n喻\n喽\n嗅\n嗆\n嗇\n嗎\n嗑\n嗒\n嗓\n嗔\n嗖\n嗚\n嗜\n嗝\n嗟\n嗡\n嗣\n嗤\n嗦\n嗨\n嗪\n嗬\n嗯\n嗰\n嗲\n嗳\n嗶\n嗷\n嗽\n嘀\n嘅\n嘆\n嘈\n嘉\n嘌\n嘍\n嘎\n嘔\n嘖\n嘗\n嘘\n嘚\n嘛\n嘜\n嘞\n嘟\n嘢\n嘣\n嘤\n嘧\n嘩\n嘭\n嘮\n嘯\n嘰\n嘱\n嘲\n嘴\n嘶\n嘸\n嘹\n嘻\n嘿\n噁\n噌\n噎\n噓\n噔\n噗\n噙\n噜\n噠\n噢\n噤\n器\n噩\n噪\n噬\n噱\n噴\n噶\n噸\n噹\n噻\n噼\n嚀\n嚇\n嚎\n嚏\n嚐\n嚓\n嚕\n嚟\n嚣\n嚥\n嚨\n嚮\n嚴\n嚷\n嚼\n囂\n囉\n囊\n囍\n囑\n囔\n囗\n囚\n四\n囝\n回\n囟\n因\n囡\n团\n団\n囤\n囧\n囪\n囫\n园\n困\n囱\n囲\n図\n围\n囹\n固\n国\n图\n囿\n圃\n圄\n圆\n圈\n國\n圍\n圏\n園\n圓\n圖\n團\n圜\n土\n圣\n圧\n在\n圩\n圭\n地\n圳\n场\n圻\n圾\n址\n坂\n均\n坊\n坍\n坎\n坏\n坐\n坑\n块\n坚\n坛\n坝\n坞\n坟\n坠\n坡\n坤\n坦\n坨\n坪\n坯\n坳\n坵\n坷\n垂\n垃\n垄\n型\n垒\n垚\n垛\n垠\n垢\n垣\n垦\n垩\n垫\n垭\n垮\n垵\n埂\n埃\n埋\n城\n埔\n埕\n埗\n域\n埠\n埤\n埵\n執\n埸\n培\n基\n埼\n堀\n堂\n堃\n堅\n堆\n堇\n堑\n堕\n堙\n堡\n堤\n堪\n堯\n堰\n報\n場\n堵\n堺\n堿\n塊\n塌\n塑\n塔\n塗\n塘\n塚\n塞\n塢\n塩\n填\n塬\n塭\n塵\n塾\n墀\n境\n墅\n墉\n墊\n墒\n墓\n増\n墘\n墙\n墜\n增\n墟\n墨\n墩\n墮\n墳\n墻\n墾\n壁\n壅\n壆\n壇\n壊\n壑\n壓\n壕\n壘\n壞\n壟\n壢\n壤\n壩\n士\n壬\n壮\n壯\n声\n売\n壳\n壶\n壹\n壺\n壽\n处\n备\n変\n复\n夏\n夔\n夕\n外\n夙\n多\n夜\n够\n夠\n夢\n夥\n大\n天\n太\n夫\n夭\n央\n夯\n失\n头\n夷\n夸\n夹\n夺\n夾\n奂\n奄\n奇\n奈\n奉\n奋\n奎\n奏\n奐\n契\n奔\n奕\n奖\n套\n奘\n奚\n奠\n奢\n奥\n奧\n奪\n奬\n奮\n女\n奴\n奶\n奸\n她\n好\n如\n妃\n妄\n妆\n妇\n妈\n妊\n妍\n妒\n妓\n妖\n妘\n妙\n妝\n妞\n妣\n妤\n妥\n妨\n妩\n妪\n妮\n妲\n妳\n妹\n妻\n妾\n姆\n姉\n姊\n始\n姍\n姐\n姑\n姒\n姓\n委\n姗\n姚\n姜\n姝\n姣\n姥\n姦\n姨\n姪\n姫\n姬\n姹\n姻\n姿\n威\n娃\n娄\n娅\n娆\n娇\n娉\n娑\n娓\n娘\n娛\n娜\n娟\n娠\n娣\n娥\n娩\n娱\n娲\n娴\n娶\n娼\n婀\n婁\n婆\n婉\n婊\n婕\n婚\n婢\n婦\n婧\n婪\n婭\n婴\n婵\n婶\n婷\n婺\n婿\n媒\n媚\n媛\n媞\n媧\n媲\n媳\n媽\n媾\n嫁\n嫂\n嫉\n嫌\n嫑\n嫔\n嫖\n嫘\n嫚\n嫡\n嫣\n嫦\n嫩\n嫲\n嫵\n嫻\n嬅\n嬉\n嬌\n嬗\n嬛\n嬢\n嬤\n嬪\n嬰\n嬴\n嬷\n嬸\n嬿\n孀\n孃\n子\n孑\n孔\n孕\n孖\n字\n存\n孙\n孚\n孛\n孜\n孝\n孟\n孢\n季\n孤\n学\n孩\n孪\n孫\n孬\n孰\n孱\n孳\n孵\n學\n孺\n孽\n孿\n宁\n它\n宅\n宇\n守\n安\n宋\n完\n宏\n宓\n宕\n宗\n官\n宙\n定\n宛\n宜\n宝\n实\n実\n宠\n审\n客\n宣\n室\n宥\n宦\n宪\n宫\n宮\n宰\n害\n宴\n宵\n家\n宸\n容\n宽\n宾\n宿\n寂\n寄\n寅\n密\n寇\n富\n寐\n寒\n寓\n寛\n寝\n寞\n察\n寡\n寢\n寥\n實\n寧\n寨\n審\n寫\n寬\n寮\n寰\n寵\n寶\n寸\n对\n寺\n寻\n导\n対\n寿\n封\n専\n射\n将\n將\n專\n尉\n尊\n尋\n對\n導\n小\n少\n尔\n尕\n尖\n尘\n尚\n尝\n尤\n尧\n尬\n就\n尴\n尷\n尸\n尹\n尺\n尻\n尼\n尽\n尾\n尿\n局\n屁\n层\n屄\n居\n屆\n屈\n屉\n届\n屋\n屌\n屍\n屎\n屏\n屐\n屑\n展\n屜\n属\n屠\n屡\n屢\n層\n履\n屬\n屯\n山\n屹\n屿\n岀\n岁\n岂\n岌\n岐\n岑\n岔\n岖\n岗\n岘\n岙\n岚\n岛\n岡\n岩\n岫\n岬\n岭\n岱\n岳\n岷\n岸\n峇\n峋\n峒\n峙\n峡\n峤\n峥\n峦\n峨\n峪\n峭\n峯\n峰\n峴\n島\n峻\n峽\n崁\n崂\n崆\n崇\n崎\n崑\n崔\n崖\n崗\n崙\n崛\n崧\n崩\n崭\n崴\n崽\n嵇\n嵊\n嵋\n嵌\n嵐\n嵘\n嵩\n嵬\n嵯\n嶂\n嶄\n嶇\n嶋\n嶙\n嶺\n嶼\n嶽\n巅\n巍\n巒\n巔\n巖\n川\n州\n巡\n巢\n工\n左\n巧\n巨\n巩\n巫\n差\n己\n已\n巳\n巴\n巷\n巻\n巽\n巾\n巿\n币\n市\n布\n帅\n帆\n师\n希\n帐\n帑\n帕\n帖\n帘\n帚\n帛\n帜\n帝\n帥\n带\n帧\n師\n席\n帮\n帯\n帰\n帳\n帶\n帷\n常\n帼\n帽\n幀\n幂\n幄\n幅\n幌\n幔\n幕\n幟\n幡\n幢\n幣\n幫\n干\n平\n年\n并\n幸\n幹\n幺\n幻\n幼\n幽\n幾\n广\n庁\n広\n庄\n庆\n庇\n床\n序\n庐\n库\n应\n底\n庖\n店\n庙\n庚\n府\n庞\n废\n庠\n度\n座\n庫\n庭\n庵\n庶\n康\n庸\n庹\n庾\n廁\n廂\n廃\n廈\n廉\n廊\n廓\n廖\n廚\n廝\n廟\n廠\n廢\n廣\n廬\n廳\n延\n廷\n建\n廿\n开\n弁\n异\n弃\n弄\n弈\n弊\n弋\n式\n弑\n弒\n弓\n弔\n引\n弗\n弘\n弛\n弟\n张\n弥\n弦\n弧\n弩\n弭\n弯\n弱\n張\n強\n弹\n强\n弼\n弾\n彅\n彆\n彈\n彌\n彎\n归\n当\n录\n彗\n彙\n彝\n形\n彤\n彥\n彦\n彧\n彩\n彪\n彫\n彬\n彭\n彰\n影\n彷\n役\n彻\n彼\n彿\n往\n征\n径\n待\n徇\n很\n徉\n徊\n律\n後\n徐\n徑\n徒\n従\n徕\n得\n徘\n徙\n徜\n從\n徠\n御\n徨\n復\n循\n徬\n微\n徳\n徴\n徵\n德\n徹\n徼\n徽\n心\n必\n忆\n忌\n忍\n忏\n忐\n忑\n忒\n忖\n志\n忘\n忙\n応\n忠\n忡\n忤\n忧\n忪\n快\n忱\n念\n忻\n忽\n忿\n怀\n态\n怂\n怅\n怆\n怎\n怏\n怒\n怔\n怕\n怖\n怙\n怜\n思\n怠\n怡\n急\n怦\n性\n怨\n怪\n怯\n怵\n总\n怼\n恁\n恃\n恆\n恋\n恍\n恐\n恒\n恕\n恙\n恚\n恢\n恣\n恤\n恥\n恨\n恩\n恪\n恫\n恬\n恭\n息\n恰\n恳\n恵\n恶\n恸\n恺\n恻\n恼\n恿\n悄\n悅\n悉\n悌\n悍\n悔\n悖\n悚\n悟\n悠\n患\n悦\n您\n悩\n悪\n悬\n悯\n悱\n悲\n悴\n悵\n悶\n悸\n悻\n悼\n悽\n情\n惆\n惇\n惊\n惋\n惑\n惕\n惘\n惚\n惜\n惟\n惠\n惡\n惦\n惧\n惨\n惩\n惫\n惬\n惭\n惮\n惯\n惰\n惱\n想\n惴\n惶\n惹\n惺\n愁\n愆\n愈\n愉\n愍\n意\n愕\n愚\n愛\n愜\n感\n愣\n愤\n愧\n愫\n愷\n愿\n慄\n慈\n態\n慌\n慎\n慑\n慕\n慘\n慚\n慟
\n慢\n慣\n慧\n慨\n慫\n慮\n慰\n慳\n慵\n慶\n慷\n慾\n憂\n憊\n憋\n憎\n憐\n憑\n憔\n憚\n憤\n憧\n憨\n憩\n憫\n憬\n憲\n憶\n憾\n懂\n懇\n懈\n應\n懊\n懋\n懑\n懒\n懦\n懲\n懵\n懶\n懷\n懸\n懺\n懼\n懾\n懿\n戀\n戈\n戊\n戌\n戍\n戎\n戏\n成\n我\n戒\n戕\n或\n战\n戚\n戛\n戟\n戡\n戦\n截\n戬\n戮\n戰\n戲\n戳\n戴\n戶\n户\n戸\n戻\n戾\n房\n所\n扁\n扇\n扈\n扉\n手\n才\n扎\n扑\n扒\n打\n扔\n払\n托\n扛\n扣\n扦\n执\n扩\n扪\n扫\n扬\n扭\n扮\n扯\n扰\n扱\n扳\n扶\n批\n扼\n找\n承\n技\n抄\n抉\n把\n抑\n抒\n抓\n投\n抖\n抗\n折\n抚\n抛\n抜\n択\n抟\n抠\n抡\n抢\n护\n报\n抨\n披\n抬\n抱\n抵\n抹\n押\n抽\n抿\n拂\n拄\n担\n拆\n拇\n拈\n拉\n拋\n拌\n拍\n拎\n拐\n拒\n拓\n拔\n拖\n拗\n拘\n拙\n拚\n招\n拜\n拟\n拡\n拢\n拣\n拥\n拦\n拧\n拨\n择\n括\n拭\n拮\n拯\n拱\n拳\n拴\n拷\n拼\n拽\n拾\n拿\n持\n挂\n指\n挈\n按\n挎\n挑\n挖\n挙\n挚\n挛\n挝\n挞\n挟\n挠\n挡\n挣\n挤\n挥\n挨\n挪\n挫\n振\n挲\n挹\n挺\n挽\n挾\n捂\n捅\n捆\n捉\n捋\n捌\n捍\n捎\n捏\n捐\n捕\n捞\n损\n捡\n换\n捣\n捧\n捨\n捩\n据\n捱\n捲\n捶\n捷\n捺\n捻\n掀\n掂\n掃\n掇\n授\n掉\n掌\n掏\n掐\n排\n掖\n掘\n掙\n掛\n掠\n採\n探\n掣\n接\n控\n推\n掩\n措\n掬\n掰\n掲\n掳\n掴\n掷\n掸\n掺\n揀\n揃\n揄\n揆\n揉\n揍\n描\n提\n插\n揖\n揚\n換\n握\n揣\n揩\n揪\n揭\n揮\n援\n揶\n揸\n揹\n揽\n搀\n搁\n搂\n搅\n損\n搏\n搐\n搓\n搔\n搖\n搗\n搜\n搞\n搡\n搪\n搬\n搭\n搵\n搶\n携\n搽\n摀\n摁\n摄\n摆\n摇\n摈\n摊\n摒\n摔\n摘\n摞\n摟\n摧\n摩\n摯\n摳\n摸\n摹\n摺\n摻\n撂\n撃\n撅\n撇\n撈\n撐\n撑\n撒\n撓\n撕\n撚\n撞\n撤\n撥\n撩\n撫\n撬\n播\n撮\n撰\n撲\n撵\n撷\n撸\n撻\n撼\n撿\n擀\n擁\n擂\n擄\n擅\n擇\n擊\n擋\n操\n擎\n擒\n擔\n擘\n據\n擞\n擠\n擡\n擢\n擦\n擬\n擰\n擱\n擲\n擴\n擷\n擺\n擼\n擾\n攀\n攏\n攒\n攔\n攘\n攙\n攜\n攝\n攞\n攢\n攣\n攤\n攥\n攪\n攫\n攬\n支\n收\n攸\n改\n攻\n放\n政\n故\n效\n敌\n敍\n敎\n敏\n救\n敕\n敖\n敗\n敘\n教\n敛\n敝\n敞\n敢\n散\n敦\n敬\n数\n敲\n整\n敵\n敷\n數\n斂\n斃\n文\n斋\n斌\n斎\n斐\n斑\n斓\n斗\n料\n斛\n斜\n斟\n斡\n斤\n斥\n斧\n斩\n斫\n斬\n断\n斯\n新\n斷\n方\n於\n施\n旁\n旃\n旅\n旋\n旌\n旎\n族\n旖\n旗\n无\n既\n日\n旦\n旧\n旨\n早\n旬\n旭\n旮\n旱\n时\n旷\n旺\n旻\n昀\n昂\n昆\n昇\n昉\n昊\n昌\n明\n昏\n易\n昔\n昕\n昙\n星\n映\n春\n昧\n昨\n昭\n是\n昱\n昴\n昵\n昶\n昼\n显\n晁\n時\n晃\n晉\n晋\n晌\n晏\n晒\n晓\n晔\n晕\n晖\n晗\n晚\n晝\n晞\n晟\n晤\n晦\n晨\n晩\n普\n景\n晰\n晴\n晶\n晷\n智\n晾\n暂\n暄\n暇\n暈\n暉\n暌\n暐\n暑\n暖\n暗\n暝\n暢\n暧\n暨\n暫\n暮\n暱\n暴\n暸\n暹\n曄\n曆\n曇\n曉\n曖\n曙\n曜\n曝\n曠\n曦\n曬\n曰\n曲\n曳\n更\n書\n曹\n曼\n曾\n替\n最\n會\n月\n有\n朋\n服\n朐\n朔\n朕\n朗\n望\n朝\n期\n朦\n朧\n木\n未\n末\n本\n札\n朮\n术\n朱\n朴\n朵\n机\n朽\n杀\n杂\n权\n杆\n杈\n杉\n李\n杏\n材\n村\n杓\n杖\n杜\n杞\n束\n杠\n条\n来\n杨\n杭\n杯\n杰\n東\n杳\n杵\n杷\n杼\n松\n板\n极\n构\n枇\n枉\n枋\n析\n枕\n林\n枚\n果\n枝\n枢\n枣\n枪\n枫\n枭\n枯\n枰\n枱\n枳\n架\n枷\n枸\n柄\n柏\n某\n柑\n柒\n染\n柔\n柘\n柚\n柜\n柞\n柠\n柢\n查\n柩\n柬\n柯\n柱\n柳\n柴\n柵\n査\n柿\n栀\n栃\n栄\n栅\n标\n栈\n栉\n栋\n栎\n栏\n树\n栓\n栖\n栗\n校\n栩\n株\n样\n核\n根\n格\n栽\n栾\n桀\n桁\n桂\n桃\n桅\n框\n案\n桉\n桌\n桎\n桐\n桑\n桓\n桔\n桜\n桠\n桡\n桢\n档\n桥\n桦\n桧\n桨\n桩\n桶\n桿\n梁\n梅\n梆\n梏\n梓\n梗\n條\n梟\n梢\n梦\n梧\n梨\n梭\n梯\n械\n梳\n梵\n梶\n检\n棂\n棄\n棉\n棋\n棍\n棒\n棕\n棗\n棘\n棚\n棟\n棠\n棣\n棧\n森\n棱\n棲\n棵\n棹\n棺\n椁\n椅\n椋\n植\n椎\n椒\n検\n椪\n椭\n椰\n椹\n椽\n椿\n楂\n楊\n楓\n楔\n楚\n楝\n楞\n楠\n楣\n楨\n楫\n業\n楮\n極\n楷\n楸\n楹\n楼\n楽\n概\n榄\n榆\n榈\n榉\n榔\n榕\n榖\n榛\n榜\n榨\n榫\n榭\n榮\n榱\n榴\n榷\n榻\n槁\n槃\n構\n槌\n槍\n槎\n槐\n槓\n様\n槛\n槟\n槤\n槭\n槲\n槳\n槻\n槽\n槿\n樁\n樂\n樊\n樑\n樓\n標\n樞\n樟\n模\n樣\n権\n横\n樫\n樯\n樱\n樵\n樸\n樹\n樺\n樽\n樾\n橄\n橇\n橋\n橐\n橘\n橙\n機\n橡\n橢\n橫\n橱\n橹\n橼\n檀\n檄\n檎\n檐\n檔\n檗\n檜\n檢\n檬\n檯\n檳\n檸\n檻\n櫃\n櫚\n櫛\n櫥\n櫸\n櫻\n欄\n權\n欒\n欖\n欠\n次\n欢\n欣\n欧\n欲\n欸\n欺\n欽\n款\n歆\n歇\n歉\n歌\n歎\n歐\n歓\n歙\n歛\n歡\n止\n正\n此\n步\n武\n歧\n歩\n歪\n歯\n歲\n歳\n歴\n歷\n歸\n歹\n死\n歼\n殁\n殃\n殆\n殇\n殉\n殊\n残\n殒\n殓\n殖\n殘\n殞\n殡\n殤\n殭\n殯\n殲\n殴\n段\n殷\n殺\n殼\n殿\n毀\n毁\n毂\n毅\n毆\n毋\n母\n毎\n每\n毒\n毓\n比\n毕\n毗\n毘\n毙\n毛\n毡\n毫\n毯\n毽\n氈\n氏\n氐\n民\n氓\n气\n氖\n気\n氙\n氛\n氟\n氡\n氢\n氣\n氤\n氦\n氧\n氨\n氪\n氫\n氮\n氯\n氰\n氲\n水\n氷\n永\n氹\n氾\n汀\n汁\n求\n汆\n汇\n汉\n汎\n汐\n汕\n汗\n汙\n汛\n汝\n汞\n江\n池\n污\n汤\n汨\n汩\n汪\n汰\n汲\n汴\n汶\n汹\n決\n汽\n汾\n沁\n沂\n沃\n沅\n沈\n沉\n沌\n沏\n沐\n沒\n沓\n沖\n沙\n沛\n沟\n没\n沢\n沣\n沥\n沦\n沧\n沪\n沫\n沭\n沮\n沱\n河\n沸\n油\n治\n沼\n沽\n沾\n沿\n況\n泄\n泉\n泊\n泌\n泓\n法\n泗\n泛\n泞\n泠\n泡\n波\n泣\n泥\n注\n泪\n泫\n泮\n泯\n泰\n泱\n泳\n泵\n泷\n泸\n泻\n泼\n泽\n泾\n洁\n洄\n洋\n洒\n洗\n洙\n洛\n洞\n津\n洩\n洪\n洮\n洱\n洲\n洵\n洶\n洸\n洹\n活\n洼\n洽\n派\n流\n浃\n浄\n浅\n浆\n浇\n浊\n测\n济\n浏\n浑\n浒\n浓\n浔\n浙\n浚\n浜\n浣\n浦\n浩\n浪\n浬\n浮\n浯\n浴\n海\n浸\n涂\n涅\n涇\n消\n涉\n涌\n涎\n涓\n涔\n涕\n涙\n涛\n涝\n涞\n涟\n涠\n涡\n涣\n涤\n润\n涧\n涨\n涩\n涪\n涮\n涯\n液\n涵\n涸\n涼\n涿\n淀\n淄\n淅\n淆
\n淇\n淋\n淌\n淑\n淒\n淖\n淘\n淙\n淚\n淞\n淡\n淤\n淦\n淨\n淩\n淪\n淫\n淬\n淮\n深\n淳\n淵\n混\n淹\n淺\n添\n淼\n清\n済\n渉\n渊\n渋\n渍\n渎\n渐\n渔\n渗\n渙\n渚\n減\n渝\n渠\n渡\n渣\n渤\n渥\n渦\n温\n測\n渭\n港\n渲\n渴\n游\n渺\n渾\n湃\n湄\n湊\n湍\n湖\n湘\n湛\n湟\n湧\n湫\n湮\n湯\n湳\n湾\n湿\n満\n溃\n溅\n溉\n溏\n源\n準\n溜\n溝\n溟\n溢\n溥\n溧\n溪\n溫\n溯\n溱\n溴\n溶\n溺\n溼\n滁\n滂\n滄\n滅\n滇\n滋\n滌\n滑\n滓\n滔\n滕\n滙\n滚\n滝\n滞\n滟\n满\n滢\n滤\n滥\n滦\n滨\n滩\n滬\n滯\n滲\n滴\n滷\n滸\n滾\n滿\n漁\n漂\n漆\n漉\n漏\n漓\n演\n漕\n漠\n漢\n漣\n漩\n漪\n漫\n漬\n漯\n漱\n漲\n漳\n漸\n漾\n漿\n潆\n潇\n潋\n潍\n潑\n潔\n潘\n潛\n潜\n潞\n潟\n潢\n潤\n潦\n潧\n潭\n潮\n潰\n潴\n潸\n潺\n潼\n澀\n澄\n澆\n澈\n澍\n澎\n澗\n澜\n澡\n澤\n澧\n澱\n澳\n澹\n激\n濁\n濂\n濃\n濑\n濒\n濕\n濘\n濛\n濟\n濠\n濡\n濤\n濫\n濬\n濮\n濯\n濱\n濺\n濾\n瀅\n瀆\n瀉\n瀋\n瀏\n瀑\n瀕\n瀘\n瀚\n瀛\n瀝\n瀞\n瀟\n瀧\n瀨\n瀬\n瀰\n瀾\n灌\n灏\n灑\n灘\n灝\n灞\n灣\n火\n灬\n灭\n灯\n灰\n灵\n灶\n灸\n灼\n災\n灾\n灿\n炀\n炁\n炅\n炉\n炊\n炎\n炒\n炔\n炕\n炖\n炙\n炜\n炫\n炬\n炭\n炮\n炯\n炳\n炷\n炸\n点\n為\n炼\n炽\n烁\n烂\n烃\n烈\n烊\n烏\n烘\n烙\n烛\n烟\n烤\n烦\n烧\n烨\n烩\n烫\n烬\n热\n烯\n烷\n烹\n烽\n焉\n焊\n焕\n焖\n焗\n焘\n焙\n焚\n焜\n無\n焦\n焯\n焰\n焱\n然\n焼\n煅\n煉\n煊\n煌\n煎\n煒\n煖\n煙\n煜\n煞\n煤\n煥\n煦\n照\n煨\n煩\n煮\n煲\n煸\n煽\n熄\n熊\n熏\n熒\n熔\n熙\n熟\n熠\n熨\n熬\n熱\n熵\n熹\n熾\n燁\n燃\n燄\n燈\n燉\n燊\n燎\n燒\n燔\n燕\n燙\n燜\n營\n燥\n燦\n燧\n燭\n燮\n燴\n燻\n燼\n燿\n爆\n爍\n爐\n爛\n爪\n爬\n爭\n爰\n爱\n爲\n爵\n父\n爷\n爸\n爹\n爺\n爻\n爽\n爾\n牆\n片\n版\n牌\n牍\n牒\n牙\n牛\n牝\n牟\n牠\n牡\n牢\n牦\n牧\n物\n牯\n牲\n牴\n牵\n特\n牺\n牽\n犀\n犁\n犄\n犊\n犍\n犒\n犢\n犧\n犬\n犯\n状\n犷\n犸\n犹\n狀\n狂\n狄\n狈\n狎\n狐\n狒\n狗\n狙\n狞\n狠\n狡\n狩\n独\n狭\n狮\n狰\n狱\n狸\n狹\n狼\n狽\n猎\n猕\n猖\n猗\n猙\n猛\n猜\n猝\n猥\n猩\n猪\n猫\n猬\n献\n猴\n猶\n猷\n猾\n猿\n獄\n獅\n獎\n獐\n獒\n獗\n獠\n獣\n獨\n獭\n獰\n獲\n獵\n獷\n獸\n獺\n獻\n獼\n獾\n玄\n率\n玉\n王\n玑\n玖\n玛\n玟\n玠\n玥\n玩\n玫\n玮\n环\n现\n玲\n玳\n玷\n玺\n玻\n珀\n珂\n珅\n珈\n珉\n珊\n珍\n珏\n珐\n珑\n珙\n珞\n珠\n珣\n珥\n珩\n珪\n班\n珮\n珲\n珺\n現\n球\n琅\n理\n琇\n琉\n琊\n琍\n琏\n琐\n琛\n琢\n琥\n琦\n琨\n琪\n琬\n琮\n琰\n琲\n琳\n琴\n琵\n琶\n琺\n琼\n瑀\n瑁\n瑄\n瑋\n瑕\n瑗\n瑙\n瑚\n瑛\n瑜\n瑞\n瑟\n瑠\n瑣\n瑤\n瑩\n瑪\n瑯\n瑰\n瑶\n瑾\n璀\n璁\n璃\n璇\n璉\n璋\n璎\n璐\n璜\n璞\n璟\n璧\n璨\n環\n璽\n璿\n瓊\n瓏\n瓒\n瓜\n瓢\n瓣\n瓤\n瓦\n瓮\n瓯\n瓴\n瓶\n瓷\n甄\n甌\n甕\n甘\n甙\n甚\n甜\n生\n產\n産\n甥\n甦\n用\n甩\n甫\n甬\n甭\n甯\n田\n由\n甲\n申\n电\n男\n甸\n町\n画\n甾\n畀\n畅\n界\n畏\n畑\n畔\n留\n畜\n畝\n畢\n略\n畦\n番\n畫\n異\n畲\n畳\n畴\n當\n畸\n畹\n畿\n疆\n疇\n疊\n疏\n疑\n疔\n疖\n疗\n疙\n疚\n疝\n疟\n疡\n疣\n疤\n疥\n疫\n疮\n疯\n疱\n疲\n疳\n疵\n疸\n疹\n疼\n疽\n疾\n痂\n病\n症\n痈\n痉\n痊\n痍\n痒\n痔\n痕\n痘\n痙\n痛\n痞\n痠\n痢\n痣\n痤\n痧\n痨\n痪\n痫\n痰\n痱\n痴\n痹\n痺\n痼\n痿\n瘀\n瘁\n瘋\n瘍\n瘓\n瘘\n瘙\n瘟\n瘠\n瘡\n瘢\n瘤\n瘦\n瘧\n瘩\n瘪\n瘫\n瘴\n瘸\n瘾\n療\n癇\n癌\n癒\n癖\n癜\n癞\n癡\n癢\n癣\n癥\n癫\n癬\n癮\n癱\n癲\n癸\n発\n登\n發\n白\n百\n皂\n的\n皆\n皇\n皈\n皋\n皎\n皑\n皓\n皖\n皙\n皚\n皮\n皰\n皱\n皴\n皺\n皿\n盂\n盃\n盅\n盆\n盈\n益\n盎\n盏\n盐\n监\n盒\n盔\n盖\n盗\n盘\n盛\n盜\n盞\n盟\n盡\n監\n盤\n盥\n盧\n盪\n目\n盯\n盱\n盲\n直\n相\n盹\n盼\n盾\n省\n眈\n眉\n看\n県\n眙\n眞\n真\n眠\n眦\n眨\n眩\n眯\n眶\n眷\n眸\n眺\n眼\n眾\n着\n睁\n睇\n睏\n睐\n睑\n睛\n睜\n睞\n睡\n睢\n督\n睥\n睦\n睨\n睪\n睫\n睬\n睹\n睽\n睾\n睿\n瞄\n瞅\n瞇\n瞋\n瞌\n瞎\n瞑\n瞒\n瞓\n瞞\n瞟\n瞠\n瞥\n瞧\n瞩\n瞪\n瞬\n瞭\n瞰\n瞳\n瞻\n瞼\n瞿\n矇\n矍\n矗\n矚\n矛\n矜\n矢\n矣\n知\n矩\n矫\n短\n矮\n矯\n石\n矶\n矽\n矾\n矿\n码\n砂\n砌\n砍\n砒\n研\n砖\n砗\n砚\n砝\n砣\n砥\n砧\n砭\n砰\n砲\n破\n砷\n砸\n砺\n砼\n砾\n础\n硅\n硐\n硒\n硕\n硝\n硫\n硬\n确\n硯\n硼\n碁\n碇\n碉\n碌\n碍\n碎\n碑\n碓\n碗\n碘\n碚\n碛\n碟\n碣\n碧\n碩\n碰\n碱\n碳\n碴\n確\n碼\n碾\n磁\n磅\n磊\n磋\n磐\n磕\n磚\n磡\n磨\n磬\n磯\n磲\n磷\n磺\n礁\n礎\n礙\n礡\n礦\n礪\n礫\n礴\n示\n礼\n社\n祀\n祁\n祂\n祇\n祈\n祉\n祎\n祐\n祕\n祖\n祗\n祚\n祛\n祜\n祝\n神\n祟\n祠\n祢\n祥\n票\n祭\n祯\n祷\n祸\n祺\n祿\n禀\n禁\n禄\n禅\n禍\n禎\n福\n禛\n禦\n禧\n禪\n禮\n禱\n禹\n禺\n离\n禽\n禾\n禿\n秀\n私\n秃\n秆\n秉\n秋\n种\n科\n秒\n秘\n租\n秣\n秤\n秦\n秧\n秩\n秭\n积\n称\n秸\n移\n秽\n稀\n稅\n程\n稍\n税\n稔\n稗\n稚\n稜\n稞\n稟\n稠\n稣\n種\n稱\n稲\n稳\n稷\n稹\n稻\n稼\n稽\n稿\n穀\n穂\n穆\n穌\n積\n穎\n穗\n穢\n穩\n穫\n穴\n究\n穷\n穹\n空\n穿\n突\n窃\n窄\n窈\n窍\n窑\n窒\n窓\n窕\n窖\n窗\n窘\n窜\n窝\n窟\n窠\n窥\n窦\n窨\n窩\n窪\n窮\n窯\n窺\n窿\n竄\n竅\n竇\n竊\n立\n竖\n站\n竜\n竞\n竟\n章\n竣\n童\n竭\n端\n競\n竹\n竺\n竽\n竿\n笃\n笆\n笈\n笋\n笏\n笑\n笔\n笙\n笛\n笞\n笠\n符\n笨\n第\n笹\n笺\n笼\n筆\n等\n筊\n筋\n筍\n筏\n筐\n筑\n筒\n答\n策\n筛\n筝\n筠\n筱\n筲\n筵\n筷\n筹\n签\n简\n箇\n箋\n箍\n箏\n箐\n箔\n箕\n算\n箝\n管\n箩\n箫\n箭\n箱\n箴\n箸\n節\n篁\n範\n篆\n篇\n築\n篑\n篓\n篙\n篝\n篠\n篡\n篤\n篩\n篪\n篮\n篱\n篷\n簇\n簌\n簍\n簡\n簦\n簧\n簪
\n簫\n簷\n簸\n簽\n簾\n簿\n籁\n籃\n籌\n籍\n籐\n籟\n籠\n籤\n籬\n籮\n籲\n米\n类\n籼\n籽\n粄\n粉\n粑\n粒\n粕\n粗\n粘\n粟\n粤\n粥\n粧\n粪\n粮\n粱\n粲\n粳\n粵\n粹\n粼\n粽\n精\n粿\n糅\n糊\n糍\n糕\n糖\n糗\n糙\n糜\n糞\n糟\n糠\n糧\n糬\n糯\n糰\n糸\n系\n糾\n紀\n紂\n約\n紅\n紉\n紊\n紋\n納\n紐\n紓\n純\n紗\n紘\n紙\n級\n紛\n紜\n素\n紡\n索\n紧\n紫\n紮\n累\n細\n紳\n紹\n紺\n終\n絃\n組\n絆\n経\n結\n絕\n絞\n絡\n絢\n給\n絨\n絮\n統\n絲\n絳\n絵\n絶\n絹\n綁\n綏\n綑\n經\n継\n続\n綜\n綠\n綢\n綦\n綫\n綬\n維\n綱\n網\n綴\n綵\n綸\n綺\n綻\n綽\n綾\n綿\n緊\n緋\n総\n緑\n緒\n緘\n線\n緝\n緞\n締\n緣\n編\n緩\n緬\n緯\n練\n緹\n緻\n縁\n縄\n縈\n縛\n縝\n縣\n縫\n縮\n縱\n縴\n縷\n總\n績\n繁\n繃\n繆\n繇\n繋\n織\n繕\n繚\n繞\n繡\n繩\n繪\n繫\n繭\n繳\n繹\n繼\n繽\n纂\n續\n纍\n纏\n纓\n纔\n纖\n纜\n纠\n红\n纣\n纤\n约\n级\n纨\n纪\n纫\n纬\n纭\n纯\n纰\n纱\n纲\n纳\n纵\n纶\n纷\n纸\n纹\n纺\n纽\n纾\n线\n绀\n练\n组\n绅\n细\n织\n终\n绊\n绍\n绎\n经\n绑\n绒\n结\n绔\n绕\n绘\n给\n绚\n绛\n络\n绝\n绞\n统\n绡\n绢\n绣\n绥\n绦\n继\n绩\n绪\n绫\n续\n绮\n绯\n绰\n绳\n维\n绵\n绶\n绷\n绸\n绻\n综\n绽\n绾\n绿\n缀\n缄\n缅\n缆\n缇\n缈\n缉\n缎\n缓\n缔\n缕\n编\n缘\n缙\n缚\n缜\n缝\n缠\n缢\n缤\n缥\n缨\n缩\n缪\n缭\n缮\n缰\n缱\n缴\n缸\n缺\n缽\n罂\n罄\n罌\n罐\n网\n罔\n罕\n罗\n罚\n罡\n罢\n罩\n罪\n置\n罰\n署\n罵\n罷\n罹\n羁\n羅\n羈\n羊\n羌\n美\n羔\n羚\n羞\n羟\n羡\n羣\n群\n羥\n羧\n羨\n義\n羯\n羲\n羸\n羹\n羽\n羿\n翁\n翅\n翊\n翌\n翎\n習\n翔\n翘\n翟\n翠\n翡\n翦\n翩\n翰\n翱\n翳\n翹\n翻\n翼\n耀\n老\n考\n耄\n者\n耆\n耋\n而\n耍\n耐\n耒\n耕\n耗\n耘\n耙\n耦\n耨\n耳\n耶\n耷\n耸\n耻\n耽\n耿\n聂\n聆\n聊\n聋\n职\n聒\n联\n聖\n聘\n聚\n聞\n聪\n聯\n聰\n聲\n聳\n聴\n聶\n職\n聽\n聾\n聿\n肃\n肄\n肅\n肆\n肇\n肉\n肋\n肌\n肏\n肓\n肖\n肘\n肚\n肛\n肝\n肠\n股\n肢\n肤\n肥\n肩\n肪\n肮\n肯\n肱\n育\n肴\n肺\n肽\n肾\n肿\n胀\n胁\n胃\n胄\n胆\n背\n胍\n胎\n胖\n胚\n胛\n胜\n胝\n胞\n胡\n胤\n胥\n胧\n胫\n胭\n胯\n胰\n胱\n胳\n胴\n胶\n胸\n胺\n能\n脂\n脅\n脆\n脇\n脈\n脉\n脊\n脍\n脏\n脐\n脑\n脓\n脖\n脘\n脚\n脛\n脣\n脩\n脫\n脯\n脱\n脲\n脳\n脸\n脹\n脾\n腆\n腈\n腊\n腋\n腌\n腎\n腐\n腑\n腓\n腔\n腕\n腥\n腦\n腩\n腫\n腭\n腮\n腰\n腱\n腳\n腴\n腸\n腹\n腺\n腻\n腼\n腾\n腿\n膀\n膈\n膊\n膏\n膑\n膘\n膚\n膛\n膜\n膝\n膠\n膦\n膨\n膩\n膳\n膺\n膻\n膽\n膾\n膿\n臀\n臂\n臃\n臆\n臉\n臊\n臍\n臓\n臘\n臟\n臣\n臥\n臧\n臨\n自\n臬\n臭\n至\n致\n臺\n臻\n臼\n臾\n舀\n舂\n舅\n舆\n與\n興\n舉\n舊\n舌\n舍\n舎\n舐\n舒\n舔\n舖\n舗\n舛\n舜\n舞\n舟\n航\n舫\n般\n舰\n舱\n舵\n舶\n舷\n舸\n船\n舺\n舾\n艇\n艋\n艘\n艙\n艦\n艮\n良\n艰\n艱\n色\n艳\n艷\n艹\n艺\n艾\n节\n芃\n芈\n芊\n芋\n芍\n芎\n芒\n芙\n芜\n芝\n芡\n芥\n芦\n芩\n芪\n芫\n芬\n芭\n芮\n芯\n花\n芳\n芷\n芸\n芹\n芻\n芽\n芾\n苁\n苄\n苇\n苋\n苍\n苏\n苑\n苒\n苓\n苔\n苕\n苗\n苛\n苜\n苞\n苟\n苡\n苣\n若\n苦\n苫\n苯\n英\n苷\n苹\n苻\n茁\n茂\n范\n茄\n茅\n茉\n茎\n茏\n茗\n茜\n茧\n茨\n茫\n茬\n茭\n茯\n茱\n茲\n茴\n茵\n茶\n茸\n茹\n茼\n荀\n荃\n荆\n草\n荊\n荏\n荐\n荒\n荔\n荖\n荘\n荚\n荞\n荟\n荠\n荡\n荣\n荤\n荥\n荧\n荨\n荪\n荫\n药\n荳\n荷\n荸\n荻\n荼\n荽\n莅\n莆\n莉\n莊\n莎\n莒\n莓\n莖\n莘\n莞\n莠\n莢\n莧\n莪\n莫\n莱\n莲\n莴\n获\n莹\n莺\n莽\n莿\n菀\n菁\n菅\n菇\n菈\n菊\n菌\n菏\n菓\n菖\n菘\n菜\n菟\n菠\n菡\n菩\n華\n菱\n菲\n菸\n菽\n萁\n萃\n萄\n萊\n萋\n萌\n萍\n萎\n萘\n萝\n萤\n营\n萦\n萧\n萨\n萩\n萬\n萱\n萵\n萸\n萼\n落\n葆\n葉\n著\n葚\n葛\n葡\n董\n葦\n葩\n葫\n葬\n葭\n葯\n葱\n葳\n葵\n葷\n葺\n蒂\n蒋\n蒐\n蒔\n蒙\n蒜\n蒞\n蒟\n蒡\n蒨\n蒲\n蒸\n蒹\n蒻\n蒼\n蒿\n蓁\n蓄\n蓆\n蓉\n蓋\n蓑\n蓓\n蓖\n蓝\n蓟\n蓦\n蓬\n蓮\n蓼\n蓿\n蔑\n蔓\n蔔\n蔗\n蔘\n蔚\n蔡\n蔣\n蔥\n蔫\n蔬\n蔭\n蔵\n蔷\n蔺\n蔻\n蔼\n蔽\n蕁\n蕃\n蕈\n蕉\n蕊\n蕎\n蕙\n蕤\n蕨\n蕩\n蕪\n蕭\n蕲\n蕴\n蕻\n蕾\n薄\n薅\n薇\n薈\n薊\n薏\n薑\n薔\n薙\n薛\n薦\n薨\n薩\n薪\n薬\n薯\n薰\n薹\n藉\n藍\n藏\n藐\n藓\n藕\n藜\n藝\n藤\n藥\n藩\n藹\n藻\n藿\n蘆\n蘇\n蘊\n蘋\n蘑\n蘚\n蘭\n蘸\n蘼\n蘿\n虎\n虏\n虐\n虑\n虔\n處\n虚\n虛\n虜\n虞\n號\n虢\n虧\n虫\n虬\n虱\n虹\n虻\n虽\n虾\n蚀\n蚁\n蚂\n蚊\n蚌\n蚓\n蚕\n蚜\n蚝\n蚣\n蚤\n蚩\n蚪\n蚯\n蚱\n蚵\n蛀\n蛆\n蛇\n蛊\n蛋\n蛎\n蛐\n蛔\n蛙\n蛛\n蛟\n蛤\n蛭\n蛮\n蛰\n蛳\n蛹\n蛻\n蛾\n蜀\n蜂\n蜃\n蜆\n蜇\n蜈\n蜊\n蜍\n蜒\n蜓\n蜕\n蜗\n蜘\n蜚\n蜜\n蜡\n蜢\n蜥\n蜱\n蜴\n蜷\n蜻\n蜿\n蝇\n蝈\n蝉\n蝌\n蝎\n蝕\n蝗\n蝙\n蝟\n蝠\n蝦\n蝨\n蝴\n蝶\n蝸\n蝼\n螂\n螃\n融\n螞\n螢\n螨\n螯\n螳\n螺\n蟀\n蟄\n蟆\n蟋\n蟎\n蟑\n蟒\n蟠\n蟬\n蟲\n蟹\n蟻\n蟾\n蠅\n蠍\n蠔\n蠕\n蠛\n蠟\n蠡\n蠢\n蠣\n蠱\n蠶\n蠹\n蠻\n血\n衄\n衅\n衆\n行\n衍\n術\n衔\n街\n衙\n衛\n衝\n衞\n衡\n衢\n衣\n补\n表\n衩\n衫\n衬\n衮\n衰\n衲\n衷\n衹\n衾\n衿\n袁\n袂\n袄\n袅\n袈\n袋\n袍\n袒\n袖\n袜\n袞\n袤\n袪\n被\n袭\n袱\n裁\n裂\n装\n裆\n裊\n裏\n裔\n裕\n裘\n裙\n補\n裝\n裟\n裡\n裤\n裨\n裱\n裳\n裴\n裸\n裹\n製\n裾\n褂\n複\n褐\n褒\n褓\n褔\n褚\n褥\n褪\n褫\n褲\n褶\n褻\n襁\n襄\n襟\n襠\n襪\n襬\n襯\n襲\n西\n要\n覃\n覆\n覇\n見\n規\n覓\n視\n覚\n覦\n覧\n親\n覬\n観\n覷\n覺\n覽\n觀\n见\n观\n规\n觅\n视\n览\n觉\n觊\n觎\n觐\n觑\n角\n觞\n解\n觥\n触\n觸\n言\n訂\n計\n訊\n討\n訓\n訕\n訖\n託\n記\n訛\n訝\n訟\n訣\n訥\n訪\n設\n許\n訳\n訴\n訶\n診\n註\n証\n詆\n詐\n詔\n評
\n詛\n詞\n詠\n詡\n詢\n詣\n試\n詩\n詫\n詬\n詭\n詮\n詰\n話\n該\n詳\n詹\n詼\n誅\n誇\n誉\n誌\n認\n誓\n誕\n誘\n語\n誠\n誡\n誣\n誤\n誥\n誦\n誨\n說\n説\n読\n誰\n課\n誹\n誼\n調\n諄\n談\n請\n諏\n諒\n論\n諗\n諜\n諡\n諦\n諧\n諫\n諭\n諮\n諱\n諳\n諷\n諸\n諺\n諾\n謀\n謁\n謂\n謄\n謊\n謎\n謐\n謔\n謗\n謙\n講\n謝\n謠\n謨\n謬\n謹\n謾\n譁\n證\n譎\n譏\n識\n譙\n譚\n譜\n警\n譬\n譯\n議\n譲\n譴\n護\n譽\n讀\n變\n讓\n讚\n讞\n计\n订\n认\n讥\n讧\n讨\n让\n讪\n讫\n训\n议\n讯\n记\n讲\n讳\n讴\n讶\n讷\n许\n讹\n论\n讼\n讽\n设\n访\n诀\n证\n诃\n评\n诅\n识\n诈\n诉\n诊\n诋\n词\n诏\n译\n试\n诗\n诘\n诙\n诚\n诛\n话\n诞\n诟\n诠\n诡\n询\n诣\n诤\n该\n详\n诧\n诩\n诫\n诬\n语\n误\n诰\n诱\n诲\n说\n诵\n诶\n请\n诸\n诺\n读\n诽\n课\n诿\n谀\n谁\n调\n谄\n谅\n谆\n谈\n谊\n谋\n谌\n谍\n谎\n谏\n谐\n谑\n谒\n谓\n谔\n谕\n谗\n谘\n谙\n谚\n谛\n谜\n谟\n谢\n谣\n谤\n谥\n谦\n谧\n谨\n谩\n谪\n谬\n谭\n谯\n谱\n谲\n谴\n谶\n谷\n豁\n豆\n豇\n豈\n豉\n豊\n豌\n豎\n豐\n豔\n豚\n象\n豢\n豪\n豫\n豬\n豹\n豺\n貂\n貅\n貌\n貓\n貔\n貘\n貝\n貞\n負\n財\n貢\n貧\n貨\n販\n貪\n貫\n責\n貯\n貰\n貳\n貴\n貶\n買\n貸\n費\n貼\n貽\n貿\n賀\n賁\n賂\n賃\n賄\n資\n賈\n賊\n賑\n賓\n賜\n賞\n賠\n賡\n賢\n賣\n賤\n賦\n質\n賬\n賭\n賴\n賺\n購\n賽\n贅\n贈\n贊\n贍\n贏\n贓\n贖\n贛\n贝\n贞\n负\n贡\n财\n责\n贤\n败\n账\n货\n质\n贩\n贪\n贫\n贬\n购\n贮\n贯\n贰\n贱\n贲\n贴\n贵\n贷\n贸\n费\n贺\n贻\n贼\n贾\n贿\n赁\n赂\n赃\n资\n赅\n赈\n赊\n赋\n赌\n赎\n赏\n赐\n赓\n赔\n赖\n赘\n赚\n赛\n赝\n赞\n赠\n赡\n赢\n赣\n赤\n赦\n赧\n赫\n赭\n走\n赳\n赴\n赵\n赶\n起\n趁\n超\n越\n趋\n趕\n趙\n趟\n趣\n趨\n足\n趴\n趵\n趸\n趺\n趾\n跃\n跄\n跆\n跋\n跌\n跎\n跑\n跖\n跚\n跛\n距\n跟\n跡\n跤\n跨\n跩\n跪\n路\n跳\n践\n跷\n跹\n跺\n跻\n踉\n踊\n踌\n踏\n踐\n踝\n踞\n踟\n踢\n踩\n踪\n踮\n踱\n踴\n踵\n踹\n蹂\n蹄\n蹇\n蹈\n蹉\n蹊\n蹋\n蹑\n蹒\n蹙\n蹟\n蹣\n蹤\n蹦\n蹩\n蹬\n蹭\n蹲\n蹴\n蹶\n蹺\n蹼\n蹿\n躁\n躇\n躉\n躊\n躋\n躍\n躏\n躪\n身\n躬\n躯\n躲\n躺\n軀\n車\n軋\n軌\n軍\n軒\n軟\n転\n軸\n軼\n軽\n軾\n較\n載\n輒\n輓\n輔\n輕\n輛\n輝\n輟\n輩\n輪\n輯\n輸\n輻\n輾\n輿\n轄\n轅\n轆\n轉\n轍\n轎\n轟\n车\n轧\n轨\n轩\n转\n轭\n轮\n软\n轰\n轲\n轴\n轶\n轻\n轼\n载\n轿\n较\n辄\n辅\n辆\n辇\n辈\n辉\n辊\n辍\n辐\n辑\n输\n辕\n辖\n辗\n辘\n辙\n辛\n辜\n辞\n辟\n辣\n辦\n辨\n辩\n辫\n辭\n辮\n辯\n辰\n辱\n農\n边\n辺\n辻\n込\n辽\n达\n迁\n迂\n迄\n迅\n过\n迈\n迎\n运\n近\n返\n还\n这\n进\n远\n违\n连\n迟\n迢\n迤\n迥\n迦\n迩\n迪\n迫\n迭\n述\n迴\n迷\n迸\n迹\n迺\n追\n退\n送\n适\n逃\n逅\n逆\n选\n逊\n逍\n透\n逐\n递\n途\n逕\n逗\n這\n通\n逛\n逝\n逞\n速\n造\n逢\n連\n逮\n週\n進\n逵\n逶\n逸\n逻\n逼\n逾\n遁\n遂\n遅\n遇\n遊\n運\n遍\n過\n遏\n遐\n遑\n遒\n道\n達\n違\n遗\n遙\n遛\n遜\n遞\n遠\n遢\n遣\n遥\n遨\n適\n遭\n遮\n遲\n遴\n遵\n遶\n遷\n選\n遺\n遼\n遽\n避\n邀\n邁\n邂\n邃\n還\n邇\n邈\n邊\n邋\n邏\n邑\n邓\n邕\n邛\n邝\n邢\n那\n邦\n邨\n邪\n邬\n邮\n邯\n邰\n邱\n邳\n邵\n邸\n邹\n邺\n邻\n郁\n郅\n郊\n郎\n郑\n郜\n郝\n郡\n郢\n郤\n郦\n郧\n部\n郫\n郭\n郴\n郵\n郷\n郸\n都\n鄂\n鄉\n鄒\n鄔\n鄙\n鄞\n鄢\n鄧\n鄭\n鄰\n鄱\n鄲\n鄺\n酉\n酊\n酋\n酌\n配\n酐\n酒\n酗\n酚\n酝\n酢\n酣\n酥\n酩\n酪\n酬\n酮\n酯\n酰\n酱\n酵\n酶\n酷\n酸\n酿\n醃\n醇\n醉\n醋\n醍\n醐\n醒\n醚\n醛\n醜\n醞\n醣\n醪\n醫\n醬\n醮\n醯\n醴\n醺\n釀\n釁\n采\n釉\n释\n釋\n里\n重\n野\n量\n釐\n金\n釗\n釘\n釜\n針\n釣\n釦\n釧\n釵\n鈀\n鈉\n鈍\n鈎\n鈔\n鈕\n鈞\n鈣\n鈦\n鈪\n鈴\n鈺\n鈾\n鉀\n鉄\n鉅\n鉉\n鉑\n鉗\n鉚\n鉛\n鉤\n鉴\n鉻\n銀\n銃\n銅\n銑\n銓\n銖\n銘\n銜\n銬\n銭\n銮\n銳\n銷\n銹\n鋁\n鋅\n鋒\n鋤\n鋪\n鋰\n鋸\n鋼\n錄\n錐\n錘\n錚\n錠\n錢\n錦\n錨\n錫\n錮\n錯\n録\n錳\n錶\n鍊\n鍋\n鍍\n鍛\n鍥\n鍰\n鍵\n鍺\n鍾\n鎂\n鎊\n鎌\n鎏\n鎔\n鎖\n鎗\n鎚\n鎧\n鎬\n鎮\n鎳\n鏈\n鏖\n鏗\n鏘\n鏞\n鏟\n鏡\n鏢\n鏤\n鏽\n鐘\n鐮\n鐲\n鐳\n鐵\n鐸\n鐺\n鑄\n鑊\n鑑\n鑒\n鑣\n鑫\n鑰\n鑲\n鑼\n鑽\n鑾\n鑿\n针\n钉\n钊\n钎\n钏\n钒\n钓\n钗\n钙\n钛\n钜\n钝\n钞\n钟\n钠\n钡\n钢\n钣\n钤\n钥\n钦\n钧\n钨\n钩\n钮\n钯\n钰\n钱\n钳\n钴\n钵\n钺\n钻\n钼\n钾\n钿\n铀\n铁\n铂\n铃\n铄\n铅\n铆\n铉\n铎\n铐\n铛\n铜\n铝\n铠\n铡\n铢\n铣\n铤\n铨\n铩\n铬\n铭\n铮\n铰\n铲\n铵\n银\n铸\n铺\n链\n铿\n销\n锁\n锂\n锄\n锅\n锆\n锈\n锉\n锋\n锌\n锏\n锐\n锑\n错\n锚\n锟\n锡\n锢\n锣\n锤\n锥\n锦\n锭\n键\n锯\n锰\n锲\n锵\n锹\n锺\n锻\n镀\n镁\n镂\n镇\n镉\n镌\n镍\n镐\n镑\n镕\n镖\n镗\n镛\n镜\n镣\n镭\n镯\n镰\n镳\n镶\n長\n长\n門\n閃\n閉\n開\n閎\n閏\n閑\n閒\n間\n閔\n閘\n閡\n関\n閣\n閥\n閨\n閩\n閱\n閲\n閹\n閻\n閾\n闆\n闇\n闊\n闌\n闍\n闔\n闕\n闖\n闘\n關\n闡\n闢\n门\n闪\n闫\n闭\n问\n闯\n闰\n闲\n间\n闵\n闷\n闸\n闹\n闺\n闻\n闽\n闾\n阀\n阁\n阂\n阅\n阆\n阇\n阈\n阉\n阎\n阐\n阑\n阔\n阕\n阖\n阙\n阚\n阜\n队\n阡\n阪\n阮\n阱\n防\n阳\n阴\n阵\n阶\n阻\n阿\n陀\n陂\n附\n际\n陆\n陇\n陈\n陋\n陌\n降\n限\n陕\n陛\n陝\n陞\n陟\n陡\n院\n陣\n除\n陨\n险\n陪\n陰\n陲\n陳\n陵\n陶\n陷\n陸\n険\n陽\n隅\n隆\n隈\n隊\n隋\n隍\n階\n随\n隐\n隔\n隕\n隘\n隙\n際\n障\n隠\n隣\n隧\n隨\n險\n隱\n隴\n隶\n隸\n隻\n隼\n隽\n难\n雀\n雁\n雄\n雅\n集\n雇\n雉\n雋\n雌\n雍\n雎\n雏\n雑\n雒\n雕\n雖\n雙\n雛\n雜\n雞\n離\n難\n雨\n雪\n雯\n雰\n雲\n雳\n零\n雷\n雹\n電\n雾\n需\n霁\n霄\n霆\n震\n霈\n霉\n霊\n霍\n霎
\n霏\n霑\n霓\n霖\n霜\n霞\n霧\n霭\n霰\n露\n霸\n霹\n霽\n霾\n靂\n靄\n靈\n青\n靓\n靖\n静\n靚\n靛\n靜\n非\n靠\n靡\n面\n靥\n靦\n革\n靳\n靴\n靶\n靼\n鞅\n鞋\n鞍\n鞏\n鞑\n鞘\n鞠\n鞣\n鞦\n鞭\n韆\n韋\n韌\n韓\n韜\n韦\n韧\n韩\n韬\n韭\n音\n韵\n韶\n韻\n響\n頁\n頂\n頃\n項\n順\n須\n頌\n預\n頑\n頒\n頓\n頗\n領\n頜\n頡\n頤\n頫\n頭\n頰\n頷\n頸\n頹\n頻\n頼\n顆\n題\n額\n顎\n顏\n顔\n願\n顛\n類\n顧\n顫\n顯\n顱\n顴\n页\n顶\n顷\n项\n顺\n须\n顼\n顽\n顾\n顿\n颁\n颂\n预\n颅\n领\n颇\n颈\n颉\n颊\n颌\n颍\n颐\n频\n颓\n颔\n颖\n颗\n题\n颚\n颛\n颜\n额\n颞\n颠\n颡\n颢\n颤\n颦\n颧\n風\n颯\n颱\n颳\n颶\n颼\n飄\n飆\n风\n飒\n飓\n飕\n飘\n飙\n飚\n飛\n飞\n食\n飢\n飨\n飩\n飪\n飯\n飲\n飼\n飽\n飾\n餃\n餅\n餉\n養\n餌\n餐\n餒\n餓\n餘\n餚\n餛\n餞\n餡\n館\n餮\n餵\n餾\n饅\n饈\n饋\n饌\n饍\n饑\n饒\n饕\n饗\n饞\n饥\n饨\n饪\n饬\n饭\n饮\n饯\n饰\n饱\n饲\n饴\n饵\n饶\n饷\n饺\n饼\n饽\n饿\n馀\n馁\n馄\n馅\n馆\n馈\n馋\n馍\n馏\n馒\n馔\n首\n馗\n香\n馥\n馨\n馬\n馭\n馮\n馳\n馴\n駁\n駄\n駅\n駆\n駐\n駒\n駕\n駛\n駝\n駭\n駱\n駿\n騁\n騎\n騏\n験\n騙\n騨\n騰\n騷\n驀\n驅\n驊\n驍\n驒\n驕\n驗\n驚\n驛\n驟\n驢\n驥\n马\n驭\n驮\n驯\n驰\n驱\n驳\n驴\n驶\n驷\n驸\n驹\n驻\n驼\n驾\n驿\n骁\n骂\n骄\n骅\n骆\n骇\n骈\n骊\n骋\n验\n骏\n骐\n骑\n骗\n骚\n骛\n骜\n骞\n骠\n骡\n骤\n骥\n骧\n骨\n骯\n骰\n骶\n骷\n骸\n骼\n髂\n髅\n髋\n髏\n髒\n髓\n體\n髖\n高\n髦\n髪\n髮\n髯\n髻\n鬃\n鬆\n鬍\n鬓\n鬚\n鬟\n鬢\n鬣\n鬥\n鬧\n鬱\n鬼\n魁\n魂\n魄\n魅\n魇\n魍\n魏\n魔\n魘\n魚\n魯\n魷\n鮑\n鮨\n鮪\n鮭\n鮮\n鯉\n鯊\n鯖\n鯛\n鯨\n鯰\n鯽\n鰍\n鰓\n鰭\n鰲\n鰻\n鰾\n鱈\n鱉\n鱔\n鱗\n鱷\n鱸\n鱼\n鱿\n鲁\n鲈\n鲍\n鲑\n鲛\n鲜\n鲟\n鲢\n鲤\n鲨\n鲫\n鲱\n鲲\n鲶\n鲷\n鲸\n鳃\n鳄\n鳅\n鳌\n鳍\n鳕\n鳖\n鳗\n鳝\n鳞\n鳥\n鳩\n鳳\n鳴\n鳶\n鴉\n鴕\n鴛\n鴦\n鴨\n鴻\n鴿\n鵑\n鵜\n鵝\n鵡\n鵬\n鵰\n鵲\n鶘\n鶩\n鶯\n鶴\n鷗\n鷲\n鷹\n鷺\n鸚\n鸞\n鸟\n鸠\n鸡\n鸢\n鸣\n鸥\n鸦\n鸨\n鸪\n鸭\n鸯\n鸳\n鸵\n鸽\n鸾\n鸿\n鹂\n鹃\n鹄\n鹅\n鹈\n鹉\n鹊\n鹌\n鹏\n鹑\n鹕\n鹘\n鹜\n鹞\n鹤\n鹦\n鹧\n鹫\n鹭\n鹰\n鹳\n鹵\n鹹\n鹼\n鹽\n鹿\n麂\n麋\n麒\n麓\n麗\n麝\n麟\n麥\n麦\n麩\n麴\n麵\n麸\n麺\n麻\n麼\n麽\n麾\n黃\n黄\n黍\n黎\n黏\n黑\n黒\n黔\n默\n黛\n黜\n黝\n點\n黠\n黨\n黯\n黴\n鼋\n鼎\n鼐\n鼓\n鼠\n鼬\n鼹\n鼻\n鼾\n齁\n齊\n齋\n齐\n齒\n齡\n齢\n齣\n齦\n齿\n龄\n龅\n龈\n龊\n龋\n龌\n龍\n龐\n龔\n龕\n龙\n龚\n龛\n龜\n龟\n︰\n︱\n︶\n︿\n﹁\n﹂\n﹍\n﹏\n﹐\n﹑\n﹒\n﹔\n﹕\n﹖\n﹗\n﹙\n﹚\n﹝\n﹞\n﹡\n﹣\n！\n＂\n＃\n＄\n％\n＆\n＇\n（\n）\n＊\n＋\n，\n－\n．\n／\n０\n１\n２\n３\n４\n５\n６\n７\n８\n９\n：\n；\n＜\n＝\n＞\n？\n＠\n［\n＼\n］\n＾\n＿\n｀\nａ\nｂ\nｃ\nｄ\nｅ\nｆ\nｇ\nｈ\nｉ\nｊ\nｋ\nｌ\nｍ\nｎ\nｏ\nｐ\nｑ\nｒ\nｓ\nｔ\nｕ\nｖ\nｗ\nｘ\nｙ\nｚ\n｛\n｜\n｝\n～\n｡\n｢\n｣\n､\n･\nｯ\nｰ\nｲ\nｸ\nｼ\nｽ\nﾄ\nﾉ\nﾌ\nﾗ\nﾙ\nﾝ\nﾞ\nﾟ\n￣\n￥\n👍\n🔥\n😂\n😎\n...\nyam\n10\n2017\n12\n11\n2016\n20\n30\n15\n06\nlofter\n##s\n2015\nby\n16\n14\n18\n13\n24\n17\n2014\n21\n##0\n22\n19\n25\n23\ncom\n100\n00\n05\n2013\n##a\n03\n09\n08\n28\n##2\n50\n01\n04\n##1\n27\n02\n2012\n##3\n26\n##e\n07\n##8\n##5\n##6\n##4\n##9\n##7\n29\n2011\n40\n##t\n2010\n##o\n##d\n##i\n2009\n##n\napp\nwww\nthe\n##m\n31\n##c\n##l\n##y\n##r\n##g\n2008\n60\nhttp\n200\nqq\n##p\n80\n##f\ngoogle\npixnet\n90\ncookies\ntripadvisor\n500\n##er\n##k\n35\n##h\nfacebook\n2007\n2000\n70\n##b\nof\n##x\n##u\n45\n300\niphone\n32\n1000\n2006\n48\nip\n36\nin\n38\n3d\n##w\n##ing\n55\nctrip\n##on\n##v\n33\n##の\nto\n34\n400\nid\n2005\nit\n37\nwindows\nllc\ntop\n99\n42\n39\n000\nled\nat\n##an\n41\n51\n52\n46\n49\n43\n53\n44\n##z\nandroid\n58\nand\n59\n2004\n56\nvr\n##か\n5000\n2003\n47\nblogthis\ntwitter\n54\n##le\n150\nok\n2018\n57\n75\ncn\nno\nios\n##in\n##mm\n##00\n800\non\nte\n3000\n65\n2001\n360\n95\nig\nlv\n120\n##ng\n##を\n##us\n##に\npc\nてす\n──\n600\n##te\n85\n2002\n88\n##ed\nhtml\nncc\nwifi\nemail\n64\nblog\nis\n##10\n##て\nmail\nonline\n##al\ndvd\n##ic\nstudio\n##は\n##℃\n##ia\n##と\nline\nvip\n72\n##q\n98\n##ce\n##en\nfor\n##is\n##ra\n##es\n##j\nusb\nnet\ncp\n1999\nasia\n4g\n##cm\ndiy\nnew\n3c\n##お\nta\n66\nlanguage\nvs\napple\ntw\n86\nweb\n##ne\nipad\n62\nyou\n##re\n101\n68\n##tion\nps\nde\nbt\npony\natm\n##2017\n1998\n67\n##ch\nceo\n##or\ngo\n##na\nav\npro\ncafe\n96\npinterest\n97\n63\npixstyleme3c\n##ta\nmore\nsaid\n##2016\n1997\nmp3\n700\n##ll\nnba\njun\n##20\n92\ntv\n1995\npm\n61\n76\nnbsp\n250\n##ie\nlinux\n##ma\ncd\n110\nhd\n##17\n78\n##ion\n77\n6000\nam\n##th\n##st\n94\n
##se\n##et\n69\n180\ngdp\nmy\n105\n81\nabc\n89\nflash\n79\none\n93\n1990\n1996\n##ck\ngps\n##も\n##ly\nweb885\n106\n2020\n91\n##ge\n4000\n1500\nxd\nboss\nisbn\n1994\norg\n##ry\nme\nlove\n##11\n0fork\n73\n##12\n3g\n##ter\n##ar\n71\n82\n##la\nhotel\n130\n1970\npk\n83\n87\n140\nie\n##os\n##30\n##el\n74\n##50\nseo\ncpu\n##ml\np2p\n84\nmay\n##る\nsun\ntue\ninternet\ncc\nposted\nyoutube\n##at\n##ン\n##man\nii\n##ル\n##15\nabs\nnt\npdf\nyahoo\nago\n1980\n##it\nnews\nmac\n104\n##てす\n##me\n##り\njava\n1992\nspa\n##de\n##nt\nhk\nall\nplus\nla\n1993\n##mb\n##16\n##ve\nwest\n##da\n160\nair\n##い\n##ps\nから\n##to\n1989\nlogo\nhtc\nphp\nhttps\nfi\nmomo\n##son\nsat\n##ke\n##80\nebd\nsuv\nwi\nday\napk\n##88\n##um\nmv\ngalaxy\nwiki\nor\nbrake\n##ス\n1200\nする\nthis\n1991\nmon\n##こ\n❤2017\npo\n##ない\njavascript\nlife\nhome\njune\n##ss\nsystem\n900\n##ー\n##０\npp\n1988\nworld\nfb\n4k\nbr\n##as\nic\nai\nleonardo\nsafari\n##60\nlive\nfree\nxx\nwed\nwin7\nkiehl\n##co\nlg\no2o\n##go\nus\n235\n1949\nmm\nしい\nvfm\nkanye\n##90\n##2015\n##id\njr\n##ey\n123\nrss\n##sa\n##ro\n##am\n##no\nthu\nfri\n350\n##sh\n##ki\n103\ncomments\nname\n##のて\n##pe\n##ine\nmax\n1987\n8000\nuber\n##mi\n##ton\nwordpress\noffice\n1986\n1985\n##ment\n107\nbd\nwin10\n##ld\n##li\ngmail\nbb\ndior\n##rs\n##ri\n##rd\n##ます\nup\ncad\n##®\ndr\nして\nread\n##21\nをお\n##io\n##99\nurl\n1984\npvc\npaypal\nshow\npolicy\n##40\n##ty\n##18\nwith\n##★\n##01\ntxt\n102\n##ba\ndna\nfrom\npost\nmini\nar\ntaiwan\njohn\n##ga\nprivacy\nagoda\n##13\n##ny\nword\n##24\n##22\n##by\n##ur\n##hz\n1982\n##ang\n265\ncookie\nnetscape\n108\n##ka\n##～\n##ad\nhouse\nshare\nnote\nibm\ncode\nhello\nnike\nsim\nsurvey\n##016\n1979\n1950\nwikia\n##32\n##017\n5g\ncbc\n##tor\n##kg\n1983\n##rt\n##14\ncampaign\nstore\n2500\nos\n##ct\n##ts\n##°\n170\napi\n##ns\n365\nexcel\n##な\n##ao\n##ら\n##し\n～～\n##nd\nuniversity\n163\nには\n518\n##70\n##ya\n##il\n##25\npierre\nipo\n0020\n897\n##23\nhotels\n##ian\nのお\n125\nyears\n6606\n##ers\n##26\nhigh\n##day\ntime\n##ay\nbug\n##line\n##く\n##す\n##be\nxp\ntalk2yam\nyamservice\n10000\ncoco\n##dy\nsony\n##ies\n1978\nmicrosoft\ndavid\npeople\n##ha\n1960\ninstagram\nintel\nその\n##ot\niso\n1981\n##va\n115\n##mo\n##land\nxxx\nman\nco\nltxsw\n##ation\nbaby\n220\n##pa\n##ol\n1945\n7000\ntag\n450\n##ue\nmsn\n##31\noppo\n##ト\n##ca\ncontrol\n##om\nst\nchrome\n##ure\n##ん\nbe\n##き\nlol\n##19\nした\n##bo\n240\nlady\n##100\n##way\n##から\n4600\n##ko\n##do\n##un\n4s\ncorporation\n168\n##ni\nherme\n##28\nｃｐ\n978\n##up\n##06\nui\n##ds\nppt\nadmin\nthree\nします\nbbc\nre\n128\n##48\nca\n##015\n##35\nhp\n##ee\ntpp\n##た\n##ive\n××\nroot\n##cc\n##ました\n##ble\n##ity\nadobe\npark\n114\net\noled\ncity\n##ex\n##ler\n##ap\nchina\n##book\n20000\nview\n##ice\nglobal\n##km\nyour\nhong\n##mg\nout\n##ms\nng\nebay\n##29\nmenu\nubuntu\n##cy\nrom\n##view\nopen\nktv\ndo\nserver\n##lo\nif\nenglish\n##ね\n##５\n##oo\n1600\n##02\nstep1\nkong\nclub\n135\njuly\ninc\n1976\nmr\nhi\n##net\ntouch\n##ls\n##ii\nmichael\nlcd\n##05\n##33\nphone\njames\nstep2\n1300\nios9\n##box\ndc\n##２\n##ley\nsamsung\n111\n280\npokemon\ncss\n##ent\n##les\nいいえ\n##１\ns8\natom\nplay\nbmw\n##said\nsa\netf\nctrl\n♥yoyo♥\n##55\n2025\n##2014\n##66\nadidas\namazon\n1958\n##ber\n##ner\nvisa\n##77\n##der\n1800\nconnectivity\n##hi\nfirefox\n109\n118\nhr\nso\nstyle\nmark\npop\nol\nskip\n1975\nas\n##27\n##ir\n##61\n190\nmba\n##う\n##ai\nle\n##ver\n1900\ncafe2017\nlte\nsuper\n113\n129\n##ron\namd\nlike\n##☆\nare\n##ster\nwe\n##sk\npaul\ndata\ninternational\n##ft\nlongchamp\nssd\ngood\n##ート\n##ti\nreply\n##my\n↓↓↓\napr\nstar\n##ker\nsource\n136\njs\n112\nget\nf
orce\nphoto\n##one\n126\n##2013\n##ow\nlink\nbbs\n1972\ngoods\n##lin\npython\n119\n##ip\ngame\n##ics\n##ません\nblue\n##●\n520\n##45\npage\nitunes\n##03\n1955\n260\n1968\ngt\ngif\n618\n##ff\n##47\ngroup\nくたさい\nabout\nbar\nganji\n##nce\nmusic\nlee\nnot\n1977\n1971\n1973\n##per\nan\nfaq\ncomment\n##って\ndays\n##ock\n116\n##bs\n1974\n1969\nv1\nplayer\n1956\nxbox\nsql\nfm\nf1\n139\n##ah\n210\n##lv\n##mp\n##000\nmelody\n1957\n##３\n550\n17life\n199\n1966\nxml\nmarket\n##au\n##71\n999\n##04\nwhat\ngl\n##95\n##age\ntips\n##68\nbook\n##ting\nmysql\ncan\n1959\n230\n##ung\nwonderland\nwatch\n10℃\n##ction\n9000\nmar\nmobile\n1946\n1962\narticle\n##db\npart\n▲top\nparty\nって\n1967\n1964\n1948\n##07\n##ore\n##op\nこの\ndj\n##78\n##38\n010\nmain\n225\n1965\n##ong\nart\n320\nad\n134\n020\n##73\n117\npm2\njapan\n228\n##08\nts\n1963\n##ica\nder\nsm\n##36\n2019\n##wa\nct\n##７\n##や\n##64\n1937\nhomemesh\nsearch\n##85\n##れは\n##tv\n##di\nmacbook\n##９\n##くたさい\nservice\n##♥\ntype\nった\n750\n##ier\n##si\n##75\n##います\n##ok\nbest\n##ット\ngoris\nlock\n##った\ncf\n3m\nbig\n##ut\nftp\ncarol\n##vi\n１０\n1961\nhappy\nsd\n##ac\n122\nanti\npe\ncnn\niii\n1920\n138\n##ラ\n1940\nesp\njan\ntags\n##98\n##51\naugust\nvol\n##86\n154\n##™\n##fs\n##れ\n##sion\ndesign\nac\n##ム\npress\njordan\nppp\nthat\nkey\ncheck\n##６\n##tt\n##㎡\n1080p\n##lt\npower\n##42\n1952\n##bc\nvivi\n##ック\nhe\n133\n121\njpg\n##rry\n201\n175\n3500\n1947\nnb\n##ted\n##rn\nしています\n1954\nusd\n##t00\nmaster\n##ンク\n001\nmodel\n##58\nal\n##09\n1953\n##34\nram\ngoo\nても\n##ui\n127\n1930\nred\n##ary\nrpg\nitem\n##pm\n##41\n270\n##za\nproject\n##2012\nhot\ntd\nblogabstract\n##ger\n##62\n650\n##44\ngr2\n##します\n##ｍ\nblack\nelectronic\nnfc\nyear\nasus\nまた\nhtml5\ncindy\n##hd\nm3\n132\nesc\n##od\nbooking\n##53\nfed\ntvb\n##81\n##ina\nmit\n165\n##いる\nchan\n192\ndistribution\nnext\nになる\npeter\nbios\nsteam\ncm\n1941\nにも\npk10\n##ix\n##65\n##91\ndec\nnasa\n##ana\nicecat\n00z\nb1\nwill\n##46\nli\nse\n##ji\n##み\n##ard\noct\n##ain\njp\n##ze\n##bi\ncio\n##56\nsmart\nh5\n##39\n##port\ncurve\nvpn\n##nm\n##dia\nutc\n##あり\n12345678910\n##52\nrmvb\nchanel\na4\nmiss\n##and\n##im\nmedia\nwho\n##63\nshe\ngirl\n5s\n124\nvera\n##して\nclass\nvivo\nking\n##フ\n##ei\nnational\nab\n1951\n5cm\n888\n145\nipod\nap\n1100\n5mm\n211\nms\n2756\n##69\nmp4\nmsci\n##po\n##89\n131\nmg\nindex\n380\n##bit\n##out\n##zz\n##97\n##67\n158\napec\n##８\nphotoshop\nopec\n￥799\nては\n##96\n##tes\n##ast\n2g\n○○\n##ール\n￥2899\n##ling\n##よ\n##ory\n1938\n##ical\nkitty\ncontent\n##43\nstep3\n##cn\nwin8\n155\nvc\n1400\niphone7\nrobert\n##した\ntcl\n137\nbeauty\n##87\nen\ndollars\n##ys\n##oc\nstep\npay\nyy\na1\n##2011\n##lly\n##ks\n##♪\n1939\n188\ndownload\n1944\nsep\nexe\nph\nいます\nschool\ngb\ncenter\npr\nstreet\n##board\nuv\n##37\n##lan\nwinrar\n##que\n##ua\n##com\n1942\n1936\n480\ngpu\n##４\nettoday\nfu\ntom\n##54\n##ren\n##via\n149\n##72\nb2b\n144\n##79\n##tch\nrose\narm\nmb\n##49\n##ial\n##nn\nnvidia\nstep4\nmvp\n00㎡\nyork\n156\n##イ\nhow\ncpi\n591\n2765\ngov\nkg\njoe\n##xx\nmandy\npa\n##ser\ncopyright\nfashion\n1935\ndon\n##け\necu\n##ist\n##art\nerp\nwap\nhave\n##lm\ntalk\n##ek\n##ning\n##if\nch\n##ite\nvideo\n1943\ncs\nsan\niot\nlook\n##84\n##2010\n##ku\noctober\n##ux\ntrump\n##hs\n##ide\nbox\n141\nfirst\n##ins\napril\n##ight\n##83\n185\nangel\nprotected\naa\n151\n162\nx1\nm2\n##fe\n##×\n##ho\nsize\n143\nmin\nofo\nfun\ngomaji\nex\nhdmi\nfood\ndns\nmarch\nchris\nkevin\n##のか\n##lla\n##pp\n##ec\nag\nems\n6s\n720p\n##rm\n##ham\noff\n##92\nasp\nteam\nfandom\ned\n299\n▌♥\n##ell\ninfo\nされています\n##82\nsina\n4066\n161\n##able\n##ctor\n330\n399\n315\ndl
l\nrights\nltd\nidc\njul\n3kg\n1927\n142\nma\nsurface\n##76\n##ク\n～～～\n304\nmall\neps\n146\ngreen\n##59\nmap\nspace\ndonald\nv2\nsodu\n##light\n1931\n148\n1700\nまて\n310\nreserved\nhtm\n##han\n##57\n2d\n178\nmod\n##ise\n##tions\n152\nti\n##shi\ndoc\n1933\nicp\n055\nwang\n##ram\nshopping\naug\n##pi\n##well\nnow\nwam\nb2\nからお\n##hu\n236\n1928\n##gb\n266\nf2\n##93\n153\nmix\n##ef\n##uan\nbwl\n##plus\n##res\ncore\n##ess\ntea\n5℃\nhktvmall\nnhk\n##ate\nlist\n##ese\n301\nfeb\n4m\ninn\nての\nnov\n159\n12345\ndaniel\n##ci\npass\n##bet\n##nk\ncoffee\n202\nssl\nairbnb\n##ute\nfbi\nwoshipm\nskype\nea\ncg\nsp\n##fc\n##www\nyes\nedge\nalt\n007\n##94\nfpga\n##ght\n##gs\niso9001\nさい\n##ile\n##wood\n##uo\nimage\nlin\nicon\namerican\n##em\n1932\nset\nsays\n##king\n##tive\nblogger\n##74\nなと\n256\n147\n##ox\n##zy\n##red\n##ium\n##lf\nnokia\nclaire\n##リ\n##ding\nnovember\nlohas\n##500\n##tic\n##マ\n##cs\n##ある\n##che\n##ire\n##gy\n##ult\ndb\njanuary\nwin\n##カ\n166\nroad\nptt\n##ま\n##つ\n198\n##fa\n##mer\nanna\npchome\nはい\nudn\nef\n420\n##time\n##tte\n2030\n##ア\ng20\nwhite\nかかります\n1929\n308\ngarden\neleven\ndi\n##おります\nchen\n309b\n777\n172\nyoung\ncosplay\nちてない\n4500\nbat\n##123\n##tra\n##ては\nkindle\nnpc\nsteve\netc\n##ern\n##｜\ncall\nxperia\nces\ntravel\nsk\ns7\n##ous\n1934\n##int\nみいたたけます\n183\nedu\nfile\ncho\nqr\n##car\n##our\n186\n##ant\n##ｄ\neric\n1914\nrends\n##jo\n##する\nmastercard\n##2000\nkb\n##min\n290\n##ino\nvista\n##ris\n##ud\njack\n2400\n##set\n169\npos\n1912\n##her\n##ou\ntaipei\nしく\n205\nbeta\n##ませんか\n232\n##fi\nexpress\n255\nbody\n##ill\naphojoy\nuser\ndecember\nmeiki\n##ick\ntweet\nrichard\n##av\n##ᆫ\niphone6\n##dd\nちてすか\nviews\n##mark\n321\npd\n##００\ntimes\n##▲\nlevel\n##ash\n10g\npoint\n5l\n##ome\n208\nkoreanmall\n##ak\ngeorge\nq2\n206\nwma\ntcp\n##200\nスタッフ\nfull\nmlb\n##lle\n##watch\ntm\nrun\n179\n911\nsmith\nbusiness\n##und\n1919\ncolor\n##tal\n222\n171\n##less\nmoon\n4399\n##rl\nupdate\npcb\nshop\n499\n157\nlittle\nなし\nend\n##mhz\nvan\ndsp\neasy\n660\n##house\n##key\nhistory\n##ｏ\noh\n##001\n##hy\n##web\noem\nlet\nwas\n##2009\n##gg\nreview\n##wan\n182\n##°c\n203\nuc\ntitle\n##val\nunited\n233\n2021\n##ons\ndoi\ntrivago\noverdope\nsbs\n##ance\n##ち\ngrand\nspecial\n573032185\nimf\n216\nwx17house\n##so\n##ーム\naudi\n##he\nlondon\nwilliam\n##rp\n##ake\nscience\nbeach\ncfa\namp\nps4\n880\n##800\n##link\n##hp\ncrm\nferragamo\nbell\nmake\n##eng\n195\nunder\nzh\nphotos\n2300\n##style\n##ント\nvia\n176\nda\n##gi\ncompany\ni7\n##ray\nthomas\n370\nufo\ni5\n##max\nplc\nben\nback\nresearch\n8g\n173\nmike\n##pc\n##ッフ\nseptember\n189\n##ace\nvps\nfebruary\n167\npantos\nwp\nlisa\n1921\n★★\njquery\nnight\nlong\noffer\n##berg\n##news\n1911\n##いて\nray\nfks\nwto\nせます\nover\n164\n340\n##all\n##rus\n1924\n##888\n##works\nblogtitle\nloftpermalink\n##→\n187\nmartin\ntest\nling\nkm\n##め\n15000\nfda\nv3\n##ja\n##ロ\nｗedding\nかある\noutlet\nfamily\n##ea\nをこ\n##top\nstory\n##ness\nsalvatore\n##lu\n204\nswift\n215\nroom\nしている\noracle\n##ul\n1925\nsam\nb2c\nweek\npi\nrock\n##のは\n##ａ\n##けと\n##ean\n##300\n##gle\ncctv\nafter\nchinese\n##back\npowered\nx2\n##tan\n1918\n##nes\n##イン\ncanon\nonly\n181\n##zi\n##las\nsay\n##oe\n184\n##sd\n221\n##bot\n##world\n##zo\nsky\nmade\ntop100\njust\n1926\npmi\n802\n234\ngap\n##vr\n177\nles\n174\n▲topoct\nball\nvogue\nvi\ning\nofweek\ncos\n##list\n##ort\n▲topmay\n##なら\n##lon\nとして\nlast\n##tc\n##of\n##bus\n##gen\nreal\neva\n##コ\na3\nnas\n##lie\n##ria\n##coin\n##bt\n▲topapr\nhis\n212\ncat\nnata\nvive\nhealth\n⋯⋯\ndrive\nsir\n▲topmar\ndu\ncup\n##カー\n##ook\n##よう\n##sy\nalex\nmsg\ntour\nしました\n3ce\n##word
\n193\nebooks\nr8\nblock\n318\n##より\n2200\nnice\npvp\n207\nmonths\n1905\nrewards\n##ther\n1917\n0800\n##xi\n##チ\n##sc\nmicro\n850\ngg\nblogfp\nop\n1922\ndaily\nm1\n264\ntrue\n##bb\nml\n##tar\n##のお\n##ky\nanthony\n196\n253\n##yo\nstate\n218\n##ara\n##aa\n##rc\n##tz\n##ston\nより\ngear\n##eo\n##ade\nge\nsee\n1923\n##win\n##ura\nss\nheart\n##den\n##ita\ndown\n##sm\nel\npng\n2100\n610\nrakuten\nwhatsapp\nbay\ndream\nadd\n##use\n680\n311\npad\ngucci\nmpv\n##ode\n##fo\nisland\n▲topjun\n##▼\n223\njason\n214\nchicago\n##❤\nしの\n##hone\nio\n##れる\n##ことか\nsogo\nbe2\n##ology\n990\ncloud\nvcd\n##con\n2～3\n##ford\n##joy\n##kb\n##こさいます\n##rade\nbut\n##ach\ndocker\n##ful\nrfid\nul\n##ase\nhit\nford\n##star\n580\n##○\n１１\na2\nsdk\nreading\nedited\n##are\ncmos\n##mc\n238\nsiri\nlight\n##ella\n##ため\nbloomberg\n##read\npizza\n##ison\njimmy\n##vm\ncollege\nnode\njournal\nba\n18k\n##play\n245\n##cer\n２０\nmagic\n##yu\n191\njump\n288\ntt\n##ings\nasr\n##lia\n3200\nstep5\nnetwork\n##cd\nmc\nいします\n1234\npixstyleme\n273\n##600\n2800\nmoney\n★★★★★\n1280\n１２\n430\nbl\nみの\nact\n##tus\ntokyo\n##rial\n##life\nemba\n##ae\nsaas\ntcs\n##rk\n##wang\nsummer\n##sp\nko\n##ving\n390\npremium\n##その\nnetflix\n##ヒ\nuk\nmt\n##lton\nright\nfrank\ntwo\n209\nえる\n##ple\n##cal\n021\n##んな\n##sen\n##ville\nhold\nnexus\ndd\n##ius\nてお\n##mah\n##なく\ntila\nzero\n820\nce\n##tin\nresort\n##ws\ncharles\nold\np10\n5d\nreport\n##360\n##ru\n##には\nbus\nvans\nlt\n##est\npv\n##レ\nlinks\nrebecca\n##ツ\n##dm\nazure\n##365\nきな\nlimited\nbit\n4gb\n##mon\n1910\nmoto\n##eam\n213\n1913\nvar\neos\nなとの\n226\nblogspot\nされた\n699\ne3\ndos\ndm\nfc\n##ments\n##ik\n##kw\nboy\n##bin\n##ata\n960\ner\n##せ\n219\n##vin\n##tu\n##ula\n194\n##∥\nstation\n##ろ\n##ature\n835\nfiles\nzara\nhdr\ntop10\nnature\n950\nmagazine\ns6\nmarriott\n##シ\navira\ncase\n##っと\ntab\n##ran\ntony\n##home\noculus\nim\n##ral\njean\nsaint\ncry\n307\nrosie\n##force\n##ini\nice\n##bert\nのある\n##nder\n##mber\npet\n2600\n##◆\nplurk\n▲topdec\n##sis\n00kg\n▲topnov\n720\n##ence\ntim\n##ω\n##nc\n##ても\n##name\nlog\nips\ngreat\nikea\nmalaysia\nunix\n##イト\n3600\n##ncy\n##nie\n12000\nakb48\n##ye\n##oid\n404\n##chi\n##いた\noa\nxuehai\n##1000\n##orm\n##rf\n275\nさん\n##ware\n##リー\n980\nho\n##pro\ntext\n##era\n560\nbob\n227\n##ub\n##2008\n8891\nscp\navi\n##zen\n2022\nmi\nwu\nmuseum\nqvod\napache\nlake\njcb\n▲topaug\n★★★\nni\n##hr\nhill\n302\nne\nweibo\n490\nruby\n##ーシ\n##ヶ\n##row\n4d\n▲topjul\niv\n##ish\ngithub\n306\nmate\n312\n##スト\n##lot\n##ane\nandrew\nのハイト\n##tina\nt1\nrf\ned2k\n##vel\n##900\nway\nfinal\nりの\nns\n5a\n705\n197\n##メ\nsweet\nbytes\n##ene\n▲topjan\n231\n##cker\n##2007\n##px\n100g\ntopapp\n229\nhelpapp\nrs\nlow\n14k\ng4g\ncare\n630\nldquo\nあり\n##fork\nleave\nrm\nedition\n##gan\n##zon\n##qq\n▲topsep\n##google\n##ism\ngold\n224\nexplorer\n##zer\ntoyota\ncategory\nselect\nvisual\n##labels\nrestaurant\n##md\nposts\ns1\n##ico\nもっと\nangelababy\n123456\n217\nsports\ns3\nmbc\n1915\nしてくたさい\nshell\nx86\ncandy\n##new\nkbs\nface\nxl\n470\n##here\n4a\nswissinfo\nv8\n▲topfeb\ndram\n##ual\n##vice\n3a\n##wer\nsport\nq1\nios10\npublic\nint\ncard\n##ｃ\nep\nau\nrt\n##れた\n1080\nbill\n##mll\nkim\n３０\n460\nwan\n##uk\n##ミ\nx3\n298\n0t\nscott\n##ming\n239\ne5\n##3d\nh7n9\nworldcat\nbrown\n##あります\n##vo\n##led\n##580\n##ax\n249\n410\n##ert\nparis\n##～6\npolo\n925\n##lr\n599\n##ナ\ncapital\n##hing\nbank\ncv\n1g\n##chat\n##ｓ\n##たい\nadc\n##ule\n2m\n##ｅ\ndigital\nhotmail\n268\n##pad\n870\nbbq\nquot\n##ring\nbefore\nwali\n##まて\nmcu\n2k\n2b\nという\ncostco\n316\nnorth\n333\nswitch\n##city\n##ｐ\nphilips\n##mann\nmanagement\npanasonic\n##cl\n##vd
\n##ping\n##rge\nalice\n##lk\n##ましょう\ncss3\n##ney\nvision\nalpha\n##ular\n##400\n##tter\nlz\nにお\n##ありません\nmode\ngre\n1916\npci\n##tm\n237\n1～2\n##yan\n##そ\nについて\n##let\n##キ\nwork\nwar\ncoach\nah\nmary\n##ᅵ\nhuang\n##pt\na8\npt\nfollow\n##berry\n1895\n##ew\na5\nghost\n##ション\n##wn\n##og\nsouth\n##code\ngirls\n##rid\naction\nvilla\ngit\nr11\ntable\ngames\n##cket\nerror\n##anonymoussaid\n##ag\nhere\n##ame\n##gc\nqa\n##■\n##lis\ngmp\n##gin\nvmalife\n##cher\nyu\nwedding\n##tis\ndemo\ndragon\n530\nsoho\nsocial\nbye\n##rant\nriver\norz\nacer\n325\n##↑\n##ース\n##ats\n261\ndel\n##ven\n440\nups\n##ように\n##ター\n305\nvalue\nmacd\nyougou\n##dn\n661\n##ano\nll\n##urt\n##rent\ncontinue\nscript\n##wen\n##ect\npaper\n263\n319\nshift\n##chel\n##フト\n##cat\n258\nx5\nfox\n243\n##さん\ncar\naaa\n##blog\nloading\n##yn\n##tp\nkuso\n799\nsi\nsns\nイカせるテンマ\nヒンクテンマ3\nrmb\nvdc\nforest\ncentral\nprime\nhelp\nultra\n##rmb\n##ような\n241\nsquare\n688\n##しい\nのないフロクに\n##field\n##reen\n##ors\n##ju\nc1\nstart\n510\n##air\n##map\ncdn\n##wo\ncba\nstephen\nm8\n100km\n##get\nopera\n##base\n##ood\nvsa\ncom™\n##aw\n##ail\n251\nなのて\ncount\nt2\n##ᅡ\n##een\n2700\nhop\n##gp\nvsc\ntree\n##eg\n##ose\n816\n285\n##ories\n##shop\nalphago\nv4\n1909\nsimon\n##ᆼ\nfluke62max\nzip\nスホンサー\n##sta\nlouis\ncr\nbas\n##～10\nbc\n##yer\nhadoop\n##ube\n##wi\n1906\n0755\nhola\n##low\nplace\ncentre\n5v\nd3\n##fer\n252\n##750\n##media\n281\n540\n0l\nexchange\n262\nseries\n##ハー\n##san\neb\n##bank\n##ｋ\nq3\n##nge\n##mail\ntake\n##lp\n259\n1888\nclient\neast\ncache\nevent\nvincent\n##ールを\nきを\n##nse\nsui\n855\nadchoice\n##и\n##stry\n##なたの\n246\n##zone\nga\napps\nsea\n##ab\n248\ncisco\n##タ\n##rner\nkymco\n##care\ndha\n##pu\n##yi\nminkoff\nroyal\np1\nへの\nannie\n269\ncollection\nkpi\nplaystation\n257\nになります\n866\nbh\n##bar\nqueen\n505\nradio\n1904\nandy\narmani\n##xy\nmanager\niherb\n##ery\n##share\nspring\nraid\njohnson\n1908\n##ob\nvolvo\nhall\n##ball\nv6\nour\ntaylor\n##hk\nbi\n242\n##cp\nkate\nbo\nwater\ntechnology\n##rie\nサイトは\n277\n##ona\n##sl\nhpv\n303\ngtx\nhip\nrdquo\njayz\nstone\n##lex\n##rum\nnamespace\n##やり\n620\n##ale\n##atic\ndes\n##erson\n##ql\n##ves\n##type\nenter\n##この\n##てきます\nd2\n##168\n##mix\n##bian\nとの\na9\njj\nky\n##lc\naccess\nmovie\n##hc\nリストに\ntower\n##ration\n##mit\nます\n##nch\nua\ntel\nprefix\n##o2\n1907\n##point\n1901\nott\n～10\n##http\n##ury\nbaidu\n##ink\nmember\n##logy\nbigbang\nnownews\n##js\n##shot\n##tb\n##こと\n247\neba\n##tics\n##lus\nける\nv5\nspark\n##ama\nthere\n##ions\ngod\n##lls\n##down\nhiv\n##ress\nburberry\nday2\n##kv\n◆◆\njeff\nrelated\nfilm\nedit\njoseph\n283\n##ark\ncx\n32gb\norder\ng9\n30000\n##ans\n##tty\ns5\n##bee\nかあります\nthread\nxr\nbuy\nsh\n005\nland\nspotify\nmx\n##ari\n276\n##verse\n×email\nsf\nwhy\n##ことて\n244\n7headlines\nnego\nsunny\ndom\nexo\n401\n666\npositioning\nfit\nrgb\n##tton\n278\nkiss\nalexa\nadam\nlp\nみリストを\n##ｇ\nmp\n##ties\n##llow\namy\n##du\nnp\n002\ninstitute\n271\n##rth\n##lar\n2345\n590\n##des\nsidebar\n１５\nimax\nsite\n##cky\n##kit\n##ime\n##009\nseason\n323\n##fun\n##ンター\n##ひ\ngogoro\na7\npu\nlily\nfire\ntwd600\n##ッセーシを\nいて\n##vis\n30ml\n##cture\n##をお\ninformation\n##オ\nclose\nfriday\n##くれる\nyi\nnick\nてすか\n##tta\n##tel\n6500\n##lock\ncbd\neconomy\n254\nかお\n267\ntinker\ndouble\n375\n8gb\nvoice\n##app\noops\nchannel\ntoday\n985\n##right\nraw\nxyz\n##＋\njim\nedm\n##cent\n7500\nsupreme\n814\nds\n##its\n##asia\ndropbox\n##てすか\n##tti\nbooks\n272\n100ml\n##tle\n##ller\n##ken\n##more\n##boy\nsex\n309\n##dom\nt3\n##ider\n##なります\n##unch\n1903\n810\nfeel\n5500\n##かった\n##put\nにより\ns2\nmo\n##gh\nmen\nka\namoled\ndiv\n##
tr\n##n1\nport\nhoward\n##tags\nken\ndnf\n##nus\nadsense\n##а\nide\n##へ\nbuff\nthunder\n##town\n##ique\nhas\n##body\nauto\npin\n##erry\ntee\nてした\n295\nnumber\n##the\n##013\nobject\npsp\ncool\nudnbkk\n16gb\n##mic\nmiui\n##tro\nmost\nr2\n##alk\n##nity\n1880\n±0\n##いました\n428\ns4\nlaw\nversion\n##oa\nn1\nsgs\ndocomo\n##tf\n##ack\nhenry\nfc2\n##ded\n##sco\n##014\n##rite\n286\n0mm\nlinkedin\n##ada\n##now\nwii\n##ndy\nucbug\n##◎\nsputniknews\nlegalminer\n##ika\n##xp\n2gb\n##bu\nq10\noo\nb6\ncome\n##rman\ncheese\nming\nmaker\n##gm\nnikon\n##fig\nppi\nkelly\n##ります\njchere\nてきます\nted\nmd\n003\nfgo\ntech\n##tto\ndan\nsoc\n##gl\n##len\nhair\nearth\n640\n521\nimg\n##pper\n##a1\n##てきる\n##ロク\nacca\n##ition\n##ference\nsuite\n##ig\noutlook\n##mond\n##cation\n398\n##pr\n279\n101vip\n358\n##999\n282\n64gb\n3800\n345\nairport\n##over\n284\n##おり\njones\n##ith\nlab\n##su\n##いるのて\nco2\ntown\npiece\n##llo\nno1\nvmware\n24h\n##qi\nfocus\nreader\n##admin\n##ora\ntb\nfalse\n##log\n1898\nknow\nlan\n838\n##ces\nf4\n##ume\nmotel\nstop\n##oper\nna\nflickr\nnetcomponents\n##af\n##─\npose\nwilliams\nlocal\n##ound\n##cg\n##site\n##iko\nいお\n274\n5m\ngsm\ncon\n##ath\n1902\nfriends\n##hip\ncell\n317\n##rey\n780\ncream\n##cks\n012\n##dp\nfacebooktwitterpinterestgoogle\nsso\n324\nshtml\nsong\nswiss\n##mw\n##キンク\nlumia\nxdd\nstring\ntiffany\n522\nmarc\nられた\ninsee\nrussell\nsc\ndell\n##ations\nｏｋ\ncamera\n289\n##vs\n##flow\n##late\nclassic\n287\n##nter\nstay\ng1\nmtv\n512\n##ever\n##lab\n##nger\nqe\nsata\nryan\nd1\n50ml\ncms\n##cing\nsu\n292\n3300\neditor\n296\n##nap\nsecurity\nsunday\nassociation\n##ens\n##700\n##bra\nacg\n##かり\nsofascore\nとは\nmkv\n##ign\njonathan\ngary\nbuild\nlabels\n##oto\ntesla\nmoba\nqi\ngohappy\ngeneral\najax\n1024\n##かる\nサイト\nsociety\n##test\n##urs\nwps\nfedora\n##ich\nmozilla\n328\n##480\n##dr\nusa\nurn\n##lina\n##ｒ\ngrace\n##die\n##try\n##ader\n1250\n##なり\nelle\n570\n##chen\n##ᆯ\nprice\n##ten\nuhz\n##ough\neq\n##hen\nstates\npush\nsession\nbalance\nwow\n506\n##cus\n##py\nwhen\n##ward\n##ep\n34e\nwong\nlibrary\nprada\n##サイト\n##cle\nrunning\n##ree\n313\nck\ndate\nq4\n##ctive\n##ool\n##＞\nmk\n##ira\n##163\n388\ndie\nsecret\nrq\ndota\nbuffet\nは１ヶ\ne6\n##ez\npan\n368\nha\n##card\n##cha\n2a\n##さ\nalan\nday3\neye\nf3\n##end\nfrance\nkeep\nadi\nrna\ntvbs\n##ala\nsolo\nnova\n##え\n##tail\n##ょう\nsupport\n##ries\n##なる\n##ved\nbase\ncopy\niis\nfps\n##ways\nhero\nhgih\nprofile\nfish\nmu\nssh\nentertainment\nchang\n##wd\nclick\ncake\n##ond\npre\n##tom\nkic\npixel\n##ov\n##fl\nproduct\n6a\n##pd\ndear\n##gate\nes\nyumi\naudio\n##²\n##sky\necho\nbin\nwhere\n##ture\n329\n##ape\nfind\nsap\nisis\n##なと\nnand\n##101\n##load\n##ream\nband\na6\n525\nnever\n##post\nfestival\n50cm\n##we\n555\nguide\n314\nzenfone\n##ike\n335\ngd\nforum\njessica\nstrong\nalexander\n##ould\nsoftware\nallen\n##ious\nprogram\n360°\nelse\nlohasthree\n##gar\nすることかてきます\nplease\n##れます\nrc\n##ggle\n##ric\nbim\n50000\n##own\neclipse\n355\nbrian\n3ds\n##side\n061\n361\n##other\n##ける\n##tech\n##ator\n485\nengine\n##ged\n##ｔ\nplaza\n##fit\ncia\nngo\nwestbrook\nshi\ntbs\n50mm\n##みませんか\nsci\n291\nreuters\n##ily\ncontextlink\n##hn\naf\n##cil\nbridge\nvery\n##cel\n1890\ncambridge\n##ize\n15g\n##aid\n##data\n790\nfrm\n##head\naward\nbutler\n##sun\nmeta\n##mar\namerica\nps3\npuma\npmid\n##すか\nlc\n670\nkitchen\n##lic\nオーフン5\nきなしソフトサーヒス\nそして\nday1\nfuture\n★★★★\n##text\n##page\n##rris\npm1\n##ket\nfans\n##っています\n1001\nchristian\nbot\nkids\ntrackback\n##hai\nc3\ndisplay\n##hl\nn2\n1896\nidea\nさんも\n##sent\nairmail\n##ug\n##men\npwm\nけます\n028\n##lution\n369\n852\nawards\
nschemas\n354\nasics\nwikipedia\nfont\n##tional\n##vy\nc2\n293\n##れている\n##dget\n##ein\nっている\ncontact\npepper\nスキル\n339\n##～5\n294\n##uel\n##ument\n730\n##hang\nみてす\nq5\n##sue\nrain\n##ndi\nwei\nswatch\n##cept\nわせ\n331\npopular\n##ste\n##tag\np2\n501\ntrc\n1899\n##west\n##live\njustin\nhonda\nping\nmessenger\n##rap\nv9\n543\n##とは\nunity\nappqq\nはすへて\n025\nleo\n##tone\n##テ\n##ass\nuniqlo\n##010\n502\nher\njane\nmemory\nmoneydj\n##tical\nhuman\n12306\nしていると\n##m2\ncoc\nmiacare\n##mn\ntmt\n##core\nvim\nkk\n##may\nfan\ntarget\nuse\ntoo\n338\n435\n2050\n867\n737\nfast\n##2c\nservices\n##ope\nomega\nenergy\n##わ\npinkoi\n1a\n##なから\n##rain\njackson\n##ement\n##シャンルの\n374\n366\nそんな\np9\nrd\n##ᆨ\n1111\n##tier\n##vic\nzone\n##│\n385\n690\ndl\nisofix\ncpa\nm4\n322\nkimi\nめて\ndavis\n##lay\nlulu\n##uck\n050\nweeks\nqs\n##hop\n920\n##ｎ\nae\n##ear\n～5\neia\n405\n##fly\nkorea\njpeg\nboost\n##ship\nsmall\n##リア\n1860\neur\n297\n425\nvalley\n##iel\nsimple\n##ude\nrn\nk2\n##ena\nされます\nnon\npatrick\nしているから\n##ナー\nfeed\n5757\n30g\nprocess\nwell\nqqmei\n##thing\nthey\naws\nlu\npink\n##ters\n##kin\nまたは\nboard\n##vertisement\nwine\n##ien\nunicode\n##dge\nr1\n359\n##tant\nいを\n##twitter\n##3c\ncool1\nされる\n##れて\n##ｌ\nisp\n##012\nstandard\n45㎡2\n402\n##150\nmatt\n##fu\n326\n##iner\ngooglemsn\npixnetfacebookyahoo\n##ラン\nx7\n886\n##uce\nメーカー\nsao\n##ev\n##きました\n##file\n9678\n403\nxddd\nshirt\n6l\n##rio\n##hat\n3mm\ngivenchy\nya\nbang\n##lio\nmonday\ncrystal\nロクイン\n##abc\n336\nhead\n890\nubuntuforumwikilinuxpastechat\n##vc\n##～20\n##rity\ncnc\n7866\nipv6\nnull\n1897\n##ost\nyang\nimsean\ntiger\n##fet\n##ンス\n352\n##＝\ndji\n327\nji\nmaria\n##come\n##んて\nfoundation\n3100\n##beth\n##なった\n1m\n601\nactive\n##aft\n##don\n3p\nsr\n349\nemma\n##khz\nliving\n415\n353\n1889\n341\n709\n457\nsas\nx6\n##face\npptv\nx4\n##mate\nhan\nsophie\n##jing\n337\nfifa\n##mand\nother\nsale\ninwedding\n##gn\nてきちゃいます\n##mmy\n##pmlast\nbad\nnana\nnbc\nしてみてくたさいね\nなとはお\n##wu\n##かあります\n##あ\nnote7\nsingle\n##340\nせからこ\nしてくたさい♪この\nしにはとんとんワークケートを\nするとあなたにもっとマッチした\nならワークケートへ\nもみつかっちゃうかも\nワークケートの\n##bel\nwindow\n##dio\n##ht\nunion\nage\n382\n１４\n##ivity\n##ｙ\nコメント\ndomain\nneo\n##isa\n##lter\n5k\nf5\nsteven\n##cts\npowerpoint\ntft\nself\ng2\nft\n##テル\nzol\n##act\nmwc\n381\n343\nもう\nnbapop\n408\nてある\neds\nace\n##room\nprevious\nauthor\ntomtom\nil\n##ets\nhu\nfinancial\n☆☆☆\nっています\nbp\n5t\nchi\n1gb\n##hg\nfairmont\ncross\n008\ngay\nh2\nfunction\n##けて\n356\nalso\n1b\n625\n##ータ\n##raph\n1894\n3～5\n##ils\ni3\n334\navenue\n##host\nによる\n##bon\n##tsu\nmessage\nnavigation\n50g\nfintech\nh6\n##ことを\n8cm\n##ject\n##vas\n##firm\ncredit\n##wf\nxxxx\nform\n##nor\n##space\nhuawei\nplan\njson\nsbl\n##dc\nmachine\n921\n392\nwish\n##120\n##sol\nwindows7\nedward\n##ために\ndevelopment\nwashington\n##nsis\nlo\n818\n##sio\n##ym\n##bor\nplanet\n##～8\n##wt\nieee\ngpa\n##めて\ncamp\nann\ngm\n##tw\n##oka\nconnect\n##rss\n##work\n##atus\nwall\nchicken\nsoul\n2mm\n##times\nfa\n##ather\n##cord\n009\n##eep\nhitachi\ngui\nharry\n##pan\ne1\ndisney\n##press\n##ーション\nwind\n386\nfrigidaire\n##tl\nliu\nhsu\n332\nbasic\nvon\nev\nいた\nてきる\nスホンサーサイト\nlearning\n##ull\nexpedia\narchives\nchange\n##wei\nsanta\ncut\nins\n6gb\nturbo\nbrand\ncf1\n508\n004\nreturn\n747\n##rip\nh1\n##nis\n##をこ\n128gb\n##にお\n3t\napplication\nしており\nemc\nrx\n##oon\n384\nquick\n412\n15058\nwilson\nwing\nchapter\n##bug\nbeyond\n##cms\n##dar\n##oh\nzoom\ne2\ntrip\nsb\n##nba\nrcep\n342\naspx\nci\n080\ngc\ngnu\nめる\n##count\nadvanced\ndance\ndv\n##url\n##ging\n367\n8591\nam09\nshadow\nbattle\n346\n##ｉ\n##cia\n##という\nemily\n##のてす\n##tation
\nhost\nff\ntechorz\nsars\n##mini\n##mporary\n##ering\nnc\n4200\n798\n##next\ncma\n##mbps\n##gas\n##ift\n##dot\n##ィ\n455\n##～17\namana\n##りの\n426\n##ros\nir\n00㎡1\n##eet\n##ible\n##↓\n710\nˋ▽ˊ\n##aka\ndcs\niq\n##ｖ\nl1\n##lor\nmaggie\n##011\n##iu\n588\n##～1\n830\n##gt\n1tb\narticles\ncreate\n##burg\n##iki\ndatabase\nfantasy\n##rex\n##cam\ndlc\ndean\n##you\nhard\npath\ngaming\nvictoria\nmaps\ncb\n##lee\n##itor\noverchicstoretvhome\nsystems\n##xt\n416\np3\nsarah\n760\n##nan\n407\n486\nx9\ninstall\nsecond\n626\n##ann\n##ph\n##rcle\n##nic\n860\n##nar\nec\n##とう\n768\nmetro\nchocolate\n##rian\n～4\n##table\n##しています\nskin\n##sn\n395\nmountain\n##0mm\ninparadise\n6m\n7x24\nib\n4800\n##jia\neeworld\ncreative\ng5\ng3\n357\nparker\necfa\nvillage\nからの\n18000\nsylvia\nサーヒス\nhbl\n##ques\n##onsored\n##x2\n##きます\n##v4\n##tein\nie6\n383\n##stack\n389\nver\n##ads\n##baby\nsound\nbbe\n##110\n##lone\n##uid\nads\n022\ngundam\n351\nthinkpad\n006\nscrum\nmatch\n##ave\nmems\n##470\n##oy\n##なりました\n##talk\nglass\nlamigo\nspan\n##eme\njob\n##a5\njay\nwade\nkde\n498\n##lace\nocean\ntvg\n##covery\n##r3\n##ners\n##rea\njunior\nthink\n##aine\ncover\n##ision\n##sia\n↓↓\n##bow\nmsi\n413\n458\n406\n##love\n711\n801\nsoft\nz2\n##pl\n456\n1840\nmobil\nmind\n##uy\n427\nnginx\n##oi\nめた\n##rr\n6221\n##mple\n##sson\n##ーシてす\n371\n##nts\n91tv\ncomhd\ncrv3000\n##uard\n1868\n397\ndeep\nlost\nfield\ngallery\n##bia\nrate\nspf\nredis\ntraction\n930\nicloud\n011\nなら\nfe\njose\n372\n##tory\ninto\nsohu\nfx\n899\n379\nkicstart2\n##hia\nすく\n##～3\n##sit\nra\n２４\n##walk\n##xure\n500g\n##pact\npacific\nxa\nnatural\ncarlo\n##250\n##walker\n1850\n##can\ncto\ngigi\n516\n##サー\npen\n##hoo\nob\nmatlab\n##ｂ\n##yy\n13913459\n##iti\nmango\n##bbs\nsense\nc5\noxford\n##ニア\nwalker\njennifer\n##ola\ncourse\n##bre\n701\n##pus\n##rder\nlucky\n075\n##ぁ\nivy\nなお\n##nia\nsotheby\nside\n##ugh\njoy\n##orage\n##ush\n##bat\n##dt\n364\nr9\n##2d\n##gio\n511\ncountry\nwear\n##lax\n##～7\n##moon\n393\nseven\nstudy\n411\n348\nlonzo\n8k\n##ェ\nevolution\n##イフ\n##kk\ngs\nkd\n##レス\narduino\n344\nb12\n##lux\narpg\n##rdon\ncook\n##x5\ndark\nfive\n##als\n##ida\nとても\nsign\n362\n##ちの\nsomething\n20mm\n##nda\n387\n##posted\nfresh\ntf\n1870\n422\ncam\n##mine\n##skip\n##form\n##ssion\neducation\n394\n##tee\ndyson\nstage\n##jie\nwant\n##night\nepson\npack\nあります\n##ppy\nテリヘル\n##█\nwd\n##eh\n##rence\nleft\n##lvin\ngolden\nmhz\ndiscovery\n##trix\n##n2\nloft\n##uch\n##dra\n##sse\nspeed\n～1\n1mdb\nsorry\nwelcome\n##urn\nwave\ngaga\n##lmer\nteddy\n##160\nトラックハック\nせよ\n611\n##f2016\n378\nrp\n##sha\nrar\n##あなたに\n##きた\n840\nholiday\n##ュー\n373\n074\n##vg\n##nos\n##rail\ngartner\ngi\n6p\n##dium\nkit\n488\nb3\neco\n##ろう\n20g\nsean\n##stone\nautocad\nnu\n##np\nf16\nwrite\n029\nm5\n##ias\nimages\natp\n##dk\nfsm\n504\n1350\nve\n52kb\n##xxx\n##のに\n##cake\n414\nunit\nlim\nru\n1v\n##ification\npublished\nangela\n16g\nanalytics\nak\n##ｑ\n##nel\ngmt\n##icon\nagain\n##₂\n##bby\nios11\n445\nかこさいます\nwaze\nいてす\n##ハ\n9985\n##ust\n##ティー\nframework\n##007\niptv\ndelete\n52sykb\ncl\nwwdc\n027\n30cm\n##fw\n##ての\n1389\n##xon\nbrandt\n##ses\n##dragon\ntc\nvetements\nanne\nmonte\nmodern\nofficial\n##へて\n##ere\n##nne\n##oud\nもちろん\n５０\netnews\n##a2\n##graphy\n421\n863\n##ちゃん\n444\n##rtex\n##てお\nl2\n##gma\nmount\nccd\nたと\narchive\nmorning\ntan\nddos\ne7\n##ホ\nday4\n##ウ\ngis\n453\nits\n495\nfactory\nbruce\npg\n##ito\nってくたさい\nguest\ncdma\n##lling\n536\nn3\nしかし\n3～4\nmega\neyes\nro\n１３\nwomen\ndac\nchurch\n##jun\nsingapore\n##facebook\n6991\nstarbucks\n##tos\n##stin\n##shine\nzen\n##mu\ntina\n20℃\n1893\n##たけて\n503\n465\nrequ
est\n##gence\nqt\n##っ\n1886\n347\n363\nq7\n##zzi\ndiary\n##tore\n409\n##ead\n468\ncst\n##osa\ncanada\nagent\nva\n##jiang\n##ちは\n##ーク\n##lam\nsg\n##nix\n##sday\n##よって\ng6\n##master\nbing\n##zl\ncharlie\n１６\n8mm\nnb40\n##ーン\nthai\n##ルフ\nln284ct\n##itz\n##2f\nbonnie\n##food\n##lent\noriginals\n##stro\n##lts\n418\n∟∣\n##bscribe\nchildren\nntd\nyesstyle\n##かも\nhmv\n##tment\nd5\n2cm\narts\nsms\n##pn\n##я\n##いい\ntopios9\n539\nlifestyle\nvirtual\n##ague\nxz\n##deo\nmuji\n024\nunt\n##nnis\n##ᅩ\nfaq1\n1884\n396\n##ette\nfly\n64㎡\nはしめまして\n441\ncurry\n##pop\nのこ\nrelease\n##←\n##◆◆\n##cast\n073\nありな\n500ml\n##ews\n5c\n##stle\nios7\n##ima\n787\ndog\nlenovo\n##r4\nroger\n013\ncbs\nvornado\n100m\n417\n##desk\n##クok\n##ald\n1867\n9595\n2900\n##van\noil\n##ｘ\nsome\nbreak\ncommon\n##jy\n##lines\ng7\ntwice\n419\nella\nnano\nbelle\nにこ\n##mes\n##self\n##note\njb\n##ことかてきます\nbenz\n##との\n##ova\n451\nsave\n##wing\n##ますのて\nkai\nりは\n##hua\n##rect\nrainer\n##unge\n448\n##0m\nadsl\n##かな\nguestname\n##uma\n##kins\n##zu\ntokichoi\n##price\ncounty\n##med\n##mus\nrmk\n391\naddress\nvm\nえて\nopenload\n##group\n##hin\n##iginal\namg\nurban\n##oz\njobs\nemi\n##public\nbeautiful\n##sch\nalbum\n##dden\n##bell\njerry\nworks\nhostel\nmiller\n##drive\n##rmin\n##１０\n376\nboot\n828\n##370\n##fx\n##cm～\n1885\n##nome\n##ctionary\n##oman\n##lish\n##cr\n##hm\n433\n##how\n432\nfrancis\nxi\nc919\nb5\nevernote\n##uc\nvga\n##3000\ncoupe\n##urg\n##cca\n##uality\n019\n6g\nれる\nmulti\n##また\n##ett\nem\nhey\n##ani\n##tax\n##rma\ninside\nthan\n740\nleonnhurt\n##jin\nict\nれた\nbird\nnotes\n200mm\nくの\n##dical\n##lli\nresult\n442\niu\nee\n438\nsmap\ngopro\n##last\nyin\npure\n998\n32g\nけた\n5kg\n##dan\n##rame\nmama\n##oot\nbean\nmarketing\n##hur\n2l\nbella\nsync\nxuite\n##ground\n515\ndiscuz\n##getrelax\n##ince\n##bay\n##5s\ncj\n##イス\ngmat\napt\n##pass\njing\n##rix\nc4\nrich\n##とても\nniusnews\n##ello\nbag\n770\n##eting\n##mobile\n１８\nculture\n015\n##のてすか\n377\n1020\narea\n##ience\n616\ndetails\ngp\nuniversal\nsilver\ndit\nはお\nprivate\nddd\nu11\nkanshu\n##ified\nfung\n##nny\ndx\n##520\ntai\n475\n023\n##fr\n##lean\n3s\n##pin\n429\n##rin\n25000\nly\nrick\n##bility\nusb3\nbanner\n##baru\n##gion\nmetal\ndt\nvdf\n1871\nkarl\nqualcomm\nbear\n1010\noldid\nian\njo\n##tors\npopulation\n##ernel\n1882\nmmorpg\n##mv\n##bike\n603\n##©\nww\nfriend\n##ager\nexhibition\n##del\n##pods\nfpx\nstructure\n##free\n##tings\nkl\n##rley\n##copyright\n##mma\ncalifornia\n3400\norange\nyoga\n4l\ncanmake\nhoney\n##anda\n##コメント\n595\nnikkie\n##ルハイト\ndhl\npublishing\n##mall\n##gnet\n20cm\n513\n##クセス\n##┅\ne88\n970\n##dog\nfishbase\n##!\n##\"\n###\n##$\n##%\n##&\n##'\n##(\n##)\n##*\n##+\n##,\n##-\n##.\n##/\n##:\n##;\n##<\n##=\n##>\n##?\n##@\n##[\n##\\\n##]\n##^\n##_\n##{\n##|\n##}\n##~\n##£\n##¤\n##¥\n##§\n##«\n##±\n##³\n##µ\n##·\n##¹\n##º\n##»\n##¼\n##ß\n##æ\n##÷\n##ø\n##đ\n##ŋ\n##ɔ\n##ə\n##ɡ\n##ʰ\n##ˇ\n##ˈ\n##ˊ\n##ˋ\n##ˍ\n##ː\n##˙\n##˚\n##ˢ\n##α\n##β\n##γ\n##δ\n##ε\n##η\n##θ\n##ι\n##κ\n##λ\n##μ\n##ν\n##ο\n##π\n##ρ\n##ς\n##σ\n##τ\n##υ\n##φ\n##χ\n##ψ\n##б\n##в\n##г\n##д\n##е\n##ж\n##з\n##к\n##л\n##м\n##н\n##о\n##п\n##р\n##с\n##т\n##у\n##ф\n##х\n##ц\n##ч\n##ш\n##ы\n##ь\n##і\n##ا\n##ب\n##ة\n##ت\n##د\n##ر\n##س\n##ع\n##ل\n##م\n##ن\n##ه\n##و\n##ي\n##۩\n##ก\n##ง\n##น\n##ม\n##ย\n##ร\n##อ\n##า\n##เ\n##๑\n##་\n##ღ\n##ᄀ\n##ᄁ\n##ᄂ\n##ᄃ\n##ᄅ\n##ᄆ\n##ᄇ\n##ᄈ\n##ᄉ\n##ᄋ\n##ᄌ\n##ᄎ\n##ᄏ\n##ᄐ\n##ᄑ\n##ᄒ\n##ᅢ\n##ᅣ\n##ᅥ\n##ᅦ\n##ᅧ\n##ᅨ\n##ᅪ\n##ᅬ\n##ᅭ\n##ᅮ\n##ᅯ\n##ᅲ\n##ᅳ\n##ᅴ\n##ᆷ\n##ᆸ\n##ᆺ\n##ᆻ\n##ᗜ\n##ᵃ\n##ᵉ\n##ᵍ\n##ᵏ\n##ᵐ\n##ᵒ\n##ᵘ\n##‖\n##„\n##†\n##•\n##‥\n##‧\n## 
\n##‰\n##′\n##″\n##‹\n##›\n##※\n##‿\n##⁄\n##ⁱ\n##⁺\n##ⁿ\n##₁\n##₃\n##₄\n##€\n##№\n##ⅰ\n##ⅱ\n##ⅲ\n##ⅳ\n##ⅴ\n##↔\n##↗\n##↘\n##⇒\n##∀\n##−\n##∕\n##∙\n##√\n##∞\n##∟\n##∠\n##∣\n##∩\n##∮\n##∶\n##∼\n##∽\n##≈\n##≒\n##≡\n##≤\n##≥\n##≦\n##≧\n##≪\n##≫\n##⊙\n##⋅\n##⋈\n##⋯\n##⌒\n##①\n##②\n##③\n##④\n##⑤\n##⑥\n##⑦\n##⑧\n##⑨\n##⑩\n##⑴\n##⑵\n##⑶\n##⑷\n##⑸\n##⒈\n##⒉\n##⒊\n##⒋\n##ⓒ\n##ⓔ\n##ⓘ\n##━\n##┃\n##┆\n##┊\n##┌\n##└\n##├\n##┣\n##═\n##║\n##╚\n##╞\n##╠\n##╭\n##╮\n##╯\n##╰\n##╱\n##╳\n##▂\n##▃\n##▅\n##▇\n##▉\n##▋\n##▌\n##▍\n##▎\n##□\n##▪\n##▫\n##▬\n##△\n##▶\n##►\n##▽\n##◇\n##◕\n##◠\n##◢\n##◤\n##☀\n##☕\n##☞\n##☺\n##☼\n##♀\n##♂\n##♠\n##♡\n##♣\n##♦\n##♫\n##♬\n##✈\n##✔\n##✕\n##✖\n##✦\n##✨\n##✪\n##✰\n##✿\n##❀\n##➜\n##➤\n##⦿\n##、\n##。\n##〃\n##々\n##〇\n##〈\n##〉\n##《\n##》\n##「\n##」\n##『\n##』\n##【\n##】\n##〓\n##〔\n##〕\n##〖\n##〗\n##〜\n##〝\n##〞\n##ぃ\n##ぇ\n##ぬ\n##ふ\n##ほ\n##む\n##ゃ\n##ゅ\n##ゆ\n##ょ\n##゜\n##ゝ\n##ァ\n##ゥ\n##エ\n##ォ\n##ケ\n##サ\n##セ\n##ソ\n##ッ\n##ニ\n##ヌ\n##ネ\n##ノ\n##ヘ\n##モ\n##ャ\n##ヤ\n##ュ\n##ユ\n##ョ\n##ヨ\n##ワ\n##ヲ\n##・\n##ヽ\n##ㄅ\n##ㄆ\n##ㄇ\n##ㄉ\n##ㄋ\n##ㄌ\n##ㄍ\n##ㄎ\n##ㄏ\n##ㄒ\n##ㄚ\n##ㄛ\n##ㄞ\n##ㄟ\n##ㄢ\n##ㄤ\n##ㄥ\n##ㄧ\n##ㄨ\n##ㆍ\n##㈦\n##㊣\n##㗎\n##一\n##丁\n##七\n##万\n##丈\n##三\n##上\n##下\n##不\n##与\n##丐\n##丑\n##专\n##且\n##丕\n##世\n##丘\n##丙\n##业\n##丛\n##东\n##丝\n##丞\n##丟\n##両\n##丢\n##两\n##严\n##並\n##丧\n##丨\n##个\n##丫\n##中\n##丰\n##串\n##临\n##丶\n##丸\n##丹\n##为\n##主\n##丼\n##丽\n##举\n##丿\n##乂\n##乃\n##久\n##么\n##义\n##之\n##乌\n##乍\n##乎\n##乏\n##乐\n##乒\n##乓\n##乔\n##乖\n##乗\n##乘\n##乙\n##乜\n##九\n##乞\n##也\n##习\n##乡\n##书\n##乩\n##买\n##乱\n##乳\n##乾\n##亀\n##亂\n##了\n##予\n##争\n##事\n##二\n##于\n##亏\n##云\n##互\n##五\n##井\n##亘\n##亙\n##亚\n##些\n##亜\n##亞\n##亟\n##亡\n##亢\n##交\n##亥\n##亦\n##产\n##亨\n##亩\n##享\n##京\n##亭\n##亮\n##亲\n##亳\n##亵\n##人\n##亿\n##什\n##仁\n##仃\n##仄\n##仅\n##仆\n##仇\n##今\n##介\n##仍\n##从\n##仏\n##仑\n##仓\n##仔\n##仕\n##他\n##仗\n##付\n##仙\n##仝\n##仞\n##仟\n##代\n##令\n##以\n##仨\n##仪\n##们\n##仮\n##仰\n##仲\n##件\n##价\n##任\n##份\n##仿\n##企\n##伉\n##伊\n##伍\n##伎\n##伏\n##伐\n##休\n##伕\n##众\n##优\n##伙\n##会\n##伝\n##伞\n##伟\n##传\n##伢\n##伤\n##伦\n##伪\n##伫\n##伯\n##估\n##伴\n##伶\n##伸\n##伺\n##似\n##伽\n##佃\n##但\n##佇\n##佈\n##位\n##低\n##住\n##佐\n##佑\n##体\n##佔\n##何\n##佗\n##佘\n##余\n##佚\n##佛\n##作\n##佝\n##佞\n##佟\n##你\n##佢\n##佣\n##佤\n##佥\n##佩\n##佬\n##佯\n##佰\n##佳\n##併\n##佶\n##佻\n##佼\n##使\n##侃\n##侄\n##來\n##侈\n##例\n##侍\n##侏\n##侑\n##侖\n##侗\n##供\n##依\n##侠\n##価\n##侣\n##侥\n##侦\n##侧\n##侨\n##侬\n##侮\n##侯\n##侵\n##侶\n##侷\n##便\n##係\n##促\n##俄\n##俊\n##俎\n##俏\n##俐\n##俑\n##俗\n##俘\n##俚\n##保\n##俞\n##俟\n##俠\n##信\n##俨\n##俩\n##俪\n##俬\n##俭\n##修\n##俯\n##俱\n##俳\n##俸\n##俺\n##俾\n##倆\n##倉\n##個\n##倌\n##倍\n##倏\n##們\n##倒\n##倔\n##倖\n##倘\n##候\n##倚\n##倜\n##借\n##倡\n##値\n##倦\n##倩\n##倪\n##倫\n##倬\n##倭\n##倶\n##债\n##值\n##倾\n##偃\n##假\n##偈\n##偉\n##偌\n##偎\n##偏\n##偕\n##做\n##停\n##健\n##側\n##偵\n##偶\n##偷\n##偻\n##偽\n##偿\n##傀\n##傅\n##傍\n##傑\n##傘\n##備\n##傚\n##傢\n##傣\n##傥\n##储\n##傩\n##催\n##傭\n##傲\n##傳\n##債\n##傷\n##傻\n##傾\n##僅\n##働\n##像\n##僑\n##僕\n##僖\n##僚\n##僥\n##僧\n##僭\n##僮\n##僱\n##僵\n##價\n##僻\n##儀\n##儂\n##億\n##儆\n##儉\n##儋\n##儒\n##儕\n##儘\n##償\n##儡\n##優\n##儲\n##儷\n##儼\n##儿\n##兀\n##允\n##元\n##兄\n##充\n##兆\n##兇\n##先\n##光\n##克\n##兌\n##免\n##児\n##兑\n##兒\n##兔\n##兖\n##党\n##兜\n##兢\n##入\n##內\n##全\n##兩\n##八\n##公\n##六\n##兮\n##兰\n##共\n##兲\n##关\n##兴\n##兵\n##其\n##具\n##典\n##兹\n##养\n##兼\n##兽\n##冀\n##内\n##円\n##冇\n##冈\n##冉\n##冊\n##册\n##再\n##冏\n##冒\n##冕\n##冗\n##写\n##军\n##农\n##冠\n##冢\n##冤\n##冥\n##冨\n##冪\n##冬\n##冯\n##冰\n##冲\n##决\n##况\n##冶\n##冷\n##冻\n##冼\n##冽\n##冾\n##净\n##凄\n##准\n##凇\n##凈\n##凉\n##凋\n##凌\n##凍\n##减\n##凑\n##凛\n##凜\n##凝\n##几\n##凡\n##凤\n##処\n##凪\n##凭\n##凯\n##凰\n##凱\n##凳\n##凶\n##凸\n##凹\n##出\n##击\n##函\n##凿\n##刀\n##刁\n##刃\n##分\n##切\n##刈\n##刊\n##刍\n##刎\n##刑\n##划\n##列\n##刘\n##则
\n##刚\n##创\n##初\n##删\n##判\n##別\n##刨\n##利\n##刪\n##别\n##刮\n##到\n##制\n##刷\n##券\n##刹\n##刺\n##刻\n##刽\n##剁\n##剂\n##剃\n##則\n##剉\n##削\n##剋\n##剌\n##前\n##剎\n##剐\n##剑\n##剔\n##剖\n##剛\n##剜\n##剝\n##剣\n##剤\n##剥\n##剧\n##剩\n##剪\n##副\n##割\n##創\n##剷\n##剽\n##剿\n##劃\n##劇\n##劈\n##劉\n##劊\n##劍\n##劏\n##劑\n##力\n##劝\n##办\n##功\n##加\n##务\n##劣\n##动\n##助\n##努\n##劫\n##劭\n##励\n##劲\n##劳\n##労\n##劵\n##効\n##劾\n##势\n##勁\n##勃\n##勇\n##勉\n##勋\n##勐\n##勒\n##動\n##勖\n##勘\n##務\n##勛\n##勝\n##勞\n##募\n##勢\n##勤\n##勧\n##勳\n##勵\n##勸\n##勺\n##勻\n##勾\n##勿\n##匀\n##包\n##匆\n##匈\n##匍\n##匐\n##匕\n##化\n##北\n##匙\n##匝\n##匠\n##匡\n##匣\n##匪\n##匮\n##匯\n##匱\n##匹\n##区\n##医\n##匾\n##匿\n##區\n##十\n##千\n##卅\n##升\n##午\n##卉\n##半\n##卍\n##华\n##协\n##卑\n##卒\n##卓\n##協\n##单\n##卖\n##南\n##単\n##博\n##卜\n##卞\n##卟\n##占\n##卡\n##卢\n##卤\n##卦\n##卧\n##卫\n##卮\n##卯\n##印\n##危\n##即\n##却\n##卵\n##卷\n##卸\n##卻\n##卿\n##厂\n##厄\n##厅\n##历\n##厉\n##压\n##厌\n##厕\n##厘\n##厚\n##厝\n##原\n##厢\n##厥\n##厦\n##厨\n##厩\n##厭\n##厮\n##厲\n##厳\n##去\n##县\n##叁\n##参\n##參\n##又\n##叉\n##及\n##友\n##双\n##反\n##収\n##发\n##叔\n##取\n##受\n##变\n##叙\n##叛\n##叟\n##叠\n##叡\n##叢\n##口\n##古\n##句\n##另\n##叨\n##叩\n##只\n##叫\n##召\n##叭\n##叮\n##可\n##台\n##叱\n##史\n##右\n##叵\n##叶\n##号\n##司\n##叹\n##叻\n##叼\n##叽\n##吁\n##吃\n##各\n##吆\n##合\n##吉\n##吊\n##吋\n##同\n##名\n##后\n##吏\n##吐\n##向\n##吒\n##吓\n##吕\n##吖\n##吗\n##君\n##吝\n##吞\n##吟\n##吠\n##吡\n##否\n##吧\n##吨\n##吩\n##含\n##听\n##吭\n##吮\n##启\n##吱\n##吳\n##吴\n##吵\n##吶\n##吸\n##吹\n##吻\n##吼\n##吽\n##吾\n##呀\n##呂\n##呃\n##呆\n##呈\n##告\n##呋\n##呎\n##呐\n##呓\n##呕\n##呗\n##员\n##呛\n##呜\n##呢\n##呤\n##呦\n##周\n##呱\n##呲\n##味\n##呵\n##呷\n##呸\n##呻\n##呼\n##命\n##咀\n##咁\n##咂\n##咄\n##咆\n##咋\n##和\n##咎\n##咏\n##咐\n##咒\n##咔\n##咕\n##咖\n##咗\n##咘\n##咙\n##咚\n##咛\n##咣\n##咤\n##咦\n##咧\n##咨\n##咩\n##咪\n##咫\n##咬\n##咭\n##咯\n##咱\n##咲\n##咳\n##咸\n##咻\n##咽\n##咿\n##哀\n##品\n##哂\n##哄\n##哆\n##哇\n##哈\n##哉\n##哋\n##哌\n##响\n##哎\n##哏\n##哐\n##哑\n##哒\n##哔\n##哗\n##哟\n##員\n##哥\n##哦\n##哧\n##哨\n##哩\n##哪\n##哭\n##哮\n##哲\n##哺\n##哼\n##哽\n##唁\n##唄\n##唆\n##唇\n##唉\n##唏\n##唐\n##唑\n##唔\n##唠\n##唤\n##唧\n##唬\n##售\n##唯\n##唰\n##唱\n##唳\n##唷\n##唸\n##唾\n##啃\n##啄\n##商\n##啉\n##啊\n##問\n##啓\n##啕\n##啖\n##啜\n##啞\n##啟\n##啡\n##啤\n##啥\n##啦\n##啧\n##啪\n##啫\n##啬\n##啮\n##啰\n##啱\n##啲\n##啵\n##啶\n##啷\n##啸\n##啻\n##啼\n##啾\n##喀\n##喂\n##喃\n##善\n##喆\n##喇\n##喉\n##喊\n##喋\n##喎\n##喏\n##喔\n##喘\n##喙\n##喚\n##喜\n##喝\n##喟\n##喧\n##喪\n##喫\n##喬\n##單\n##喰\n##喱\n##喲\n##喳\n##喵\n##営\n##喷\n##喹\n##喺\n##喻\n##喽\n##嗅\n##嗆\n##嗇\n##嗎\n##嗑\n##嗒\n##嗓\n##嗔\n##嗖\n##嗚\n##嗜\n##嗝\n##嗟\n##嗡\n##嗣\n##嗤\n##嗦\n##嗨\n##嗪\n##嗬\n##嗯\n##嗰\n##嗲\n##嗳\n##嗶\n##嗷\n##嗽\n##嘀\n##嘅\n##嘆\n##嘈\n##嘉\n##嘌\n##嘍\n##嘎\n##嘔\n##嘖\n##嘗\n##嘘\n##嘚\n##嘛\n##嘜\n##嘞\n##嘟\n##嘢\n##嘣\n##嘤\n##嘧\n##嘩\n##嘭\n##嘮\n##嘯\n##嘰\n##嘱\n##嘲\n##嘴\n##嘶\n##嘸\n##嘹\n##嘻\n##嘿\n##噁\n##噌\n##噎\n##噓\n##噔\n##噗\n##噙\n##噜\n##噠\n##噢\n##噤\n##器\n##噩\n##噪\n##噬\n##噱\n##噴\n##噶\n##噸\n##噹\n##噻\n##噼\n##嚀\n##嚇\n##嚎\n##嚏\n##嚐\n##嚓\n##嚕\n##嚟\n##嚣\n##嚥\n##嚨\n##嚮\n##嚴\n##嚷\n##嚼\n##囂\n##囉\n##囊\n##囍\n##囑\n##囔\n##囗\n##囚\n##四\n##囝\n##回\n##囟\n##因\n##囡\n##团\n##団\n##囤\n##囧\n##囪\n##囫\n##园\n##困\n##囱\n##囲\n##図\n##围\n##囹\n##固\n##国\n##图\n##囿\n##圃\n##圄\n##圆\n##圈\n##國\n##圍\n##圏\n##園\n##圓\n##圖\n##團\n##圜\n##土\n##圣\n##圧\n##在\n##圩\n##圭\n##地\n##圳\n##场\n##圻\n##圾\n##址\n##坂\n##均\n##坊\n##坍\n##坎\n##坏\n##坐\n##坑\n##块\n##坚\n##坛\n##坝\n##坞\n##坟\n##坠\n##坡\n##坤\n##坦\n##坨\n##坪\n##坯\n##坳\n##坵\n##坷\n##垂\n##垃\n##垄\n##型\n##垒\n##垚\n##垛\n##垠\n##垢\n##垣\n##垦\n##垩\n##垫\n##垭\n##垮\n##垵\n##埂\n##埃\n##埋\n##城\n##埔\n##埕\n##埗\n##域\n##埠\n##埤\n##埵\n##執\n##埸\n##培\n##基\n##埼\n##堀\n##堂\n##堃\n##堅\n##堆\n##堇\n##堑\n##堕\n##堙\n##堡\n##堤\n##堪\n##堯\n##堰\n##報\n##場\n##堵\n##堺\n##堿\n##塊\n##塌\n##塑\n##塔\n##塗\n##塘\n##塚\n##塞\n##塢\n##塩\n##填\n##塬\n##塭\n##塵\n##塾\n##墀\n##境\n##墅\n##墉\n##墊\n##墒\n##墓
\n##増\n##墘\n##墙\n##墜\n##增\n##墟\n##墨\n##墩\n##墮\n##墳\n##墻\n##墾\n##壁\n##壅\n##壆\n##壇\n##壊\n##壑\n##壓\n##壕\n##壘\n##壞\n##壟\n##壢\n##壤\n##壩\n##士\n##壬\n##壮\n##壯\n##声\n##売\n##壳\n##壶\n##壹\n##壺\n##壽\n##处\n##备\n##変\n##复\n##夏\n##夔\n##夕\n##外\n##夙\n##多\n##夜\n##够\n##夠\n##夢\n##夥\n##大\n##天\n##太\n##夫\n##夭\n##央\n##夯\n##失\n##头\n##夷\n##夸\n##夹\n##夺\n##夾\n##奂\n##奄\n##奇\n##奈\n##奉\n##奋\n##奎\n##奏\n##奐\n##契\n##奔\n##奕\n##奖\n##套\n##奘\n##奚\n##奠\n##奢\n##奥\n##奧\n##奪\n##奬\n##奮\n##女\n##奴\n##奶\n##奸\n##她\n##好\n##如\n##妃\n##妄\n##妆\n##妇\n##妈\n##妊\n##妍\n##妒\n##妓\n##妖\n##妘\n##妙\n##妝\n##妞\n##妣\n##妤\n##妥\n##妨\n##妩\n##妪\n##妮\n##妲\n##妳\n##妹\n##妻\n##妾\n##姆\n##姉\n##姊\n##始\n##姍\n##姐\n##姑\n##姒\n##姓\n##委\n##姗\n##姚\n##姜\n##姝\n##姣\n##姥\n##姦\n##姨\n##姪\n##姫\n##姬\n##姹\n##姻\n##姿\n##威\n##娃\n##娄\n##娅\n##娆\n##娇\n##娉\n##娑\n##娓\n##娘\n##娛\n##娜\n##娟\n##娠\n##娣\n##娥\n##娩\n##娱\n##娲\n##娴\n##娶\n##娼\n##婀\n##婁\n##婆\n##婉\n##婊\n##婕\n##婚\n##婢\n##婦\n##婧\n##婪\n##婭\n##婴\n##婵\n##婶\n##婷\n##婺\n##婿\n##媒\n##媚\n##媛\n##媞\n##媧\n##媲\n##媳\n##媽\n##媾\n##嫁\n##嫂\n##嫉\n##嫌\n##嫑\n##嫔\n##嫖\n##嫘\n##嫚\n##嫡\n##嫣\n##嫦\n##嫩\n##嫲\n##嫵\n##嫻\n##嬅\n##嬉\n##嬌\n##嬗\n##嬛\n##嬢\n##嬤\n##嬪\n##嬰\n##嬴\n##嬷\n##嬸\n##嬿\n##孀\n##孃\n##子\n##孑\n##孔\n##孕\n##孖\n##字\n##存\n##孙\n##孚\n##孛\n##孜\n##孝\n##孟\n##孢\n##季\n##孤\n##学\n##孩\n##孪\n##孫\n##孬\n##孰\n##孱\n##孳\n##孵\n##學\n##孺\n##孽\n##孿\n##宁\n##它\n##宅\n##宇\n##守\n##安\n##宋\n##完\n##宏\n##宓\n##宕\n##宗\n##官\n##宙\n##定\n##宛\n##宜\n##宝\n##实\n##実\n##宠\n##审\n##客\n##宣\n##室\n##宥\n##宦\n##宪\n##宫\n##宮\n##宰\n##害\n##宴\n##宵\n##家\n##宸\n##容\n##宽\n##宾\n##宿\n##寂\n##寄\n##寅\n##密\n##寇\n##富\n##寐\n##寒\n##寓\n##寛\n##寝\n##寞\n##察\n##寡\n##寢\n##寥\n##實\n##寧\n##寨\n##審\n##寫\n##寬\n##寮\n##寰\n##寵\n##寶\n##寸\n##对\n##寺\n##寻\n##导\n##対\n##寿\n##封\n##専\n##射\n##将\n##將\n##專\n##尉\n##尊\n##尋\n##對\n##導\n##小\n##少\n##尔\n##尕\n##尖\n##尘\n##尚\n##尝\n##尤\n##尧\n##尬\n##就\n##尴\n##尷\n##尸\n##尹\n##尺\n##尻\n##尼\n##尽\n##尾\n##尿\n##局\n##屁\n##层\n##屄\n##居\n##屆\n##屈\n##屉\n##届\n##屋\n##屌\n##屍\n##屎\n##屏\n##屐\n##屑\n##展\n##屜\n##属\n##屠\n##屡\n##屢\n##層\n##履\n##屬\n##屯\n##山\n##屹\n##屿\n##岀\n##岁\n##岂\n##岌\n##岐\n##岑\n##岔\n##岖\n##岗\n##岘\n##岙\n##岚\n##岛\n##岡\n##岩\n##岫\n##岬\n##岭\n##岱\n##岳\n##岷\n##岸\n##峇\n##峋\n##峒\n##峙\n##峡\n##峤\n##峥\n##峦\n##峨\n##峪\n##峭\n##峯\n##峰\n##峴\n##島\n##峻\n##峽\n##崁\n##崂\n##崆\n##崇\n##崎\n##崑\n##崔\n##崖\n##崗\n##崙\n##崛\n##崧\n##崩\n##崭\n##崴\n##崽\n##嵇\n##嵊\n##嵋\n##嵌\n##嵐\n##嵘\n##嵩\n##嵬\n##嵯\n##嶂\n##嶄\n##嶇\n##嶋\n##嶙\n##嶺\n##嶼\n##嶽\n##巅\n##巍\n##巒\n##巔\n##巖\n##川\n##州\n##巡\n##巢\n##工\n##左\n##巧\n##巨\n##巩\n##巫\n##差\n##己\n##已\n##巳\n##巴\n##巷\n##巻\n##巽\n##巾\n##巿\n##币\n##市\n##布\n##帅\n##帆\n##师\n##希\n##帐\n##帑\n##帕\n##帖\n##帘\n##帚\n##帛\n##帜\n##帝\n##帥\n##带\n##帧\n##師\n##席\n##帮\n##帯\n##帰\n##帳\n##帶\n##帷\n##常\n##帼\n##帽\n##幀\n##幂\n##幄\n##幅\n##幌\n##幔\n##幕\n##幟\n##幡\n##幢\n##幣\n##幫\n##干\n##平\n##年\n##并\n##幸\n##幹\n##幺\n##幻\n##幼\n##幽\n##幾\n##广\n##庁\n##広\n##庄\n##庆\n##庇\n##床\n##序\n##庐\n##库\n##应\n##底\n##庖\n##店\n##庙\n##庚\n##府\n##庞\n##废\n##庠\n##度\n##座\n##庫\n##庭\n##庵\n##庶\n##康\n##庸\n##庹\n##庾\n##廁\n##廂\n##廃\n##廈\n##廉\n##廊\n##廓\n##廖\n##廚\n##廝\n##廟\n##廠\n##廢\n##廣\n##廬\n##廳\n##延\n##廷\n##建\n##廿\n##开\n##弁\n##异\n##弃\n##弄\n##弈\n##弊\n##弋\n##式\n##弑\n##弒\n##弓\n##弔\n##引\n##弗\n##弘\n##弛\n##弟\n##张\n##弥\n##弦\n##弧\n##弩\n##弭\n##弯\n##弱\n##張\n##強\n##弹\n##强\n##弼\n##弾\n##彅\n##彆\n##彈\n##彌\n##彎\n##归\n##当\n##录\n##彗\n##彙\n##彝\n##形\n##彤\n##彥\n##彦\n##彧\n##彩\n##彪\n##彫\n##彬\n##彭\n##彰\n##影\n##彷\n##役\n##彻\n##彼\n##彿\n##往\n##征\n##径\n##待\n##徇\n##很\n##徉\n##徊\n##律\n##後\n##徐\n##徑\n##徒\n##従\n##徕\n##得\n##徘\n##徙\n##徜\n##從\n##徠\n##御\n##徨\n##復\n##循\n##徬\n##微\n##徳\n##徴\n##徵\n##德\n##徹\n##徼\n##徽\n##心\n##必\n##忆\n##忌\n##忍\n##忏\n##忐\n##忑\n##忒\n##忖\n##志\n##忘\n##忙\n##応\n##忠\n##忡\n##忤\n##忧\n##忪\n##快\n##忱\n##念\n##忻\n##忽\n##忿\n##怀\n##态
\n##怂\n##怅\n##怆\n##怎\n##怏\n##怒\n##怔\n##怕\n##怖\n##怙\n##怜\n##思\n##怠\n##怡\n##急\n##怦\n##性\n##怨\n##怪\n##怯\n##怵\n##总\n##怼\n##恁\n##恃\n##恆\n##恋\n##恍\n##恐\n##恒\n##恕\n##恙\n##恚\n##恢\n##恣\n##恤\n##恥\n##恨\n##恩\n##恪\n##恫\n##恬\n##恭\n##息\n##恰\n##恳\n##恵\n##恶\n##恸\n##恺\n##恻\n##恼\n##恿\n##悄\n##悅\n##悉\n##悌\n##悍\n##悔\n##悖\n##悚\n##悟\n##悠\n##患\n##悦\n##您\n##悩\n##悪\n##悬\n##悯\n##悱\n##悲\n##悴\n##悵\n##悶\n##悸\n##悻\n##悼\n##悽\n##情\n##惆\n##惇\n##惊\n##惋\n##惑\n##惕\n##惘\n##惚\n##惜\n##惟\n##惠\n##惡\n##惦\n##惧\n##惨\n##惩\n##惫\n##惬\n##惭\n##惮\n##惯\n##惰\n##惱\n##想\n##惴\n##惶\n##惹\n##惺\n##愁\n##愆\n##愈\n##愉\n##愍\n##意\n##愕\n##愚\n##愛\n##愜\n##感\n##愣\n##愤\n##愧\n##愫\n##愷\n##愿\n##慄\n##慈\n##態\n##慌\n##慎\n##慑\n##慕\n##慘\n##慚\n##慟\n##慢\n##慣\n##慧\n##慨\n##慫\n##慮\n##慰\n##慳\n##慵\n##慶\n##慷\n##慾\n##憂\n##憊\n##憋\n##憎\n##憐\n##憑\n##憔\n##憚\n##憤\n##憧\n##憨\n##憩\n##憫\n##憬\n##憲\n##憶\n##憾\n##懂\n##懇\n##懈\n##應\n##懊\n##懋\n##懑\n##懒\n##懦\n##懲\n##懵\n##懶\n##懷\n##懸\n##懺\n##懼\n##懾\n##懿\n##戀\n##戈\n##戊\n##戌\n##戍\n##戎\n##戏\n##成\n##我\n##戒\n##戕\n##或\n##战\n##戚\n##戛\n##戟\n##戡\n##戦\n##截\n##戬\n##戮\n##戰\n##戲\n##戳\n##戴\n##戶\n##户\n##戸\n##戻\n##戾\n##房\n##所\n##扁\n##扇\n##扈\n##扉\n##手\n##才\n##扎\n##扑\n##扒\n##打\n##扔\n##払\n##托\n##扛\n##扣\n##扦\n##执\n##扩\n##扪\n##扫\n##扬\n##扭\n##扮\n##扯\n##扰\n##扱\n##扳\n##扶\n##批\n##扼\n##找\n##承\n##技\n##抄\n##抉\n##把\n##抑\n##抒\n##抓\n##投\n##抖\n##抗\n##折\n##抚\n##抛\n##抜\n##択\n##抟\n##抠\n##抡\n##抢\n##护\n##报\n##抨\n##披\n##抬\n##抱\n##抵\n##抹\n##押\n##抽\n##抿\n##拂\n##拄\n##担\n##拆\n##拇\n##拈\n##拉\n##拋\n##拌\n##拍\n##拎\n##拐\n##拒\n##拓\n##拔\n##拖\n##拗\n##拘\n##拙\n##拚\n##招\n##拜\n##拟\n##拡\n##拢\n##拣\n##拥\n##拦\n##拧\n##拨\n##择\n##括\n##拭\n##拮\n##拯\n##拱\n##拳\n##拴\n##拷\n##拼\n##拽\n##拾\n##拿\n##持\n##挂\n##指\n##挈\n##按\n##挎\n##挑\n##挖\n##挙\n##挚\n##挛\n##挝\n##挞\n##挟\n##挠\n##挡\n##挣\n##挤\n##挥\n##挨\n##挪\n##挫\n##振\n##挲\n##挹\n##挺\n##挽\n##挾\n##捂\n##捅\n##捆\n##捉\n##捋\n##捌\n##捍\n##捎\n##捏\n##捐\n##捕\n##捞\n##损\n##捡\n##换\n##捣\n##捧\n##捨\n##捩\n##据\n##捱\n##捲\n##捶\n##捷\n##捺\n##捻\n##掀\n##掂\n##掃\n##掇\n##授\n##掉\n##掌\n##掏\n##掐\n##排\n##掖\n##掘\n##掙\n##掛\n##掠\n##採\n##探\n##掣\n##接\n##控\n##推\n##掩\n##措\n##掬\n##掰\n##掲\n##掳\n##掴\n##掷\n##掸\n##掺\n##揀\n##揃\n##揄\n##揆\n##揉\n##揍\n##描\n##提\n##插\n##揖\n##揚\n##換\n##握\n##揣\n##揩\n##揪\n##揭\n##揮\n##援\n##揶\n##揸\n##揹\n##揽\n##搀\n##搁\n##搂\n##搅\n##損\n##搏\n##搐\n##搓\n##搔\n##搖\n##搗\n##搜\n##搞\n##搡\n##搪\n##搬\n##搭\n##搵\n##搶\n##携\n##搽\n##摀\n##摁\n##摄\n##摆\n##摇\n##摈\n##摊\n##摒\n##摔\n##摘\n##摞\n##摟\n##摧\n##摩\n##摯\n##摳\n##摸\n##摹\n##摺\n##摻\n##撂\n##撃\n##撅\n##撇\n##撈\n##撐\n##撑\n##撒\n##撓\n##撕\n##撚\n##撞\n##撤\n##撥\n##撩\n##撫\n##撬\n##播\n##撮\n##撰\n##撲\n##撵\n##撷\n##撸\n##撻\n##撼\n##撿\n##擀\n##擁\n##擂\n##擄\n##擅\n##擇\n##擊\n##擋\n##操\n##擎\n##擒\n##擔\n##擘\n##據\n##擞\n##擠\n##擡\n##擢\n##擦\n##擬\n##擰\n##擱\n##擲\n##擴\n##擷\n##擺\n##擼\n##擾\n##攀\n##攏\n##攒\n##攔\n##攘\n##攙\n##攜\n##攝\n##攞\n##攢\n##攣\n##攤\n##攥\n##攪\n##攫\n##攬\n##支\n##收\n##攸\n##改\n##攻\n##放\n##政\n##故\n##效\n##敌\n##敍\n##敎\n##敏\n##救\n##敕\n##敖\n##敗\n##敘\n##教\n##敛\n##敝\n##敞\n##敢\n##散\n##敦\n##敬\n##数\n##敲\n##整\n##敵\n##敷\n##數\n##斂\n##斃\n##文\n##斋\n##斌\n##斎\n##斐\n##斑\n##斓\n##斗\n##料\n##斛\n##斜\n##斟\n##斡\n##斤\n##斥\n##斧\n##斩\n##斫\n##斬\n##断\n##斯\n##新\n##斷\n##方\n##於\n##施\n##旁\n##旃\n##旅\n##旋\n##旌\n##旎\n##族\n##旖\n##旗\n##无\n##既\n##日\n##旦\n##旧\n##旨\n##早\n##旬\n##旭\n##旮\n##旱\n##时\n##旷\n##旺\n##旻\n##昀\n##昂\n##昆\n##昇\n##昉\n##昊\n##昌\n##明\n##昏\n##易\n##昔\n##昕\n##昙\n##星\n##映\n##春\n##昧\n##昨\n##昭\n##是\n##昱\n##昴\n##昵\n##昶\n##昼\n##显\n##晁\n##時\n##晃\n##晉\n##晋\n##晌\n##晏\n##晒\n##晓\n##晔\n##晕\n##晖\n##晗\n##晚\n##晝\n##晞\n##晟\n##晤\n##晦\n##晨\n##晩\n##普\n##景\n##晰\n##晴\n##晶\n##晷\n##智\n##晾\n##暂\n##暄\n##暇\n##暈\n##暉\n##暌\n##暐\n##暑\n##暖\n##暗\n##暝\n##暢\n##暧\n##暨\n##暫\n##暮\n##暱\n##暴\n##暸\n##暹\n##曄\n##曆\n##曇\n##曉\n##曖\n##曙\n##曜\n##曝\n##曠\n##曦\n##曬\n##曰\n##曲
\n##曳\n##更\n##書\n##曹\n##曼\n##曾\n##替\n##最\n##會\n##月\n##有\n##朋\n##服\n##朐\n##朔\n##朕\n##朗\n##望\n##朝\n##期\n##朦\n##朧\n##木\n##未\n##末\n##本\n##札\n##朮\n##术\n##朱\n##朴\n##朵\n##机\n##朽\n##杀\n##杂\n##权\n##杆\n##杈\n##杉\n##李\n##杏\n##材\n##村\n##杓\n##杖\n##杜\n##杞\n##束\n##杠\n##条\n##来\n##杨\n##杭\n##杯\n##杰\n##東\n##杳\n##杵\n##杷\n##杼\n##松\n##板\n##极\n##构\n##枇\n##枉\n##枋\n##析\n##枕\n##林\n##枚\n##果\n##枝\n##枢\n##枣\n##枪\n##枫\n##枭\n##枯\n##枰\n##枱\n##枳\n##架\n##枷\n##枸\n##柄\n##柏\n##某\n##柑\n##柒\n##染\n##柔\n##柘\n##柚\n##柜\n##柞\n##柠\n##柢\n##查\n##柩\n##柬\n##柯\n##柱\n##柳\n##柴\n##柵\n##査\n##柿\n##栀\n##栃\n##栄\n##栅\n##标\n##栈\n##栉\n##栋\n##栎\n##栏\n##树\n##栓\n##栖\n##栗\n##校\n##栩\n##株\n##样\n##核\n##根\n##格\n##栽\n##栾\n##桀\n##桁\n##桂\n##桃\n##桅\n##框\n##案\n##桉\n##桌\n##桎\n##桐\n##桑\n##桓\n##桔\n##桜\n##桠\n##桡\n##桢\n##档\n##桥\n##桦\n##桧\n##桨\n##桩\n##桶\n##桿\n##梁\n##梅\n##梆\n##梏\n##梓\n##梗\n##條\n##梟\n##梢\n##梦\n##梧\n##梨\n##梭\n##梯\n##械\n##梳\n##梵\n##梶\n##检\n##棂\n##棄\n##棉\n##棋\n##棍\n##棒\n##棕\n##棗\n##棘\n##棚\n##棟\n##棠\n##棣\n##棧\n##森\n##棱\n##棲\n##棵\n##棹\n##棺\n##椁\n##椅\n##椋\n##植\n##椎\n##椒\n##検\n##椪\n##椭\n##椰\n##椹\n##椽\n##椿\n##楂\n##楊\n##楓\n##楔\n##楚\n##楝\n##楞\n##楠\n##楣\n##楨\n##楫\n##業\n##楮\n##極\n##楷\n##楸\n##楹\n##楼\n##楽\n##概\n##榄\n##榆\n##榈\n##榉\n##榔\n##榕\n##榖\n##榛\n##榜\n##榨\n##榫\n##榭\n##榮\n##榱\n##榴\n##榷\n##榻\n##槁\n##槃\n##構\n##槌\n##槍\n##槎\n##槐\n##槓\n##様\n##槛\n##槟\n##槤\n##槭\n##槲\n##槳\n##槻\n##槽\n##槿\n##樁\n##樂\n##樊\n##樑\n##樓\n##標\n##樞\n##樟\n##模\n##樣\n##権\n##横\n##樫\n##樯\n##樱\n##樵\n##樸\n##樹\n##樺\n##樽\n##樾\n##橄\n##橇\n##橋\n##橐\n##橘\n##橙\n##機\n##橡\n##橢\n##橫\n##橱\n##橹\n##橼\n##檀\n##檄\n##檎\n##檐\n##檔\n##檗\n##檜\n##檢\n##檬\n##檯\n##檳\n##檸\n##檻\n##櫃\n##櫚\n##櫛\n##櫥\n##櫸\n##櫻\n##欄\n##權\n##欒\n##欖\n##欠\n##次\n##欢\n##欣\n##欧\n##欲\n##欸\n##欺\n##欽\n##款\n##歆\n##歇\n##歉\n##歌\n##歎\n##歐\n##歓\n##歙\n##歛\n##歡\n##止\n##正\n##此\n##步\n##武\n##歧\n##歩\n##歪\n##歯\n##歲\n##歳\n##歴\n##歷\n##歸\n##歹\n##死\n##歼\n##殁\n##殃\n##殆\n##殇\n##殉\n##殊\n##残\n##殒\n##殓\n##殖\n##殘\n##殞\n##殡\n##殤\n##殭\n##殯\n##殲\n##殴\n##段\n##殷\n##殺\n##殼\n##殿\n##毀\n##毁\n##毂\n##毅\n##毆\n##毋\n##母\n##毎\n##每\n##毒\n##毓\n##比\n##毕\n##毗\n##毘\n##毙\n##毛\n##毡\n##毫\n##毯\n##毽\n##氈\n##氏\n##氐\n##民\n##氓\n##气\n##氖\n##気\n##氙\n##氛\n##氟\n##氡\n##氢\n##氣\n##氤\n##氦\n##氧\n##氨\n##氪\n##氫\n##氮\n##氯\n##氰\n##氲\n##水\n##氷\n##永\n##氹\n##氾\n##汀\n##汁\n##求\n##汆\n##汇\n##汉\n##汎\n##汐\n##汕\n##汗\n##汙\n##汛\n##汝\n##汞\n##江\n##池\n##污\n##汤\n##汨\n##汩\n##汪\n##汰\n##汲\n##汴\n##汶\n##汹\n##決\n##汽\n##汾\n##沁\n##沂\n##沃\n##沅\n##沈\n##沉\n##沌\n##沏\n##沐\n##沒\n##沓\n##沖\n##沙\n##沛\n##沟\n##没\n##沢\n##沣\n##沥\n##沦\n##沧\n##沪\n##沫\n##沭\n##沮\n##沱\n##河\n##沸\n##油\n##治\n##沼\n##沽\n##沾\n##沿\n##況\n##泄\n##泉\n##泊\n##泌\n##泓\n##法\n##泗\n##泛\n##泞\n##泠\n##泡\n##波\n##泣\n##泥\n##注\n##泪\n##泫\n##泮\n##泯\n##泰\n##泱\n##泳\n##泵\n##泷\n##泸\n##泻\n##泼\n##泽\n##泾\n##洁\n##洄\n##洋\n##洒\n##洗\n##洙\n##洛\n##洞\n##津\n##洩\n##洪\n##洮\n##洱\n##洲\n##洵\n##洶\n##洸\n##洹\n##活\n##洼\n##洽\n##派\n##流\n##浃\n##浄\n##浅\n##浆\n##浇\n##浊\n##测\n##济\n##浏\n##浑\n##浒\n##浓\n##浔\n##浙\n##浚\n##浜\n##浣\n##浦\n##浩\n##浪\n##浬\n##浮\n##浯\n##浴\n##海\n##浸\n##涂\n##涅\n##涇\n##消\n##涉\n##涌\n##涎\n##涓\n##涔\n##涕\n##涙\n##涛\n##涝\n##涞\n##涟\n##涠\n##涡\n##涣\n##涤\n##润\n##涧\n##涨\n##涩\n##涪\n##涮\n##涯\n##液\n##涵\n##涸\n##涼\n##涿\n##淀\n##淄\n##淅\n##淆\n##淇\n##淋\n##淌\n##淑\n##淒\n##淖\n##淘\n##淙\n##淚\n##淞\n##淡\n##淤\n##淦\n##淨\n##淩\n##淪\n##淫\n##淬\n##淮\n##深\n##淳\n##淵\n##混\n##淹\n##淺\n##添\n##淼\n##清\n##済\n##渉\n##渊\n##渋\n##渍\n##渎\n##渐\n##渔\n##渗\n##渙\n##渚\n##減\n##渝\n##渠\n##渡\n##渣\n##渤\n##渥\n##渦\n##温\n##測\n##渭\n##港\n##渲\n##渴\n##游\n##渺\n##渾\n##湃\n##湄\n##湊\n##湍\n##湖\n##湘\n##湛\n##湟\n##湧\n##湫\n##湮\n##湯\n##湳\n##湾\n##湿\n##満\n##溃\n##溅\n##溉\n##溏\n##源\n##準\n##溜\n##溝\n##溟\n##溢\n##溥\n##溧\n##溪\n##溫\n##溯\n##溱\n##溴\n##溶\n##溺\n##溼\n##滁\n##滂\n##滄\n##滅\n##滇\n##滋\n##滌\n##滑\n##滓\n##滔
\n##滕\n##滙\n##滚\n##滝\n##滞\n##滟\n##满\n##滢\n##滤\n##滥\n##滦\n##滨\n##滩\n##滬\n##滯\n##滲\n##滴\n##滷\n##滸\n##滾\n##滿\n##漁\n##漂\n##漆\n##漉\n##漏\n##漓\n##演\n##漕\n##漠\n##漢\n##漣\n##漩\n##漪\n##漫\n##漬\n##漯\n##漱\n##漲\n##漳\n##漸\n##漾\n##漿\n##潆\n##潇\n##潋\n##潍\n##潑\n##潔\n##潘\n##潛\n##潜\n##潞\n##潟\n##潢\n##潤\n##潦\n##潧\n##潭\n##潮\n##潰\n##潴\n##潸\n##潺\n##潼\n##澀\n##澄\n##澆\n##澈\n##澍\n##澎\n##澗\n##澜\n##澡\n##澤\n##澧\n##澱\n##澳\n##澹\n##激\n##濁\n##濂\n##濃\n##濑\n##濒\n##濕\n##濘\n##濛\n##濟\n##濠\n##濡\n##濤\n##濫\n##濬\n##濮\n##濯\n##濱\n##濺\n##濾\n##瀅\n##瀆\n##瀉\n##瀋\n##瀏\n##瀑\n##瀕\n##瀘\n##瀚\n##瀛\n##瀝\n##瀞\n##瀟\n##瀧\n##瀨\n##瀬\n##瀰\n##瀾\n##灌\n##灏\n##灑\n##灘\n##灝\n##灞\n##灣\n##火\n##灬\n##灭\n##灯\n##灰\n##灵\n##灶\n##灸\n##灼\n##災\n##灾\n##灿\n##炀\n##炁\n##炅\n##炉\n##炊\n##炎\n##炒\n##炔\n##炕\n##炖\n##炙\n##炜\n##炫\n##炬\n##炭\n##炮\n##炯\n##炳\n##炷\n##炸\n##点\n##為\n##炼\n##炽\n##烁\n##烂\n##烃\n##烈\n##烊\n##烏\n##烘\n##烙\n##烛\n##烟\n##烤\n##烦\n##烧\n##烨\n##烩\n##烫\n##烬\n##热\n##烯\n##烷\n##烹\n##烽\n##焉\n##焊\n##焕\n##焖\n##焗\n##焘\n##焙\n##焚\n##焜\n##無\n##焦\n##焯\n##焰\n##焱\n##然\n##焼\n##煅\n##煉\n##煊\n##煌\n##煎\n##煒\n##煖\n##煙\n##煜\n##煞\n##煤\n##煥\n##煦\n##照\n##煨\n##煩\n##煮\n##煲\n##煸\n##煽\n##熄\n##熊\n##熏\n##熒\n##熔\n##熙\n##熟\n##熠\n##熨\n##熬\n##熱\n##熵\n##熹\n##熾\n##燁\n##燃\n##燄\n##燈\n##燉\n##燊\n##燎\n##燒\n##燔\n##燕\n##燙\n##燜\n##營\n##燥\n##燦\n##燧\n##燭\n##燮\n##燴\n##燻\n##燼\n##燿\n##爆\n##爍\n##爐\n##爛\n##爪\n##爬\n##爭\n##爰\n##爱\n##爲\n##爵\n##父\n##爷\n##爸\n##爹\n##爺\n##爻\n##爽\n##爾\n##牆\n##片\n##版\n##牌\n##牍\n##牒\n##牙\n##牛\n##牝\n##牟\n##牠\n##牡\n##牢\n##牦\n##牧\n##物\n##牯\n##牲\n##牴\n##牵\n##特\n##牺\n##牽\n##犀\n##犁\n##犄\n##犊\n##犍\n##犒\n##犢\n##犧\n##犬\n##犯\n##状\n##犷\n##犸\n##犹\n##狀\n##狂\n##狄\n##狈\n##狎\n##狐\n##狒\n##狗\n##狙\n##狞\n##狠\n##狡\n##狩\n##独\n##狭\n##狮\n##狰\n##狱\n##狸\n##狹\n##狼\n##狽\n##猎\n##猕\n##猖\n##猗\n##猙\n##猛\n##猜\n##猝\n##猥\n##猩\n##猪\n##猫\n##猬\n##献\n##猴\n##猶\n##猷\n##猾\n##猿\n##獄\n##獅\n##獎\n##獐\n##獒\n##獗\n##獠\n##獣\n##獨\n##獭\n##獰\n##獲\n##獵\n##獷\n##獸\n##獺\n##獻\n##獼\n##獾\n##玄\n##率\n##玉\n##王\n##玑\n##玖\n##玛\n##玟\n##玠\n##玥\n##玩\n##玫\n##玮\n##环\n##现\n##玲\n##玳\n##玷\n##玺\n##玻\n##珀\n##珂\n##珅\n##珈\n##珉\n##珊\n##珍\n##珏\n##珐\n##珑\n##珙\n##珞\n##珠\n##珣\n##珥\n##珩\n##珪\n##班\n##珮\n##珲\n##珺\n##現\n##球\n##琅\n##理\n##琇\n##琉\n##琊\n##琍\n##琏\n##琐\n##琛\n##琢\n##琥\n##琦\n##琨\n##琪\n##琬\n##琮\n##琰\n##琲\n##琳\n##琴\n##琵\n##琶\n##琺\n##琼\n##瑀\n##瑁\n##瑄\n##瑋\n##瑕\n##瑗\n##瑙\n##瑚\n##瑛\n##瑜\n##瑞\n##瑟\n##瑠\n##瑣\n##瑤\n##瑩\n##瑪\n##瑯\n##瑰\n##瑶\n##瑾\n##璀\n##璁\n##璃\n##璇\n##璉\n##璋\n##璎\n##璐\n##璜\n##璞\n##璟\n##璧\n##璨\n##環\n##璽\n##璿\n##瓊\n##瓏\n##瓒\n##瓜\n##瓢\n##瓣\n##瓤\n##瓦\n##瓮\n##瓯\n##瓴\n##瓶\n##瓷\n##甄\n##甌\n##甕\n##甘\n##甙\n##甚\n##甜\n##生\n##產\n##産\n##甥\n##甦\n##用\n##甩\n##甫\n##甬\n##甭\n##甯\n##田\n##由\n##甲\n##申\n##电\n##男\n##甸\n##町\n##画\n##甾\n##畀\n##畅\n##界\n##畏\n##畑\n##畔\n##留\n##畜\n##畝\n##畢\n##略\n##畦\n##番\n##畫\n##異\n##畲\n##畳\n##畴\n##當\n##畸\n##畹\n##畿\n##疆\n##疇\n##疊\n##疏\n##疑\n##疔\n##疖\n##疗\n##疙\n##疚\n##疝\n##疟\n##疡\n##疣\n##疤\n##疥\n##疫\n##疮\n##疯\n##疱\n##疲\n##疳\n##疵\n##疸\n##疹\n##疼\n##疽\n##疾\n##痂\n##病\n##症\n##痈\n##痉\n##痊\n##痍\n##痒\n##痔\n##痕\n##痘\n##痙\n##痛\n##痞\n##痠\n##痢\n##痣\n##痤\n##痧\n##痨\n##痪\n##痫\n##痰\n##痱\n##痴\n##痹\n##痺\n##痼\n##痿\n##瘀\n##瘁\n##瘋\n##瘍\n##瘓\n##瘘\n##瘙\n##瘟\n##瘠\n##瘡\n##瘢\n##瘤\n##瘦\n##瘧\n##瘩\n##瘪\n##瘫\n##瘴\n##瘸\n##瘾\n##療\n##癇\n##癌\n##癒\n##癖\n##癜\n##癞\n##癡\n##癢\n##癣\n##癥\n##癫\n##癬\n##癮\n##癱\n##癲\n##癸\n##発\n##登\n##發\n##白\n##百\n##皂\n##的\n##皆\n##皇\n##皈\n##皋\n##皎\n##皑\n##皓\n##皖\n##皙\n##皚\n##皮\n##皰\n##皱\n##皴\n##皺\n##皿\n##盂\n##盃\n##盅\n##盆\n##盈\n##益\n##盎\n##盏\n##盐\n##监\n##盒\n##盔\n##盖\n##盗\n##盘\n##盛\n##盜\n##盞\n##盟\n##盡\n##監\n##盤\n##盥\n##盧\n##盪\n##目\n##盯\n##盱\n##盲\n##直\n##相\n##盹\n##盼\n##盾\n##省\n##眈\n##眉\n##看\n##県\n##眙\n##眞\n##真\n##眠\n##眦\n##眨\n##眩\n##眯\n##眶\n##眷\n##眸\n##眺\n##眼\n##眾\n##着\n##睁\n##睇\n##睏
\n##睐\n##睑\n##睛\n##睜\n##睞\n##睡\n##睢\n##督\n##睥\n##睦\n##睨\n##睪\n##睫\n##睬\n##睹\n##睽\n##睾\n##睿\n##瞄\n##瞅\n##瞇\n##瞋\n##瞌\n##瞎\n##瞑\n##瞒\n##瞓\n##瞞\n##瞟\n##瞠\n##瞥\n##瞧\n##瞩\n##瞪\n##瞬\n##瞭\n##瞰\n##瞳\n##瞻\n##瞼\n##瞿\n##矇\n##矍\n##矗\n##矚\n##矛\n##矜\n##矢\n##矣\n##知\n##矩\n##矫\n##短\n##矮\n##矯\n##石\n##矶\n##矽\n##矾\n##矿\n##码\n##砂\n##砌\n##砍\n##砒\n##研\n##砖\n##砗\n##砚\n##砝\n##砣\n##砥\n##砧\n##砭\n##砰\n##砲\n##破\n##砷\n##砸\n##砺\n##砼\n##砾\n##础\n##硅\n##硐\n##硒\n##硕\n##硝\n##硫\n##硬\n##确\n##硯\n##硼\n##碁\n##碇\n##碉\n##碌\n##碍\n##碎\n##碑\n##碓\n##碗\n##碘\n##碚\n##碛\n##碟\n##碣\n##碧\n##碩\n##碰\n##碱\n##碳\n##碴\n##確\n##碼\n##碾\n##磁\n##磅\n##磊\n##磋\n##磐\n##磕\n##磚\n##磡\n##磨\n##磬\n##磯\n##磲\n##磷\n##磺\n##礁\n##礎\n##礙\n##礡\n##礦\n##礪\n##礫\n##礴\n##示\n##礼\n##社\n##祀\n##祁\n##祂\n##祇\n##祈\n##祉\n##祎\n##祐\n##祕\n##祖\n##祗\n##祚\n##祛\n##祜\n##祝\n##神\n##祟\n##祠\n##祢\n##祥\n##票\n##祭\n##祯\n##祷\n##祸\n##祺\n##祿\n##禀\n##禁\n##禄\n##禅\n##禍\n##禎\n##福\n##禛\n##禦\n##禧\n##禪\n##禮\n##禱\n##禹\n##禺\n##离\n##禽\n##禾\n##禿\n##秀\n##私\n##秃\n##秆\n##秉\n##秋\n##种\n##科\n##秒\n##秘\n##租\n##秣\n##秤\n##秦\n##秧\n##秩\n##秭\n##积\n##称\n##秸\n##移\n##秽\n##稀\n##稅\n##程\n##稍\n##税\n##稔\n##稗\n##稚\n##稜\n##稞\n##稟\n##稠\n##稣\n##種\n##稱\n##稲\n##稳\n##稷\n##稹\n##稻\n##稼\n##稽\n##稿\n##穀\n##穂\n##穆\n##穌\n##積\n##穎\n##穗\n##穢\n##穩\n##穫\n##穴\n##究\n##穷\n##穹\n##空\n##穿\n##突\n##窃\n##窄\n##窈\n##窍\n##窑\n##窒\n##窓\n##窕\n##窖\n##窗\n##窘\n##窜\n##窝\n##窟\n##窠\n##窥\n##窦\n##窨\n##窩\n##窪\n##窮\n##窯\n##窺\n##窿\n##竄\n##竅\n##竇\n##竊\n##立\n##竖\n##站\n##竜\n##竞\n##竟\n##章\n##竣\n##童\n##竭\n##端\n##競\n##竹\n##竺\n##竽\n##竿\n##笃\n##笆\n##笈\n##笋\n##笏\n##笑\n##笔\n##笙\n##笛\n##笞\n##笠\n##符\n##笨\n##第\n##笹\n##笺\n##笼\n##筆\n##等\n##筊\n##筋\n##筍\n##筏\n##筐\n##筑\n##筒\n##答\n##策\n##筛\n##筝\n##筠\n##筱\n##筲\n##筵\n##筷\n##筹\n##签\n##简\n##箇\n##箋\n##箍\n##箏\n##箐\n##箔\n##箕\n##算\n##箝\n##管\n##箩\n##箫\n##箭\n##箱\n##箴\n##箸\n##節\n##篁\n##範\n##篆\n##篇\n##築\n##篑\n##篓\n##篙\n##篝\n##篠\n##篡\n##篤\n##篩\n##篪\n##篮\n##篱\n##篷\n##簇\n##簌\n##簍\n##簡\n##簦\n##簧\n##簪\n##簫\n##簷\n##簸\n##簽\n##簾\n##簿\n##籁\n##籃\n##籌\n##籍\n##籐\n##籟\n##籠\n##籤\n##籬\n##籮\n##籲\n##米\n##类\n##籼\n##籽\n##粄\n##粉\n##粑\n##粒\n##粕\n##粗\n##粘\n##粟\n##粤\n##粥\n##粧\n##粪\n##粮\n##粱\n##粲\n##粳\n##粵\n##粹\n##粼\n##粽\n##精\n##粿\n##糅\n##糊\n##糍\n##糕\n##糖\n##糗\n##糙\n##糜\n##糞\n##糟\n##糠\n##糧\n##糬\n##糯\n##糰\n##糸\n##系\n##糾\n##紀\n##紂\n##約\n##紅\n##紉\n##紊\n##紋\n##納\n##紐\n##紓\n##純\n##紗\n##紘\n##紙\n##級\n##紛\n##紜\n##素\n##紡\n##索\n##紧\n##紫\n##紮\n##累\n##細\n##紳\n##紹\n##紺\n##終\n##絃\n##組\n##絆\n##経\n##結\n##絕\n##絞\n##絡\n##絢\n##給\n##絨\n##絮\n##統\n##絲\n##絳\n##絵\n##絶\n##絹\n##綁\n##綏\n##綑\n##經\n##継\n##続\n##綜\n##綠\n##綢\n##綦\n##綫\n##綬\n##維\n##綱\n##網\n##綴\n##綵\n##綸\n##綺\n##綻\n##綽\n##綾\n##綿\n##緊\n##緋\n##総\n##緑\n##緒\n##緘\n##線\n##緝\n##緞\n##締\n##緣\n##編\n##緩\n##緬\n##緯\n##練\n##緹\n##緻\n##縁\n##縄\n##縈\n##縛\n##縝\n##縣\n##縫\n##縮\n##縱\n##縴\n##縷\n##總\n##績\n##繁\n##繃\n##繆\n##繇\n##繋\n##織\n##繕\n##繚\n##繞\n##繡\n##繩\n##繪\n##繫\n##繭\n##繳\n##繹\n##繼\n##繽\n##纂\n##續\n##纍\n##纏\n##纓\n##纔\n##纖\n##纜\n##纠\n##红\n##纣\n##纤\n##约\n##级\n##纨\n##纪\n##纫\n##纬\n##纭\n##纯\n##纰\n##纱\n##纲\n##纳\n##纵\n##纶\n##纷\n##纸\n##纹\n##纺\n##纽\n##纾\n##线\n##绀\n##练\n##组\n##绅\n##细\n##织\n##终\n##绊\n##绍\n##绎\n##经\n##绑\n##绒\n##结\n##绔\n##绕\n##绘\n##给\n##绚\n##绛\n##络\n##绝\n##绞\n##统\n##绡\n##绢\n##绣\n##绥\n##绦\n##继\n##绩\n##绪\n##绫\n##续\n##绮\n##绯\n##绰\n##绳\n##维\n##绵\n##绶\n##绷\n##绸\n##绻\n##综\n##绽\n##绾\n##绿\n##缀\n##缄\n##缅\n##缆\n##缇\n##缈\n##缉\n##缎\n##缓\n##缔\n##缕\n##编\n##缘\n##缙\n##缚\n##缜\n##缝\n##缠\n##缢\n##缤\n##缥\n##缨\n##缩\n##缪\n##缭\n##缮\n##缰\n##缱\n##缴\n##缸\n##缺\n##缽\n##罂\n##罄\n##罌\n##罐\n##网\n##罔\n##罕\n##罗\n##罚\n##罡\n##罢\n##罩\n##罪\n##置\n##罰\n##署\n##罵\n##罷\n##罹\n##羁\n##羅\n##羈\n##羊\n##羌\n##美\n##羔\n##羚\n##羞\n##羟\n##羡\n##羣\n##群\n##羥\n##羧\n##羨\n##義\n##羯\n##羲\n##羸\n##羹\n##羽\n##羿\n##翁\n##翅\n##翊\n##翌
\n##翎\n##習\n##翔\n##翘\n##翟\n##翠\n##翡\n##翦\n##翩\n##翰\n##翱\n##翳\n##翹\n##翻\n##翼\n##耀\n##老\n##考\n##耄\n##者\n##耆\n##耋\n##而\n##耍\n##耐\n##耒\n##耕\n##耗\n##耘\n##耙\n##耦\n##耨\n##耳\n##耶\n##耷\n##耸\n##耻\n##耽\n##耿\n##聂\n##聆\n##聊\n##聋\n##职\n##聒\n##联\n##聖\n##聘\n##聚\n##聞\n##聪\n##聯\n##聰\n##聲\n##聳\n##聴\n##聶\n##職\n##聽\n##聾\n##聿\n##肃\n##肄\n##肅\n##肆\n##肇\n##肉\n##肋\n##肌\n##肏\n##肓\n##肖\n##肘\n##肚\n##肛\n##肝\n##肠\n##股\n##肢\n##肤\n##肥\n##肩\n##肪\n##肮\n##肯\n##肱\n##育\n##肴\n##肺\n##肽\n##肾\n##肿\n##胀\n##胁\n##胃\n##胄\n##胆\n##背\n##胍\n##胎\n##胖\n##胚\n##胛\n##胜\n##胝\n##胞\n##胡\n##胤\n##胥\n##胧\n##胫\n##胭\n##胯\n##胰\n##胱\n##胳\n##胴\n##胶\n##胸\n##胺\n##能\n##脂\n##脅\n##脆\n##脇\n##脈\n##脉\n##脊\n##脍\n##脏\n##脐\n##脑\n##脓\n##脖\n##脘\n##脚\n##脛\n##脣\n##脩\n##脫\n##脯\n##脱\n##脲\n##脳\n##脸\n##脹\n##脾\n##腆\n##腈\n##腊\n##腋\n##腌\n##腎\n##腐\n##腑\n##腓\n##腔\n##腕\n##腥\n##腦\n##腩\n##腫\n##腭\n##腮\n##腰\n##腱\n##腳\n##腴\n##腸\n##腹\n##腺\n##腻\n##腼\n##腾\n##腿\n##膀\n##膈\n##膊\n##膏\n##膑\n##膘\n##膚\n##膛\n##膜\n##膝\n##膠\n##膦\n##膨\n##膩\n##膳\n##膺\n##膻\n##膽\n##膾\n##膿\n##臀\n##臂\n##臃\n##臆\n##臉\n##臊\n##臍\n##臓\n##臘\n##臟\n##臣\n##臥\n##臧\n##臨\n##自\n##臬\n##臭\n##至\n##致\n##臺\n##臻\n##臼\n##臾\n##舀\n##舂\n##舅\n##舆\n##與\n##興\n##舉\n##舊\n##舌\n##舍\n##舎\n##舐\n##舒\n##舔\n##舖\n##舗\n##舛\n##舜\n##舞\n##舟\n##航\n##舫\n##般\n##舰\n##舱\n##舵\n##舶\n##舷\n##舸\n##船\n##舺\n##舾\n##艇\n##艋\n##艘\n##艙\n##艦\n##艮\n##良\n##艰\n##艱\n##色\n##艳\n##艷\n##艹\n##艺\n##艾\n##节\n##芃\n##芈\n##芊\n##芋\n##芍\n##芎\n##芒\n##芙\n##芜\n##芝\n##芡\n##芥\n##芦\n##芩\n##芪\n##芫\n##芬\n##芭\n##芮\n##芯\n##花\n##芳\n##芷\n##芸\n##芹\n##芻\n##芽\n##芾\n##苁\n##苄\n##苇\n##苋\n##苍\n##苏\n##苑\n##苒\n##苓\n##苔\n##苕\n##苗\n##苛\n##苜\n##苞\n##苟\n##苡\n##苣\n##若\n##苦\n##苫\n##苯\n##英\n##苷\n##苹\n##苻\n##茁\n##茂\n##范\n##茄\n##茅\n##茉\n##茎\n##茏\n##茗\n##茜\n##茧\n##茨\n##茫\n##茬\n##茭\n##茯\n##茱\n##茲\n##茴\n##茵\n##茶\n##茸\n##茹\n##茼\n##荀\n##荃\n##荆\n##草\n##荊\n##荏\n##荐\n##荒\n##荔\n##荖\n##荘\n##荚\n##荞\n##荟\n##荠\n##荡\n##荣\n##荤\n##荥\n##荧\n##荨\n##荪\n##荫\n##药\n##荳\n##荷\n##荸\n##荻\n##荼\n##荽\n##莅\n##莆\n##莉\n##莊\n##莎\n##莒\n##莓\n##莖\n##莘\n##莞\n##莠\n##莢\n##莧\n##莪\n##莫\n##莱\n##莲\n##莴\n##获\n##莹\n##莺\n##莽\n##莿\n##菀\n##菁\n##菅\n##菇\n##菈\n##菊\n##菌\n##菏\n##菓\n##菖\n##菘\n##菜\n##菟\n##菠\n##菡\n##菩\n##華\n##菱\n##菲\n##菸\n##菽\n##萁\n##萃\n##萄\n##萊\n##萋\n##萌\n##萍\n##萎\n##萘\n##萝\n##萤\n##营\n##萦\n##萧\n##萨\n##萩\n##萬\n##萱\n##萵\n##萸\n##萼\n##落\n##葆\n##葉\n##著\n##葚\n##葛\n##葡\n##董\n##葦\n##葩\n##葫\n##葬\n##葭\n##葯\n##葱\n##葳\n##葵\n##葷\n##葺\n##蒂\n##蒋\n##蒐\n##蒔\n##蒙\n##蒜\n##蒞\n##蒟\n##蒡\n##蒨\n##蒲\n##蒸\n##蒹\n##蒻\n##蒼\n##蒿\n##蓁\n##蓄\n##蓆\n##蓉\n##蓋\n##蓑\n##蓓\n##蓖\n##蓝\n##蓟\n##蓦\n##蓬\n##蓮\n##蓼\n##蓿\n##蔑\n##蔓\n##蔔\n##蔗\n##蔘\n##蔚\n##蔡\n##蔣\n##蔥\n##蔫\n##蔬\n##蔭\n##蔵\n##蔷\n##蔺\n##蔻\n##蔼\n##蔽\n##蕁\n##蕃\n##蕈\n##蕉\n##蕊\n##蕎\n##蕙\n##蕤\n##蕨\n##蕩\n##蕪\n##蕭\n##蕲\n##蕴\n##蕻\n##蕾\n##薄\n##薅\n##薇\n##薈\n##薊\n##薏\n##薑\n##薔\n##薙\n##薛\n##薦\n##薨\n##薩\n##薪\n##薬\n##薯\n##薰\n##薹\n##藉\n##藍\n##藏\n##藐\n##藓\n##藕\n##藜\n##藝\n##藤\n##藥\n##藩\n##藹\n##藻\n##藿\n##蘆\n##蘇\n##蘊\n##蘋\n##蘑\n##蘚\n##蘭\n##蘸\n##蘼\n##蘿\n##虎\n##虏\n##虐\n##虑\n##虔\n##處\n##虚\n##虛\n##虜\n##虞\n##號\n##虢\n##虧\n##虫\n##虬\n##虱\n##虹\n##虻\n##虽\n##虾\n##蚀\n##蚁\n##蚂\n##蚊\n##蚌\n##蚓\n##蚕\n##蚜\n##蚝\n##蚣\n##蚤\n##蚩\n##蚪\n##蚯\n##蚱\n##蚵\n##蛀\n##蛆\n##蛇\n##蛊\n##蛋\n##蛎\n##蛐\n##蛔\n##蛙\n##蛛\n##蛟\n##蛤\n##蛭\n##蛮\n##蛰\n##蛳\n##蛹\n##蛻\n##蛾\n##蜀\n##蜂\n##蜃\n##蜆\n##蜇\n##蜈\n##蜊\n##蜍\n##蜒\n##蜓\n##蜕\n##蜗\n##蜘\n##蜚\n##蜜\n##蜡\n##蜢\n##蜥\n##蜱\n##蜴\n##蜷\n##蜻\n##蜿\n##蝇\n##蝈\n##蝉\n##蝌\n##蝎\n##蝕\n##蝗\n##蝙\n##蝟\n##蝠\n##蝦\n##蝨\n##蝴\n##蝶\n##蝸\n##蝼\n##螂\n##螃\n##融\n##螞\n##螢\n##螨\n##螯\n##螳\n##螺\n##蟀\n##蟄\n##蟆\n##蟋\n##蟎\n##蟑\n##蟒\n##蟠\n##蟬\n##蟲\n##蟹\n##蟻\n##蟾\n##蠅\n##蠍\n##蠔\n##蠕\n##蠛\n##蠟\n##蠡\n##蠢\n##蠣\n##蠱\n##蠶\n##蠹\n##蠻\n##血\n##衄\n##衅\n##衆\n##行\n##衍\n##術\n##衔\n##街\n##衙\n##衛\n##衝\n##衞\n##衡\n##衢\n##衣\n##补
\n##表\n##衩\n##衫\n##衬\n##衮\n##衰\n##衲\n##衷\n##衹\n##衾\n##衿\n##袁\n##袂\n##袄\n##袅\n##袈\n##袋\n##袍\n##袒\n##袖\n##袜\n##袞\n##袤\n##袪\n##被\n##袭\n##袱\n##裁\n##裂\n##装\n##裆\n##裊\n##裏\n##裔\n##裕\n##裘\n##裙\n##補\n##裝\n##裟\n##裡\n##裤\n##裨\n##裱\n##裳\n##裴\n##裸\n##裹\n##製\n##裾\n##褂\n##複\n##褐\n##褒\n##褓\n##褔\n##褚\n##褥\n##褪\n##褫\n##褲\n##褶\n##褻\n##襁\n##襄\n##襟\n##襠\n##襪\n##襬\n##襯\n##襲\n##西\n##要\n##覃\n##覆\n##覇\n##見\n##規\n##覓\n##視\n##覚\n##覦\n##覧\n##親\n##覬\n##観\n##覷\n##覺\n##覽\n##觀\n##见\n##观\n##规\n##觅\n##视\n##览\n##觉\n##觊\n##觎\n##觐\n##觑\n##角\n##觞\n##解\n##觥\n##触\n##觸\n##言\n##訂\n##計\n##訊\n##討\n##訓\n##訕\n##訖\n##託\n##記\n##訛\n##訝\n##訟\n##訣\n##訥\n##訪\n##設\n##許\n##訳\n##訴\n##訶\n##診\n##註\n##証\n##詆\n##詐\n##詔\n##評\n##詛\n##詞\n##詠\n##詡\n##詢\n##詣\n##試\n##詩\n##詫\n##詬\n##詭\n##詮\n##詰\n##話\n##該\n##詳\n##詹\n##詼\n##誅\n##誇\n##誉\n##誌\n##認\n##誓\n##誕\n##誘\n##語\n##誠\n##誡\n##誣\n##誤\n##誥\n##誦\n##誨\n##說\n##説\n##読\n##誰\n##課\n##誹\n##誼\n##調\n##諄\n##談\n##請\n##諏\n##諒\n##論\n##諗\n##諜\n##諡\n##諦\n##諧\n##諫\n##諭\n##諮\n##諱\n##諳\n##諷\n##諸\n##諺\n##諾\n##謀\n##謁\n##謂\n##謄\n##謊\n##謎\n##謐\n##謔\n##謗\n##謙\n##講\n##謝\n##謠\n##謨\n##謬\n##謹\n##謾\n##譁\n##證\n##譎\n##譏\n##識\n##譙\n##譚\n##譜\n##警\n##譬\n##譯\n##議\n##譲\n##譴\n##護\n##譽\n##讀\n##變\n##讓\n##讚\n##讞\n##计\n##订\n##认\n##讥\n##讧\n##讨\n##让\n##讪\n##讫\n##训\n##议\n##讯\n##记\n##讲\n##讳\n##讴\n##讶\n##讷\n##许\n##讹\n##论\n##讼\n##讽\n##设\n##访\n##诀\n##证\n##诃\n##评\n##诅\n##识\n##诈\n##诉\n##诊\n##诋\n##词\n##诏\n##译\n##试\n##诗\n##诘\n##诙\n##诚\n##诛\n##话\n##诞\n##诟\n##诠\n##诡\n##询\n##诣\n##诤\n##该\n##详\n##诧\n##诩\n##诫\n##诬\n##语\n##误\n##诰\n##诱\n##诲\n##说\n##诵\n##诶\n##请\n##诸\n##诺\n##读\n##诽\n##课\n##诿\n##谀\n##谁\n##调\n##谄\n##谅\n##谆\n##谈\n##谊\n##谋\n##谌\n##谍\n##谎\n##谏\n##谐\n##谑\n##谒\n##谓\n##谔\n##谕\n##谗\n##谘\n##谙\n##谚\n##谛\n##谜\n##谟\n##谢\n##谣\n##谤\n##谥\n##谦\n##谧\n##谨\n##谩\n##谪\n##谬\n##谭\n##谯\n##谱\n##谲\n##谴\n##谶\n##谷\n##豁\n##豆\n##豇\n##豈\n##豉\n##豊\n##豌\n##豎\n##豐\n##豔\n##豚\n##象\n##豢\n##豪\n##豫\n##豬\n##豹\n##豺\n##貂\n##貅\n##貌\n##貓\n##貔\n##貘\n##貝\n##貞\n##負\n##財\n##貢\n##貧\n##貨\n##販\n##貪\n##貫\n##責\n##貯\n##貰\n##貳\n##貴\n##貶\n##買\n##貸\n##費\n##貼\n##貽\n##貿\n##賀\n##賁\n##賂\n##賃\n##賄\n##資\n##賈\n##賊\n##賑\n##賓\n##賜\n##賞\n##賠\n##賡\n##賢\n##賣\n##賤\n##賦\n##質\n##賬\n##賭\n##賴\n##賺\n##購\n##賽\n##贅\n##贈\n##贊\n##贍\n##贏\n##贓\n##贖\n##贛\n##贝\n##贞\n##负\n##贡\n##财\n##责\n##贤\n##败\n##账\n##货\n##质\n##贩\n##贪\n##贫\n##贬\n##购\n##贮\n##贯\n##贰\n##贱\n##贲\n##贴\n##贵\n##贷\n##贸\n##费\n##贺\n##贻\n##贼\n##贾\n##贿\n##赁\n##赂\n##赃\n##资\n##赅\n##赈\n##赊\n##赋\n##赌\n##赎\n##赏\n##赐\n##赓\n##赔\n##赖\n##赘\n##赚\n##赛\n##赝\n##赞\n##赠\n##赡\n##赢\n##赣\n##赤\n##赦\n##赧\n##赫\n##赭\n##走\n##赳\n##赴\n##赵\n##赶\n##起\n##趁\n##超\n##越\n##趋\n##趕\n##趙\n##趟\n##趣\n##趨\n##足\n##趴\n##趵\n##趸\n##趺\n##趾\n##跃\n##跄\n##跆\n##跋\n##跌\n##跎\n##跑\n##跖\n##跚\n##跛\n##距\n##跟\n##跡\n##跤\n##跨\n##跩\n##跪\n##路\n##跳\n##践\n##跷\n##跹\n##跺\n##跻\n##踉\n##踊\n##踌\n##踏\n##踐\n##踝\n##踞\n##踟\n##踢\n##踩\n##踪\n##踮\n##踱\n##踴\n##踵\n##踹\n##蹂\n##蹄\n##蹇\n##蹈\n##蹉\n##蹊\n##蹋\n##蹑\n##蹒\n##蹙\n##蹟\n##蹣\n##蹤\n##蹦\n##蹩\n##蹬\n##蹭\n##蹲\n##蹴\n##蹶\n##蹺\n##蹼\n##蹿\n##躁\n##躇\n##躉\n##躊\n##躋\n##躍\n##躏\n##躪\n##身\n##躬\n##躯\n##躲\n##躺\n##軀\n##車\n##軋\n##軌\n##軍\n##軒\n##軟\n##転\n##軸\n##軼\n##軽\n##軾\n##較\n##載\n##輒\n##輓\n##輔\n##輕\n##輛\n##輝\n##輟\n##輩\n##輪\n##輯\n##輸\n##輻\n##輾\n##輿\n##轄\n##轅\n##轆\n##轉\n##轍\n##轎\n##轟\n##车\n##轧\n##轨\n##轩\n##转\n##轭\n##轮\n##软\n##轰\n##轲\n##轴\n##轶\n##轻\n##轼\n##载\n##轿\n##较\n##辄\n##辅\n##辆\n##辇\n##辈\n##辉\n##辊\n##辍\n##辐\n##辑\n##输\n##辕\n##辖\n##辗\n##辘\n##辙\n##辛\n##辜\n##辞\n##辟\n##辣\n##辦\n##辨\n##辩\n##辫\n##辭\n##辮\n##辯\n##辰\n##辱\n##農\n##边\n##辺\n##辻\n##込\n##辽\n##达\n##迁\n##迂\n##迄\n##迅\n##过\n##迈\n##迎\n##运\n##近\n##返\n##还\n##这\n##进\n##远\n##违\n##连\n##迟\n##迢\n##迤\n##迥\n##迦\n##迩\n##迪\n##迫\n##迭\n##述\n##迴\n##迷\n##迸\n##迹\n##迺\n##追\n##退\n##送\n##适
\n##逃\n##逅\n##逆\n##选\n##逊\n##逍\n##透\n##逐\n##递\n##途\n##逕\n##逗\n##這\n##通\n##逛\n##逝\n##逞\n##速\n##造\n##逢\n##連\n##逮\n##週\n##進\n##逵\n##逶\n##逸\n##逻\n##逼\n##逾\n##遁\n##遂\n##遅\n##遇\n##遊\n##運\n##遍\n##過\n##遏\n##遐\n##遑\n##遒\n##道\n##達\n##違\n##遗\n##遙\n##遛\n##遜\n##遞\n##遠\n##遢\n##遣\n##遥\n##遨\n##適\n##遭\n##遮\n##遲\n##遴\n##遵\n##遶\n##遷\n##選\n##遺\n##遼\n##遽\n##避\n##邀\n##邁\n##邂\n##邃\n##還\n##邇\n##邈\n##邊\n##邋\n##邏\n##邑\n##邓\n##邕\n##邛\n##邝\n##邢\n##那\n##邦\n##邨\n##邪\n##邬\n##邮\n##邯\n##邰\n##邱\n##邳\n##邵\n##邸\n##邹\n##邺\n##邻\n##郁\n##郅\n##郊\n##郎\n##郑\n##郜\n##郝\n##郡\n##郢\n##郤\n##郦\n##郧\n##部\n##郫\n##郭\n##郴\n##郵\n##郷\n##郸\n##都\n##鄂\n##鄉\n##鄒\n##鄔\n##鄙\n##鄞\n##鄢\n##鄧\n##鄭\n##鄰\n##鄱\n##鄲\n##鄺\n##酉\n##酊\n##酋\n##酌\n##配\n##酐\n##酒\n##酗\n##酚\n##酝\n##酢\n##酣\n##酥\n##酩\n##酪\n##酬\n##酮\n##酯\n##酰\n##酱\n##酵\n##酶\n##酷\n##酸\n##酿\n##醃\n##醇\n##醉\n##醋\n##醍\n##醐\n##醒\n##醚\n##醛\n##醜\n##醞\n##醣\n##醪\n##醫\n##醬\n##醮\n##醯\n##醴\n##醺\n##釀\n##釁\n##采\n##釉\n##释\n##釋\n##里\n##重\n##野\n##量\n##釐\n##金\n##釗\n##釘\n##釜\n##針\n##釣\n##釦\n##釧\n##釵\n##鈀\n##鈉\n##鈍\n##鈎\n##鈔\n##鈕\n##鈞\n##鈣\n##鈦\n##鈪\n##鈴\n##鈺\n##鈾\n##鉀\n##鉄\n##鉅\n##鉉\n##鉑\n##鉗\n##鉚\n##鉛\n##鉤\n##鉴\n##鉻\n##銀\n##銃\n##銅\n##銑\n##銓\n##銖\n##銘\n##銜\n##銬\n##銭\n##銮\n##銳\n##銷\n##銹\n##鋁\n##鋅\n##鋒\n##鋤\n##鋪\n##鋰\n##鋸\n##鋼\n##錄\n##錐\n##錘\n##錚\n##錠\n##錢\n##錦\n##錨\n##錫\n##錮\n##錯\n##録\n##錳\n##錶\n##鍊\n##鍋\n##鍍\n##鍛\n##鍥\n##鍰\n##鍵\n##鍺\n##鍾\n##鎂\n##鎊\n##鎌\n##鎏\n##鎔\n##鎖\n##鎗\n##鎚\n##鎧\n##鎬\n##鎮\n##鎳\n##鏈\n##鏖\n##鏗\n##鏘\n##鏞\n##鏟\n##鏡\n##鏢\n##鏤\n##鏽\n##鐘\n##鐮\n##鐲\n##鐳\n##鐵\n##鐸\n##鐺\n##鑄\n##鑊\n##鑑\n##鑒\n##鑣\n##鑫\n##鑰\n##鑲\n##鑼\n##鑽\n##鑾\n##鑿\n##针\n##钉\n##钊\n##钎\n##钏\n##钒\n##钓\n##钗\n##钙\n##钛\n##钜\n##钝\n##钞\n##钟\n##钠\n##钡\n##钢\n##钣\n##钤\n##钥\n##钦\n##钧\n##钨\n##钩\n##钮\n##钯\n##钰\n##钱\n##钳\n##钴\n##钵\n##钺\n##钻\n##钼\n##钾\n##钿\n##铀\n##铁\n##铂\n##铃\n##铄\n##铅\n##铆\n##铉\n##铎\n##铐\n##铛\n##铜\n##铝\n##铠\n##铡\n##铢\n##铣\n##铤\n##铨\n##铩\n##铬\n##铭\n##铮\n##铰\n##铲\n##铵\n##银\n##铸\n##铺\n##链\n##铿\n##销\n##锁\n##锂\n##锄\n##锅\n##锆\n##锈\n##锉\n##锋\n##锌\n##锏\n##锐\n##锑\n##错\n##锚\n##锟\n##锡\n##锢\n##锣\n##锤\n##锥\n##锦\n##锭\n##键\n##锯\n##锰\n##锲\n##锵\n##锹\n##锺\n##锻\n##镀\n##镁\n##镂\n##镇\n##镉\n##镌\n##镍\n##镐\n##镑\n##镕\n##镖\n##镗\n##镛\n##镜\n##镣\n##镭\n##镯\n##镰\n##镳\n##镶\n##長\n##长\n##門\n##閃\n##閉\n##開\n##閎\n##閏\n##閑\n##閒\n##間\n##閔\n##閘\n##閡\n##関\n##閣\n##閥\n##閨\n##閩\n##閱\n##閲\n##閹\n##閻\n##閾\n##闆\n##闇\n##闊\n##闌\n##闍\n##闔\n##闕\n##闖\n##闘\n##關\n##闡\n##闢\n##门\n##闪\n##闫\n##闭\n##问\n##闯\n##闰\n##闲\n##间\n##闵\n##闷\n##闸\n##闹\n##闺\n##闻\n##闽\n##闾\n##阀\n##阁\n##阂\n##阅\n##阆\n##阇\n##阈\n##阉\n##阎\n##阐\n##阑\n##阔\n##阕\n##阖\n##阙\n##阚\n##阜\n##队\n##阡\n##阪\n##阮\n##阱\n##防\n##阳\n##阴\n##阵\n##阶\n##阻\n##阿\n##陀\n##陂\n##附\n##际\n##陆\n##陇\n##陈\n##陋\n##陌\n##降\n##限\n##陕\n##陛\n##陝\n##陞\n##陟\n##陡\n##院\n##陣\n##除\n##陨\n##险\n##陪\n##陰\n##陲\n##陳\n##陵\n##陶\n##陷\n##陸\n##険\n##陽\n##隅\n##隆\n##隈\n##隊\n##隋\n##隍\n##階\n##随\n##隐\n##隔\n##隕\n##隘\n##隙\n##際\n##障\n##隠\n##隣\n##隧\n##隨\n##險\n##隱\n##隴\n##隶\n##隸\n##隻\n##隼\n##隽\n##难\n##雀\n##雁\n##雄\n##雅\n##集\n##雇\n##雉\n##雋\n##雌\n##雍\n##雎\n##雏\n##雑\n##雒\n##雕\n##雖\n##雙\n##雛\n##雜\n##雞\n##離\n##難\n##雨\n##雪\n##雯\n##雰\n##雲\n##雳\n##零\n##雷\n##雹\n##電\n##雾\n##需\n##霁\n##霄\n##霆\n##震\n##霈\n##霉\n##霊\n##霍\n##霎\n##霏\n##霑\n##霓\n##霖\n##霜\n##霞\n##霧\n##霭\n##霰\n##露\n##霸\n##霹\n##霽\n##霾\n##靂\n##靄\n##靈\n##青\n##靓\n##靖\n##静\n##靚\n##靛\n##靜\n##非\n##靠\n##靡\n##面\n##靥\n##靦\n##革\n##靳\n##靴\n##靶\n##靼\n##鞅\n##鞋\n##鞍\n##鞏\n##鞑\n##鞘\n##鞠\n##鞣\n##鞦\n##鞭\n##韆\n##韋\n##韌\n##韓\n##韜\n##韦\n##韧\n##韩\n##韬\n##韭\n##音\n##韵\n##韶\n##韻\n##響\n##頁\n##頂\n##頃\n##項\n##順\n##須\n##頌\n##預\n##頑\n##頒\n##頓\n##頗\n##領\n##頜\n##頡\n##頤\n##頫\n##頭\n##頰\n##頷\n##頸\n##頹\n##頻\n##頼\n##顆\n##題\n##額\n##顎\n##顏\n##顔\n##願\n##顛\n##類\n##顧\n##顫\n##顯\n##顱\n##顴\n##页\n##顶\n##顷\n##项
\n##顺\n##须\n##顼\n##顽\n##顾\n##顿\n##颁\n##颂\n##预\n##颅\n##领\n##颇\n##颈\n##颉\n##颊\n##颌\n##颍\n##颐\n##频\n##颓\n##颔\n##颖\n##颗\n##题\n##颚\n##颛\n##颜\n##额\n##颞\n##颠\n##颡\n##颢\n##颤\n##颦\n##颧\n##風\n##颯\n##颱\n##颳\n##颶\n##颼\n##飄\n##飆\n##风\n##飒\n##飓\n##飕\n##飘\n##飙\n##飚\n##飛\n##飞\n##食\n##飢\n##飨\n##飩\n##飪\n##飯\n##飲\n##飼\n##飽\n##飾\n##餃\n##餅\n##餉\n##養\n##餌\n##餐\n##餒\n##餓\n##餘\n##餚\n##餛\n##餞\n##餡\n##館\n##餮\n##餵\n##餾\n##饅\n##饈\n##饋\n##饌\n##饍\n##饑\n##饒\n##饕\n##饗\n##饞\n##饥\n##饨\n##饪\n##饬\n##饭\n##饮\n##饯\n##饰\n##饱\n##饲\n##饴\n##饵\n##饶\n##饷\n##饺\n##饼\n##饽\n##饿\n##馀\n##馁\n##馄\n##馅\n##馆\n##馈\n##馋\n##馍\n##馏\n##馒\n##馔\n##首\n##馗\n##香\n##馥\n##馨\n##馬\n##馭\n##馮\n##馳\n##馴\n##駁\n##駄\n##駅\n##駆\n##駐\n##駒\n##駕\n##駛\n##駝\n##駭\n##駱\n##駿\n##騁\n##騎\n##騏\n##験\n##騙\n##騨\n##騰\n##騷\n##驀\n##驅\n##驊\n##驍\n##驒\n##驕\n##驗\n##驚\n##驛\n##驟\n##驢\n##驥\n##马\n##驭\n##驮\n##驯\n##驰\n##驱\n##驳\n##驴\n##驶\n##驷\n##驸\n##驹\n##驻\n##驼\n##驾\n##驿\n##骁\n##骂\n##骄\n##骅\n##骆\n##骇\n##骈\n##骊\n##骋\n##验\n##骏\n##骐\n##骑\n##骗\n##骚\n##骛\n##骜\n##骞\n##骠\n##骡\n##骤\n##骥\n##骧\n##骨\n##骯\n##骰\n##骶\n##骷\n##骸\n##骼\n##髂\n##髅\n##髋\n##髏\n##髒\n##髓\n##體\n##髖\n##高\n##髦\n##髪\n##髮\n##髯\n##髻\n##鬃\n##鬆\n##鬍\n##鬓\n##鬚\n##鬟\n##鬢\n##鬣\n##鬥\n##鬧\n##鬱\n##鬼\n##魁\n##魂\n##魄\n##魅\n##魇\n##魍\n##魏\n##魔\n##魘\n##魚\n##魯\n##魷\n##鮑\n##鮨\n##鮪\n##鮭\n##鮮\n##鯉\n##鯊\n##鯖\n##鯛\n##鯨\n##鯰\n##鯽\n##鰍\n##鰓\n##鰭\n##鰲\n##鰻\n##鰾\n##鱈\n##鱉\n##鱔\n##鱗\n##鱷\n##鱸\n##鱼\n##鱿\n##鲁\n##鲈\n##鲍\n##鲑\n##鲛\n##鲜\n##鲟\n##鲢\n##鲤\n##鲨\n##鲫\n##鲱\n##鲲\n##鲶\n##鲷\n##鲸\n##鳃\n##鳄\n##鳅\n##鳌\n##鳍\n##鳕\n##鳖\n##鳗\n##鳝\n##鳞\n##鳥\n##鳩\n##鳳\n##鳴\n##鳶\n##鴉\n##鴕\n##鴛\n##鴦\n##鴨\n##鴻\n##鴿\n##鵑\n##鵜\n##鵝\n##鵡\n##鵬\n##鵰\n##鵲\n##鶘\n##鶩\n##鶯\n##鶴\n##鷗\n##鷲\n##鷹\n##鷺\n##鸚\n##鸞\n##鸟\n##鸠\n##鸡\n##鸢\n##鸣\n##鸥\n##鸦\n##鸨\n##鸪\n##鸭\n##鸯\n##鸳\n##鸵\n##鸽\n##鸾\n##鸿\n##鹂\n##鹃\n##鹄\n##鹅\n##鹈\n##鹉\n##鹊\n##鹌\n##鹏\n##鹑\n##鹕\n##鹘\n##鹜\n##鹞\n##鹤\n##鹦\n##鹧\n##鹫\n##鹭\n##鹰\n##鹳\n##鹵\n##鹹\n##鹼\n##鹽\n##鹿\n##麂\n##麋\n##麒\n##麓\n##麗\n##麝\n##麟\n##麥\n##麦\n##麩\n##麴\n##麵\n##麸\n##麺\n##麻\n##麼\n##麽\n##麾\n##黃\n##黄\n##黍\n##黎\n##黏\n##黑\n##黒\n##黔\n##默\n##黛\n##黜\n##黝\n##點\n##黠\n##黨\n##黯\n##黴\n##鼋\n##鼎\n##鼐\n##鼓\n##鼠\n##鼬\n##鼹\n##鼻\n##鼾\n##齁\n##齊\n##齋\n##齐\n##齒\n##齡\n##齢\n##齣\n##齦\n##齿\n##龄\n##龅\n##龈\n##龊\n##龋\n##龌\n##龍\n##龐\n##龔\n##龕\n##龙\n##龚\n##龛\n##龜\n##龟\n##︰\n##︱\n##︶\n##︿\n##﹁\n##﹂\n##﹍\n##﹏\n##﹐\n##﹑\n##﹒\n##﹔\n##﹕\n##﹖\n##﹗\n##﹙\n##﹚\n##﹝\n##﹞\n##﹡\n##﹣\n##！\n##＂\n##＃\n##＄\n##％\n##＆\n##＇\n##（\n##）\n##＊\n##，\n##－\n##．\n##／\n##：\n##；\n##＜\n##？\n##＠\n##［\n##＼\n##］\n##＾\n##＿\n##｀\n##ｆ\n##ｈ\n##ｊ\n##ｕ\n##ｗ\n##ｚ\n##｛\n##｝\n##｡\n##｢\n##｣\n##､\n##･\n##ｯ\n##ｰ\n##ｲ\n##ｸ\n##ｼ\n##ｽ\n##ﾄ\n##ﾉ\n##ﾌ\n##ﾗ\n##ﾙ\n##ﾝ\n##ﾞ\n##ﾟ\n##￣\n##￥\n##👍\n##🔥\n##😂\n##😎\n"
  },
  {
    "path": "code/requirements.txt",
    "content": "glob\ntqdm\ntransformers==2.11.0\n\n"
  },
  {
    "path": "code/run.sh",
    "content": "#!/bin/bash\n#先根据数据、词频建词表\npython build_vocab.py\n\n{\n    (cd ./bert-base-count3/pretrain/ && CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python train_bert.py)\n    (cd ./bert-base-count3/finetuning/ && CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python multi_gpu_QA.py)\n    (cd ./bert-base-count3-len100/finetuning/ && CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python multi_gpu_QA.py)\n}&\n\n{\n    (cd ./bert-base-count5/pretrain/ && CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python train_bert.py)\n    (cd ./bert-base-count5/finetuning/ && CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python multi_gpu_QA.py)\n    (cd ./bert-base-count5-len32/finetuning/ && CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python multi_gpu_QA.py)\n}&\n\n{\n    (cd ./nezha-base-count3/pretrain/ && CUDA_VISIBLE_DEVICES=2 PYTHONUNBUFFERED=1 python train_nezha.py)\n    (cd ./nezha-base-count3/finetuning/ && CUDA_VISIBLE_DEVICES=2 PYTHONUNBUFFERED=1 python multi_gpu_QA.py)\n}&\n\n{\n    (cd ./nezha-base-count5/pretrain/ && CUDA_VISIBLE_DEVICES=3 PYTHONUNBUFFERED=1 python train_nezha.py)\n    (cd ./nezha-base-count5/finetuning/ && CUDA_VISIBLE_DEVICES=3 PYTHONUNBUFFERED=1 python multi_gpu_QA.py)\n}&\n\nwait # 等待子进程结束\n\n#CUDA_VISIBLE_DEVICES=0 保证推理只使用单卡\nCUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python serial_main_fusion_thread.py"
  },
  {
    "path": "code/serial_main_fusion_thread.py",
    "content": "import logging\nimport traceback\nfrom flask import Flask, request\nfrom utils import *\nopset_version = 11\nfrom os import environ\nfrom psutil import cpu_count\n# Constants from the performance optimization available in onnxruntime\n# It needs to be done before importing onnxruntime\nenviron[\"OMP_NUM_THREADS\"] = str(cpu_count(logical=True))\nenviron[\"OMP_WAIT_POLICY\"] = 'ACTIVE'\n# 此处示例，需要根据模型类型重写\ndef init_model(model_path, export_model_path, optimized_model_path, length=32):\n    model = torch.load(model_path).to(torch.device(\"cuda\"))\n    model.eval()\n\n    if length == 32:\n        data = [[[2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 20, 3,\n                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0]],\n                [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0]],\n                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0]]]\n\n    else:\n        data = [[[2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20,\n                  3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130,\n                  5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16,\n                  2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16, 36, 130, 5605, 458, 2, 16, 2874, 20, 3, 16,\n                  36, 130, 5605, 458]],\n                [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,\n                  1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n                  1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ]],\n                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]]\n\n\n    inputs = {\n        'input_ids': torch.tensor(data[0]).to(config.device),\n        'input_masks': torch.tensor(data[1]).to(config.device),\n        'segment_ids': torch.tensor(data[2]).to(config.device)\n    }\n\n    if True or not os.path.exists(export_model_path):\n        with torch.no_grad():\n            symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}\n            torch.onnx.export(model,  # model being run\n                              args=tuple(inputs.values()),  # model input (or a tuple for multiple inputs)\n                              f=export_model_path,  # where to save the model (can be a file or file-like object)\n                              opset_version=opset_version,  # the ONNX version to export the model to\n                              do_constant_folding=True,  # whether to execute constant folding for optimization\n                              input_names=['input_ids',  # the model's input names\n                                           'input_masks',\n                                           'segment_ids'],\n                              output_names=['predict'],  # the model's output names\n                              dynamic_axes={'input_ids': symbolic_names,  # 
variable length axes\n                                            'input_masks': symbolic_names,\n                                            'segment_ids': symbolic_names,\n                                            'predict': symbolic_names})\n            print(\"Model exported at \", export_model_path)\n\n    from onnxruntime_tools import optimizer\n    from onnxruntime_tools.transformers.onnx_model_bert import BertOptimizationOptions\n    opt_options = BertOptimizationOptions('bert')\n    opt_options.enable_embed_layer_norm = False\n\n    opt_model = optimizer.optimize_model(\n        export_model_path,\n        'bert',\n        num_heads=12,\n        hidden_size=768,\n        optimization_options=opt_options)\n    opt_model.save_model_to_file(optimized_model_path)\n\n    del model\n    torch.cuda.empty_cache()\n\n    import psutil\n    import onnxruntime\n\n    assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()\n\n    sess_options = onnxruntime.SessionOptions()\n    sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)\n    session = onnxruntime.InferenceSession(optimized_model_path, sess_options)\n    ort_inputs = {\n        'input_ids': [[0]*32],\n        'input_masks': [[0]*32],\n        'segment_ids': [[0]*32]\n        }\n    session.run(None, ort_inputs)#预先启动一下\n    return session\n\ndef infer(session,data_gen,query_A, query_B):\n    input_ids, input_masks, segment_ids = data_gen.generate((query_A, query_B))\n    ort_inputs = {\n    'input_ids': input_ids,\n    'input_masks': input_masks,\n    'segment_ids': segment_ids\n    }\n    y_pred = session.run(None, ort_inputs)\n    return y_pred[0]#结果放入队列\n\ndef softmax(x, axis=1):\n    # 计算每行的最大值\n    row_max = x.max(axis=axis)\n\n    # 每行元素都需要减去对应的最大值，否则求exp(x)会溢出，导致inf情况\n    row_max = row_max.reshape(-1, 1)\n    x = x - row_max\n\n    # 计算e的指数次幂\n    x_exp = np.exp(x)\n    x_sum = np.sum(x_exp, axis=axis, keepdims=True)\n    s = x_exp / x_sum\n    return s\n\n\nclass Config:\n    def __init__(self):\n        # 预训练模型路径\n        self.modelId = 2\n        self.model = \"NEZHA\"\n        self.Stratification = False\n\n        self.model_path = 'model0/'\n        self.num_class = 2\n        self.dropout = 0.2\n        self.MAX_LEN = 32\n        self.epoch = 5\n        self.learn_rate = 2e-5\n        self.normal_lr = 1e-4\n        self.batch_size = 1\n        self.k_fold = 5\n        self.seed = 42\n\n        self.device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n        self.focalloss = False\n        self.pgd = False\n        self.fgm = True\n\n# 允许使用类似Flask的别的服务方式\napp = Flask(__name__)\n\nimport time\nsumTime=[0,0,0,0,0,0]\nnum=0\n\n@app.route(\"/tccapi\", methods=['GET', 'POST'])\ndef tccapi():\n    global num\n    num+=1\n    data = request.get_data()\n    if (data == b\"exit\"):\n        print(\"received exit command, exit now\")\n        os._exit(0)\n    input_list = request.form.getlist(\"input\")\n    index_list = request.form.getlist(\"index\")\n\n    response_batch = {}\n    response_batch[\"results\"] = []\n\n    for i in range(len(index_list)):\n        index_str = index_list[i]\n        response = {}\n        try:\n            input_sample = input_list[i].strip()\n            elems = input_sample.strip().split(\"\\t\")\n            query_A = elems[0].strip()\n            query_B = elems[1].strip()\n\n            predict_res=[]\n            for i in runningModelIds:#串行\n                last=time.time()\n                
predict_res.append(infer(sessions[i],data_gens[i],query_A,query_B))\n                sumTime[i]+=time.time()-last\n            #y_pred = np.mean(predict_res, axis=0)\n            y_pred = (predict_res[0] * 0.2 + predict_res[1] * 0.2 + predict_res[2] * 0.15 + predict_res[3] * 0.15 + predict_res[4] * 0.15 + predict_res[5] * 0.15)\n            y_pred = softmax(np.array(y_pred))\n\n            response[\"predict\"] = float(y_pred[0][1])\n            response[\"index\"] = index_str\n            response[\"ok\"] = True\n        except Exception as e:\n            response[\"predict\"] = 0\n            response[\"index\"] = index_str\n            response[\"ok\"] = False\n            traceback.print_exc()\n        response_batch[\"results\"].append(response)\n    if num%5000==0:\n        print(f\"{num}次请求各个模型耗时：{sumTime}\")\n    return response_batch\n\n\n\nif __name__ == \"__main__\":\n    # 此处示例，需要根据模型类型重写加载部分\n    output_dir = \"./onnx\"\n    if not os.path.exists(output_dir):\n        os.makedirs(output_dir)\n\n    model_lists = [\"nezha-base-count3\", \"nezha-base-count5\", \"bert-base-count3\", \"bert-base-count3-len100\", \"bert-base-count5\", \"bert-base-count5-len32\"]\n    lens=[32,100,32,100,100,32]\n    configs=[]\n    sessions=[]\n    data_gens=[]\n    for path,length in zip(model_lists,lens):\n        config = Config()\n        export_model_path = os.path.join(output_dir, 'opset{}.onnx'.format(path))\n        optimized_model_path = os.path.join(output_dir, 'optimizer{}.onnx'.format(path))\n        config.model_path = './{}/finetuning/models/'.format(path)\n        config.MAX_LEN = length\n        session = init_model(config.model_path+\"bert_0.pth\", export_model_path, optimized_model_path)\n        sessions.append(session)\n        data_gens.append(data_generator(config))\n\n    runningModelIds=[0,1,2,3,4,5]#控制使用哪几个模型\n\n    log = logging.getLogger('werkzeug')#关闭冗长的http 200 log\n    log.disabled = True\n\n    app.run(host=\"127.0.0.1\", port=8080)\n\n"
  },
  {
    "path": "code/utils.py",
    "content": "import torch\nfrom transformers import BertTokenizer, AdamW, BertModel, BertPreTrainedModel, BertConfig, \\\n    get_linear_schedule_with_warmup, XLNetModel, XLNetTokenizer, XLNetConfig\nimport numpy as np\nimport os\nimport random\nfrom Config import *\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef fastTokenizer(a:str,b:str,maxLen,tk):\n    a,b=a.split(),b.split()\n    a,b=tk.convert_tokens_to_ids(a),tk.convert_tokens_to_ids(b)\n    maxLen-=3#空留给cls sep sep\n    assert maxLen>=0\n    len2=maxLen//2#若为奇数，更长部分给左边\n    len1=maxLen-len2\n    #一共就a超长与否，b超长与否，组合的四种情况\n    if len(a)+len(b)>maxLen:#需要截断\n        if len(a)<=len1 and len(b)>len2:\n            b=b[:maxLen-len(a)]\n        elif len(a)>len1 and len(b)<=len2:\n            a=a[:maxLen-len(b)]\n        elif len(a)>len1 and len(b)>len2:\n            a=a[:len1]\n            b=b[:len2]\n    input_ids=[tk.cls_token_id]+a+[tk.sep_token_id]+b+[tk.sep_token_id]\n    token_type_ids=[0]*(len(a)+2)+[1]*(len(b)+1)\n    return {'input_ids': input_ids, 'token_type_ids': token_type_ids}\n\n\nclass data_generator:\n    def __init__(self, config, shuffle=False):\n        self.batch_size = config.batch_size\n        self.max_length = config.MAX_LEN\n        self.shuffle = shuffle\n\n        vocab = 'vocab.txt' if os.path.exists(config.model_path + 'vocab.txt') else 'spiece.model'\n        self.tokenizer = TOKENIZERS[config.model].from_pretrained(config.model_path + vocab)\n\n\n    def generate(self, data):\n\n        input_ids, input_masks, segment_ids, labels = [], [], [], []\n\n        text = data[0]\n        text_pair = data[1]\n\n        text = fastTokenizer(text, text_pair, self.max_length, self.tokenizer)\n        input_ids.append(text['input_ids'])\n        segment_ids.append(text['token_type_ids'])\n        input_masks.append([1] * len(text['input_ids']))  # bs为1时无padding，全1\n\n        return input_ids, input_masks, segment_ids\n\n\nclass PGD():\n    def __init__(self, model):\n        self.model = model\n        self.emb_backup = {}\n        self.grad_backup = {}\n\n    def attack(self, epsilon=1., alpha=0.3, emb_name='word_embeddings', is_first_attack=False):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                if is_first_attack:\n                    self.emb_backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0 and not torch.isnan(norm):\n                    r_at = alpha * param.grad / norm\n                    param.data.add_(r_at)\n                    param.data = self.project(name, param.data, epsilon)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name这个参数要换成你模型中embedding的参数名\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.emb_backup\n                param.data = self.emb_backup[name]\n        self.emb_backup = {}\n\n    def project(self, param_name, param_data, epsilon):\n        r = param_data - self.emb_backup[param_name]\n        if torch.norm(r) > epsilon:\n            r = epsilon * r / torch.norm(r)\n        return self.emb_backup[param_name] + r\n\n    def backup_grad(self):\n        for name, param in self.model.named_parameters():\n            if param.requires_grad:\n                self.grad_backup[name] = param.grad.clone()\n\n    def restore_grad(self):\n        for name, 
param in self.model.named_parameters():\n            if param.requires_grad:\n                param.grad = self.grad_backup[name]\n\n\nclass FGM():\n    \"\"\"Fast Gradient Method (FGM) adversarial training: add a single gradient-direction\n    perturbation to the embedding layer (attack), run an extra forward/backward pass on\n    the perturbed embeddings, then call restore() before the optimizer step.\"\"\"\n\n    def __init__(self, model):\n        self.model = model\n        self.backup = {}\n\n    def attack(self, epsilon=0.5, emb_name='word_embeddings'):\n        # emb_name should match the name of the embedding parameters in your model\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                self.backup[name] = param.data.clone()\n                norm = torch.norm(param.grad)\n                if norm != 0:\n                    r_at = epsilon * param.grad / norm\n                    param.data.add_(r_at)\n\n    def restore(self, emb_name='word_embeddings'):\n        # emb_name should match the name of the embedding parameters in your model\n        for name, param in self.model.named_parameters():\n            if param.requires_grad and emb_name in name:\n                assert name in self.backup\n                param.data = self.backup[name]\n        self.backup = {}\n\n\n# Supports both multi-class and binary classification\nclass FocalLoss(nn.Module):\n    \"\"\"\n    An implementation of Focal Loss with label smoothing, as proposed in\n    'Focal Loss for Dense Object Detection' (https://arxiv.org/abs/1708.02002).\n    Focal_Loss = -1 * alpha * (1 - pt)^gamma * log(pt)\n    :param num_class: number of classes\n    :param alpha: (list or ndarray) per-class weighting factor for this criterion\n    :param gamma: (float, double) gamma > 0 reduces the relative loss for\n    well-classified examples (p > 0.5), putting more focus on hard,\n    misclassified examples\n    :param smooth: (float, double) label-smoothing value applied to the one-hot targets\n    :param size_average: (bool, optional) By default,\n    the losses are averaged over each loss element in the batch.\n    \"\"\"\n\n    def __init__(self, num_class, alpha=None, gamma=2,\n                 smooth=None, size_average=True):\n        super(FocalLoss, self).__init__()\n        self.num_class = num_class\n        self.alpha = alpha\n        self.gamma = gamma\n        self.smooth = smooth\n        self.size_average = size_average\n\n        if self.alpha is None:\n            self.alpha = torch.ones(self.num_class, 1)\n        elif isinstance(self.alpha, (list, np.ndarray)):\n            assert len(self.alpha) == self.num_class\n            self.alpha = torch.FloatTensor(alpha).view(self.num_class, 1)\n            self.alpha = self.alpha / self.alpha.sum()\n        else:\n            raise TypeError('Not support alpha type')\n        if self.smooth is not None:\n            if self.smooth < 0 or self.smooth > 1.0:\n                raise ValueError('smooth value should be in [0,1]')\n\n    def forward(self, input, target):\n        logit = F.softmax(input, dim=1)\n\n        if logit.dim() > 2:\n            # N,C,d1,d2 -> N,C,m (m=d1*d2*...)\n            logit = logit.view(logit.size(0), logit.size(1), -1)\n            logit = logit.permute(0, 2, 1).contiguous()\n            logit = logit.view(-1, logit.size(-1))\n        target = target.view(-1, 1)\n\n        epsilon = 1e-10\n        alpha = self.alpha\n        if alpha.device != input.device:\n            alpha = alpha.to(input.device)\n\n        idx = target.cpu().long()\n        one_hot_key = torch.FloatTensor(target.size(0), 
self.num_class).zero_()\n        one_hot_key = one_hot_key.scatter_(1, idx, 1)\n        if one_hot_key.device != logit.device:\n            one_hot_key = one_hot_key.to(logit.device)\n\n        if self.smooth:\n            one_hot_key = torch.clamp(\n                one_hot_key, self.smooth, 1.0 - self.smooth)\n        pt = (one_hot_key * logit).sum(1) + epsilon\n        logpt = pt.log()\n\n        gamma = self.gamma\n\n        # gather the per-sample alpha and flatten it to shape (N,) so it broadcasts\n        # elementwise with pt / logpt instead of producing an (N, 1, N) tensor\n        alpha = alpha[idx].view(-1)\n        loss = -1 * alpha * torch.pow((1 - pt), gamma) * logpt\n\n        if self.size_average:\n            loss = loss.mean()\n        else:\n            loss = loss.sum()\n        return loss\n\n\ndef f1_match(y_true, y_pred):\n    # precision and recall over binary 0/1 label arrays, combined into the F1 score\n    precision = sum(y_pred & y_true) / sum(y_pred)\n    recall = sum(y_pred & y_true) / sum(y_true)\n\n    return 2 * precision * recall / (precision + recall)"
  }
]